GEMINI LAB
Advanced / 2026-07-02

After You Improve the Prompt, How Far Back Do You Regenerate? — Designing a Budget-Bounded Backfill

A prompt improvement only helps future output — thousands of old artifacts stay on the previous generation. This piece covers a budget-bounded backfill: selection scoring, edit-detection hashes, a pre-replacement gate, and a resumable cursor, with working code.

Tags: gemini-api, backfill, operations, cost-optimization, pipeline

The day I finally landed a good revision of the tagging prompt in my app-metadata pipeline, the new tags were noticeably more specific and matched search behavior better. I felt done. Then it hit me: roughly 4,200 artifacts were still sitting there, generated by the old prompt.

A prompt improvement only applies to what you generate next. Everything already produced stays on the previous generation. The obvious answer — "just regenerate everything" — is one I have tried twice as an indie developer, and both times it cost me more than it saved. This article is the design I ended up with instead: a backfill that runs inside a budget, protects human edits, and refuses to replace good output with worse.

Why full regeneration is not the right default

Start with the cost math. In my case, each item averages about 2,800 input tokens (images included) and around 400 output tokens. At gemini-3.5-flash-class pricing, regenerating all 4,200 items lands somewhere in the single-to-low-double-digit dollar range. Purely on price, "run it overnight" looks fine.
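
To make the arithmetic concrete, here is a back-of-the-envelope sketch. The item count and token averages are the ones above; the two per-million-token prices are placeholders assumed purely for illustration, not published rates, so substitute the current figures for whichever model you run.

```python
# Rough cost of regenerating everything in one pass.
# ASSUMPTION: both prices are illustrative placeholders, not real rates.
ITEMS = 4_200
INPUT_TOKENS_PER_ITEM = 2_800    # average, images included
OUTPUT_TOKENS_PER_ITEM = 400     # average

ASSUMED_USD_PER_M_INPUT = 0.50   # hypothetical price per 1M input tokens
ASSUMED_USD_PER_M_OUTPUT = 3.00  # hypothetical price per 1M output tokens


def full_regen_cost(items, in_tok, out_tok, usd_per_m_in, usd_per_m_out):
    """Return (total input tokens, total output tokens, estimated USD)."""
    total_in = items * in_tok
    total_out = items * out_tok
    cost = total_in / 1e6 * usd_per_m_in + total_out / 1e6 * usd_per_m_out
    return total_in, total_out, cost


total_in, total_out, cost = full_regen_cost(
    ITEMS,
    INPUT_TOKENS_PER_ITEM,
    OUTPUT_TOKENS_PER_ITEM,
    ASSUMED_USD_PER_M_INPUT,
    ASSUMED_USD_PER_M_OUTPUT,
)
print(f"{total_in:,} input tokens, {total_out:,} output tokens, ~${cost:.2f}")
```

The token totals (11.76M in, 1.68M out) follow directly from the averages; only the dollar figure depends on the assumed prices.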

The money was never the real problem.

First, overwrites. Some of those artifacts had been corrected by hand — tags fixed after user reports, descriptions reworded for policy reasons. A blind full regeneration reverts every one of them. I caused exactly this incident once, and because I had no record of which items were hand-edited, the cleanup took far longer than the generation cost.
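
One way to keep that record is to store a hash of the generated output at write time and compare it with the live content before regenerating anything. The sketch below assumes an artifact is a small dict of tags plus a description; the field names and the `content_hash` and `is_human_edited` helpers are hypothetical, not taken from my pipeline.

```python
import hashlib
import json


def content_hash(artifact: dict) -> str:
    """Stable SHA-256 over the fields the generator writes."""
    # sort_keys and fixed separators make the serialization deterministic,
    # so key order alone never changes the hash.
    payload = json.dumps(
        artifact, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_human_edited(current: dict, hash_at_generation: str) -> bool:
    """True when the live content no longer matches what the model wrote."""
    return content_hash(current) != hash_at_generation


# At generation time: persist the hash next to the artifact.
generated = {"tags": ["photo-editor", "filters"], "description": "Edit photos."}
stored_hash = content_hash(generated)

# Later: same content (different key order) versus a hand-corrected tag.
untouched = {"description": "Edit photos.", "tags": ["photo-editor", "filters"]}
edited = {"tags": ["photo-editor", "raw"], "description": "Edit photos."}

print(is_human_edited(untouched, stored_hash))  # False
print(is_human_edited(edited, stored_hash))     # True
```

A flag derived this way could back a column such as `human_edited`, so the backfill never has to guess which items were touched by hand.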

Second, regressions. A prompt revision raises the average, not every individual result. In my measurements, about 8% of regenerated outputs came back worse than the old output under rule-based checks.
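
A gate in front of the replacement is one way to catch those cases: score the old and new output with the same rule-based checks and keep the old one unless the new one is at least as good. The three rules and the `rule_score` helper below are hypothetical examples of such checks, not the ones behind the 8% figure.

```python
def rule_score(tags: list[str]) -> int:
    """Count how many simple (hypothetical) rules a tag list passes."""
    checks = [
        3 <= len(tags) <= 10,                           # sensible number of tags
        len({t.lower() for t in tags}) == len(tags),    # no case-insensitive duplicates
        all(t.strip() and len(t) <= 30 for t in tags),  # no empty or overlong tags
    ]
    return sum(checks)


def should_replace(old_tags: list[str], new_tags: list[str]) -> bool:
    """Accept the new output only if it does not score lower than the old."""
    return rule_score(new_tags) >= rule_score(old_tags)


old = ["camera", "photo", "filters"]
new_worse = ["camera", "Camera"]                  # too few tags, and a duplicate
new_better = ["photo-editor", "raw-filters", "batch-export"]

print(should_replace(old, new_worse))   # False
print(should_replace(old, new_better))  # True
```

Ties go to the new output here on the assumption that the revised prompt is better on average; flipping `>=` to `>` makes the gate stricter.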

Third, quota contention. Backfill traffic consumes the same project rate limits as your steady-state jobs. The night my backfill overlapped the nightly batch, the production side started seeing 429s.
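
A simple guard against that overlap is to give the backfill its own time window and a request pace well below the project limit. This is only a sketch: the window hours and the requests-per-minute figure are hypothetical values, and `in_window` and `min_interval_seconds` are helper names invented for the example.

```python
from datetime import time

# ASSUMPTION: quiet hours that do not overlap the nightly batch.
BACKFILL_WINDOW = (time(2, 0), time(5, 0))
# ASSUMPTION: a pace chosen to leave headroom under the project rate limit.
MAX_REQUESTS_PER_MINUTE = 20


def in_window(now: time, window: tuple[time, time] = BACKFILL_WINDOW) -> bool:
    """True if `now` falls inside the window; handles windows crossing midnight."""
    start, end = window
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def min_interval_seconds(rpm: int = MAX_REQUESTS_PER_MINUTE) -> float:
    """Minimum spacing between requests needed to stay at or under `rpm`."""
    return 60.0 / rpm


print(in_window(time(3, 0)))     # True
print(in_window(time(1, 0)))     # False
print(min_interval_seconds())    # 3.0
```

The backfill loop would check `in_window` before each request and sleep for `min_interval_seconds()` between calls.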

Concern | Full regeneration | Budget-bounded backfill
Cost | All at once, uncapped | Capped by a daily budget
Human edits | Overwritten unconditionally | Skipped via edit detection
Quality regressions | Slip through (~8% measured) | Rejected by a pre-replacement gate
Impact on steady-state jobs | Competes for rate limits | Separate execution window and pace
Interruption | Unclear state if it dies midway | Cursor resumes next day

Once I laid it out this way, the framing changed: backfill is not a one-off event. It is a small, permanent process that keeps ticking in the background.
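
As a rough sketch of that ticking process: each day the job resumes from a saved cursor, works through a ranked queue, and stops when the daily budget is spent. Everything here is assumed for illustration: the JSON cursor file, the `estimate_cost` and `regenerate` callables, and the budget figure are hypothetical, and the demo uses stubs where a real run would call the model. It also assumes the queue is frozen between runs, since an index cursor is meaningless if the order changes.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def load_cursor(path: Path) -> int:
    """Index of the next queue entry to process; 0 if no cursor was saved."""
    if path.exists():
        return json.loads(path.read_text())["next_index"]
    return 0


def save_cursor(path: Path, next_index: int) -> None:
    path.write_text(json.dumps({"next_index": next_index}))


def run_daily_backfill(
    queue: Sequence[str],
    estimate_cost: Callable[[str], float],  # hypothetical: USD estimate per item
    regenerate: Callable[[str], None],      # hypothetical: calls the model, writes output
    daily_budget_usd: float,
    cursor_path: Path,
) -> float:
    """Process the queue from the saved cursor until the budget would be exceeded."""
    spent = 0.0
    i = load_cursor(cursor_path)
    while i < len(queue):
        cost = estimate_cost(queue[i])
        if spent + cost > daily_budget_usd:
            break  # stop here; the cursor still points at this item
        regenerate(queue[i])
        spent += cost
        i += 1
        save_cursor(cursor_path, i)  # saved per item, so a crash redoes at most one
    return spent


# Demo with stubs: five items at $1 each, $2 per day.
queue = ["a", "b", "c", "d", "e"]
processed: list[str] = []
per_day: list[int] = []
with tempfile.TemporaryDirectory() as tmp:
    cursor = Path(tmp) / "cursor.json"
    for day in range(3):
        run_daily_backfill(queue, lambda _id: 1.0, processed.append, 2.0, cursor)
        per_day.append(len(processed))
        print(f"day {day + 1}: processed so far = {processed}")
```

One caveat of this simple form: an item whose estimated cost exceeds the whole daily budget would block the queue, so a real version needs a rule for skipping or flagging such items.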

Scoring what is worth regenerating

If you are not regenerating everything, you need a rule for what goes first. I combine four weighted factors:

  1. Exposure — how often the artifact is actually viewed or used. Improving unused output can wait
  2. Generation gap — how many prompt versions behind the item is; two or more gets priority
  3. Quality signals — reports, low ratings, search mismatches attached to the old output
  4. Cost — heavy inputs (high-resolution images) go later on a tie

# backfill_scorer.py: rank regeneration candidates
# Problem it solves: deciding "what to redo first" by score, not gut feel
import sqlite3
from contextlib import closing
from dataclasses import dataclass

# Relative importance of each factor; cost is subtracted as a penalty.
WEIGHTS = {"exposure": 0.4, "gen_gap": 0.3, "quality": 0.2, "cost": 0.1}

@dataclass
class Candidate:
    item_id: str
    exposure_norm: float   # 0..1, normalized 30-day usage
    gen_gap: int           # current prompt version minus version at generation
    quality_flags: int     # count of reports / low ratings
    input_tokens: int

def score(c: Candidate, max_gap: int = 5) -> float:
    """Priority between -0.1 and 0.9; higher means regenerate sooner."""
    gap = min(c.gen_gap, max_gap) / max_gap          # cap the gap, scale to 0..1
    quality = min(c.quality_flags, 3) / 3            # saturate at three flags
    cost_penalty = min(c.input_tokens / 8000, 1.0)   # heavier input, lower priority
    return (
        WEIGHTS["exposure"] * c.exposure_norm
        + WEIGHTS["gen_gap"] * gap
        + WEIGHTS["quality"] * quality
        - WEIGHTS["cost"] * cost_penalty
    )

def top_candidates(db: str, limit: int = 500) -> list[tuple[str, float]]:
    """Return up to `limit` (item_id, score) pairs, highest score first."""
    # closing() releases the connection even if the query raises.
    with closing(sqlite3.connect(db)) as conn:
        rows = conn.execute(
            """SELECT item_id, exposure_norm, gen_gap, quality_flags, input_tokens
               FROM artifacts WHERE gen_gap >= 1 AND human_edited = 0"""
        ).fetchall()
    # Column order in the SELECT matches the Candidate field order.
    scored = [(r[0], score(Candidate(*r))) for r in rows]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:limit]

Do not agonize over the weights up front. Putting exposure first is the practical choice — only improvements to artifacts people actually see move perceived quality.

I also set a floor: candidates scoring below 0.25 simply do not get regenerated. If you treat "finish the whole backlog" as the goal, budget keeps flowing into regenerations nobody will notice. A backfill is not something you finish; it is something that goes quiet when nothing above the threshold remains, and wakes up again when the next prompt revision widens the generation gap.
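
The floor itself is a one-line filter on the ranked list. A minimal sketch, assuming the list has the same `(item_id, score)` shape that `top_candidates` returns; `apply_floor` and the sample scores are hypothetical.

```python
SCORE_FLOOR = 0.25  # candidates below this are left alone


def apply_floor(
    scored: list[tuple[str, float]], floor: float = SCORE_FLOOR
) -> list[tuple[str, float]]:
    """Keep only candidates at or above the floor, preserving rank order."""
    return [(item_id, s) for item_id, s in scored if s >= floor]


ranked = [("app-17", 0.71), ("app-03", 0.25), ("app-42", 0.19)]
queue = apply_floor(ranked)

print(queue)  # [('app-17', 0.71), ('app-03', 0.25)]
print("backfill idle" if not queue else f"{len(queue)} items queued")
```

An empty queue is the "quiet" state: the job still runs on schedule, finds nothing above the floor, and exits.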
