After You Improve the Prompt, How Far Back Do You Regenerate? — Designing a Budget-Bounded Backfill
A prompt improvement only helps future output — thousands of old artifacts stay on the previous generation. This piece covers a budget-bounded backfill: selection scoring, edit-detection hashes, a pre-replacement gate, and a resumable cursor, with working code.
The day I finally landed a good revision of the tagging prompt in my app-metadata pipeline, the new tags were noticeably more specific and matched search behavior better. I felt done. Then it hit me: roughly 4,200 artifacts were still sitting there, generated by the old prompt.
A prompt improvement only applies to what you generate next. Everything already produced stays on the previous generation. The obvious answer — "just regenerate everything" — is one I have tried twice as an indie developer, and both times it cost me more than it saved. This article is the design I ended up with instead: a backfill that runs inside a budget, protects human edits, and refuses to replace good output with worse.
Why full regeneration is not the right default
Start with the cost math. In my case, each item averages about 2,800 input tokens (images included) and around 400 output tokens. At gemini-3.5-flash-class pricing, regenerating all 4,200 items lands somewhere in the single-to-low-double-digit dollar range. Purely on price, "run it overnight" looks fine.
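The arithmetic behind that estimate can be sketched as follows. The token counts are the averages quoted above; the two per-million-token prices are placeholders invented for illustration, not published rates, so substitute the current price list before trusting the dollar figure.

```python
# Rough cost estimate for one full regeneration pass.
# ASSUMPTION: both prices below are hypothetical placeholders, not real rates.
ITEMS = 4_200
INPUT_TOKENS_PER_ITEM = 2_800   # average quoted in the article, images included
OUTPUT_TOKENS_PER_ITEM = 400    # average quoted in the article

HYPOTHETICAL_INPUT_PRICE_PER_M = 0.50   # USD per 1M input tokens (placeholder)
HYPOTHETICAL_OUTPUT_PRICE_PER_M = 3.00  # USD per 1M output tokens (placeholder)


def full_regen_cost(items: int, in_tok: int, out_tok: int,
                    in_price_per_m: float, out_price_per_m: float) -> float:
    """Return the estimated USD cost of regenerating every item once."""
    total_in = items * in_tok
    total_out = items * out_tok
    return total_in / 1e6 * in_price_per_m + total_out / 1e6 * out_price_per_m


total_input = ITEMS * INPUT_TOKENS_PER_ITEM    # 11,760,000 tokens
total_output = ITEMS * OUTPUT_TOKENS_PER_ITEM  # 1,680,000 tokens
estimate = full_regen_cost(
    ITEMS, INPUT_TOKENS_PER_ITEM, OUTPUT_TOKENS_PER_ITEM,
    HYPOTHETICAL_INPUT_PRICE_PER_M, HYPOTHETICAL_OUTPUT_PRICE_PER_M,
)
print(f"{total_input:,} in / {total_output:,} out -> about ${estimate:.2f}")
```

Only the token totals are grounded in the article's numbers; the dollar result moves in direct proportion to whatever prices you plug in.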
The money was never the real problem.
First, overwrites. Some of those artifacts had been corrected by hand — tags fixed after user reports, descriptions reworded for policy reasons. A blind full regeneration reverts every one of them. I caused exactly this incident once, and because I had no record of which items were hand-edited, the cleanup took far longer than the generation cost.
Second, regressions. A prompt revision raises the average, not every individual result. In my measurements, about 8% of regenerated outputs came back worse than the old output under rule-based checks.
Third, quota contention. Backfill traffic consumes the same project rate limits as your steady-state jobs. The night my backfill overlapped the nightly batch, the production side started seeing 429s.
| Concern | Full regeneration | Budget-bounded backfill |
| --- | --- | --- |
| Cost | All at once, uncapped | Capped by a daily budget |
| Human edits | Overwritten unconditionally | Skipped via edit detection |
| Quality regressions | Slip through (~8% measured) | Rejected by a pre-replacement gate |
| Impact on steady-state jobs | Competes for rate limits | Separate execution window and pace |
| Interruption | Unclear state if it dies midway | Cursor resumes next day |
Once I laid it out this way, the framing changed: backfill is not a one-off event. It is a small, permanent process that keeps ticking in the background.
Scoring what is worth regenerating
If you are not regenerating everything, you need a rule for what goes first. I combine four weighted factors:
- Exposure: how often the artifact is actually viewed or used. Improving unused output can wait.
- Generation gap: how many prompt versions behind the item is; two or more gets priority.
- Quality signals: reports, low ratings, and search mismatches attached to the old output.
- Cost: heavy inputs (high-resolution images) go later on a tie.
# backfill_scorer.py: rank regeneration candidates
# Problem it solves: deciding "what to redo first" by score, not gut feel
import sqlite3
from dataclasses import dataclass

WEIGHTS = {"exposure": 0.4, "gen_gap": 0.3, "quality": 0.2, "cost": 0.1}


@dataclass
class Candidate:
    item_id: str
    exposure_norm: float  # 0..1, normalized 30-day usage
    gen_gap: int          # current prompt version minus version at generation
    quality_flags: int    # count of reports / low ratings
    input_tokens: int


def score(c: Candidate, max_gap: int = 5) -> float:
    gap = min(c.gen_gap, max_gap) / max_gap
    quality = min(c.quality_flags, 3) / 3
    cost_penalty = min(c.input_tokens / 8000, 1.0)  # heavier input, lower priority
    return (
        WEIGHTS["exposure"] * c.exposure_norm
        + WEIGHTS["gen_gap"] * gap
        + WEIGHTS["quality"] * quality
        - WEIGHTS["cost"] * cost_penalty
    )


def top_candidates(db: str, limit: int = 500) -> list[tuple[str, float]]:
    conn = sqlite3.connect(db)
    rows = conn.execute(
        """SELECT item_id, exposure_norm, gen_gap, quality_flags, input_tokens
           FROM artifacts
           WHERE gen_gap >= 1 AND human_edited = 0"""
    ).fetchall()
    conn.close()
    scored = [(r[0], score(Candidate(*r))) for r in rows]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:limit]
Do not agonize over the weights up front. Putting exposure first is the practical choice — only improvements to artifacts people actually see move perceived quality.
I also set a floor: candidates scoring below 0.25 simply do not get regenerated. If you treat "finish the whole backlog" as the goal, budget keeps flowing into regenerations nobody will notice. A backfill is not something you finish; it is something that goes quiet when nothing above the threshold remains, and wakes up again when the next prompt revision widens the generation gap.
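One way to apply that floor is to filter the ranked list before it reaches the runner. This is a minimal sketch, assuming the (item_id, score) tuples that top_candidates returns; the names SCORE_FLOOR and apply_floor are illustrative and not part of the original scripts.

```python
# Drop candidates below the score floor before handing them to the runner.
# ASSUMPTION: SCORE_FLOOR and apply_floor are illustrative names.
SCORE_FLOOR = 0.25


def apply_floor(scored: list[tuple[str, float]],
                floor: float = SCORE_FLOOR) -> list[str]:
    """Keep item ids scoring at or above the floor, preserving rank order."""
    return [item_id for item_id, s in scored if s >= floor]


ranked = [("a", 0.71), ("b", 0.25), ("c", 0.24), ("d", 0.05)]
queue = apply_floor(ranked)
if not queue:
    print("nothing above the floor; backfill stays quiet")
```

An empty queue is the "gone quiet" state: the run exits without spending anything, and nothing needs to be marked as finished.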
The most painful part of my full-regeneration incident was losing hand-made corrections. The fix is almost embarrassingly simple: store a hash of the content at the moment you generate it. Before regenerating an item, compare the current content against the stored hash. If they differ, a person touched it — skip and record.
# edit_guard.py: detect human edits
# Problem it solves: backfill reverting hand-corrected artifacts
import hashlib


def content_hash(text: str) -> str:
    normalized = " ".join(text.split())  # whitespace drift must not trip detection
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_human_edited(current_text: str, stored_hash: str | None) -> bool:
    if stored_hash is None:
        # Legacy rows with no hash are treated as edited (fail safe)
        return True
    return content_hash(current_text) != stored_hash


def save_generated(conn, item_id: str, text: str, prompt_ver: int) -> None:
    conn.execute(
        """UPDATE artifacts
           SET body = ?, gen_hash = ?, prompt_version = ?, human_edited = 0
           WHERE item_id = ?""",
        (text, content_hash(text), prompt_ver, item_id),
    )
Two details matter. Normalize before hashing, so whitespace and line-ending drift do not masquerade as human edits. And treat legacy rows without a hash as edited — when you cannot tell, default to protecting the item. After rollout, about 6% of my candidates were skipped by edit detection. That 6% is precisely what my earlier full regenerations had been destroying.
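To make the normalization concrete, here is a small self-contained check. It repeats content_hash from edit_guard.py so it runs on its own; the sample strings are invented.

```python
import hashlib


def content_hash(text: str) -> str:
    # Same normalization as edit_guard.py: collapse all whitespace runs.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


generated = "puzzle, offline\n  casual game"
stored = content_hash(generated)

# Line-ending and spacing drift hashes to the same value: not an edit.
drifted = "puzzle, offline\r\ncasual   game  "
whitespace_only = content_hash(drifted) == stored

# A reworded description changes the hash: treated as a human edit.
reworded = "puzzle, offline casual game for kids"
real_edit = content_hash(reworded) != stored

print(whitespace_only, real_edit)
```

The normalization only collapses whitespace, so a change in case or punctuation still counts as an edit. That errs toward protecting the item, in line with the fail-safe default for legacy rows.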
Running inside a daily budget, with a cursor that resumes
I cap backfill execution with a daily budget, in dollars or tokens; either works. When the cap is hit, the run stops, and the next day it continues from where it left off. The cursor is what makes "continues" true.
# backfill_runner.py: resumable runner with a spend cap
# Problem it solves: execution control that survives crashes and never overspends
import json
import os
import time

from google import genai

DAILY_TOKEN_BUDGET = 600_000  # daily cap (input + output tokens)
STATE_PATH = "backfill_state.json"


def load_state() -> dict:
    if os.path.exists(STATE_PATH):
        with open(STATE_PATH) as f:
            return json.load(f)
    return {"date": "", "spent_tokens": 0, "done_ids": []}


def run_batch(candidates: list[str], build_request, apply_result) -> None:
    state = load_state()
    today = time.strftime("%Y-%m-%d")
    if state["date"] != today:
        state = {"date": today, "spent_tokens": 0, "done_ids": state["done_ids"]}
    client = genai.Client()
    for item_id in candidates:
        if item_id in state["done_ids"]:
            continue  # cursor: skip what is already done
        if state["spent_tokens"] >= DAILY_TOKEN_BUDGET:
            break  # budget reached: stop for today
        req = build_request(item_id)
        resp = client.models.generate_content(
            model="gemini-flash-latest",
            contents=req["contents"],
            config=req["config"],
        )
        usage = resp.usage_metadata
        state["spent_tokens"] += (usage.prompt_token_count or 0) + (
            usage.candidates_token_count or 0
        )
        apply_result(item_id, resp.text)
        state["done_ids"].append(item_id)
        with open(STATE_PATH, "w") as f:
            json.dump(state, f)  # persist per item, so a crash loses nothing
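One caveat on persisting per item: writing JSON straight into the live state file could leave a truncated file if the process dies mid-write. A possible hardening, which is not part of the original runner, is to write to a temporary file in the same directory and swap it in with os.replace. The helper name save_state_atomic is hypothetical.

```python
import json
import os
import tempfile


def save_state_atomic(state: dict, path: str) -> None:
    """Write state to a temp file in the same directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling this in place of the open/json.dump pair keeps the per-item cursor behavior while making a half-written state file much less likely.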
Decrement the budget with measured usage_metadata, not estimates. Pre-run estimates drift badly — image resolution differences alone can double actual token counts. When I managed by estimate, month-end billing kept surprising me; switching to measured consumption brought the error down to a few percent.
If the backfill is not urgent, pushing it through the Batch API lowers the unit price and stops it from competing with daytime traffic for rate limits. My setup settled into "collect today's top candidates, submit one batch at night." Separating the execution window from steady-state jobs eliminated the simultaneous-429 evenings entirely.
Comparing old and new before replacing
Regenerated output is worse than the old output more often than intuition suggests. I run a mechanical check immediately before replacement.
# replace_gate.py: pre-replacement comparison gate
# Problem it solves: "newer must be better" quietly regressing quality
BANNED = {"best ever", "ultimate", "No.1"}


def gate(old: dict, new: dict) -> str:
    """Returns 'replace' / 'keep' / 'hold'."""
    # 1) structural checks: tag count and description length within bounds
    if not (3 <= len(new["tags"]) <= 8):
        return "keep"
    if not (40 <= len(new["description"]) <= 160):
        return "keep"
    # 2) banned phrasing
    if any(b in new["description"] for b in BANNED):
        return "keep"
    # 3) clearly thinner information: hold for human review
    if len(set(new["tags"])) < len(set(old["tags"])) - 2:
        return "hold"
    return "replace"


def apply_with_archive(conn, item_id: str, old_body: str, new_body: str) -> None:
    # Archive the old value, then replace atomically (keeps rollback open)
    conn.execute(
        "INSERT INTO artifact_history (item_id, body) VALUES (?, ?)",
        (item_id, old_body),
    )
    conn.execute(
        "UPDATE artifacts SET body = ? WHERE item_id = ?", (new_body, item_id)
    )
    conn.commit()
The verdict is three-valued on purpose: keep when the new output is clearly worse, hold when it is ambiguous (a human looks at the queue), replace when it passes. And every replacement archives the old value first. When a regression slips through the gate, history lets you restore a single item instead of panicking.
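Restoring a single item from that history might look like the sketch below. It assumes artifact_history has an autoincrementing id column so the most recent archive can be found; the article only shows item_id and body, so that column, the demo schema, and the function name restore_latest are all assumptions.

```python
import sqlite3


def restore_latest(conn: sqlite3.Connection, item_id: str) -> bool:
    """Put the most recently archived body back. Returns False if none exists."""
    row = conn.execute(
        "SELECT id, body FROM artifact_history "
        "WHERE item_id = ? ORDER BY id DESC LIMIT 1",
        (item_id,),
    ).fetchone()
    if row is None:
        return False
    hist_id, old_body = row
    with conn:  # commit both statements together, roll back on error
        conn.execute(
            "UPDATE artifacts SET body = ? WHERE item_id = ?",
            (old_body, item_id),
        )
        conn.execute("DELETE FROM artifact_history WHERE id = ?", (hist_id,))
    return True


# Tiny in-memory demo with an assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifacts (item_id TEXT PRIMARY KEY, body TEXT)")
conn.execute(
    "CREATE TABLE artifact_history ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, item_id TEXT, body TEXT)"
)
conn.execute("INSERT INTO artifacts VALUES ('app-1', 'new body')")
conn.execute(
    "INSERT INTO artifact_history (item_id, body) VALUES ('app-1', 'old body')"
)
conn.commit()

restored = restore_latest(conn, "app-1")
current = conn.execute(
    "SELECT body FROM artifacts WHERE item_id = 'app-1'"
).fetchone()[0]
```

This sketch leaves gen_hash untouched, so the restored body no longer matches its stored hash and edit detection will skip the item on later runs. After a manual rollback that is probably the behavior you want.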
The hold rate is worth watching as a metric. Mine sits at 2–3% normally, but after one revision it jumped to 15%. The cause was a flaw in the prompt revision itself; I paused the backfill and redid the revision. The gate's hold rate turns out to be a lagging indicator of prompt-revision quality.
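Tracking that metric takes only a counter over gate verdicts. In this minimal sketch, the 10% pause threshold and the minimum sample size are values picked for illustration, sitting between the normal 2 to 3% and the 15% spike described above.

```python
from collections import Counter

# ASSUMPTION: threshold and sample size are illustrative; tune them to your data.
HOLD_RATE_PAUSE_THRESHOLD = 0.10
MIN_SAMPLE = 50


def hold_rate(verdicts: list[str]) -> float:
    """Fraction of gate verdicts that were 'hold'."""
    if not verdicts:
        return 0.0
    return Counter(verdicts)["hold"] / len(verdicts)


def should_pause(verdicts: list[str]) -> bool:
    """Pause the backfill once enough items were judged and holds spike."""
    if len(verdicts) < MIN_SAMPLE:
        return False
    return hold_rate(verdicts) > HOLD_RATE_PAUSE_THRESHOLD


normal_night = ["replace"] * 90 + ["keep"] * 8 + ["hold"] * 2    # 2% hold
bad_revision = ["replace"] * 70 + ["keep"] * 15 + ["hold"] * 15  # 15% hold
print(should_pause(normal_night), should_pause(bad_revision))
```

The minimum sample guards against pausing on the first handful of items, where one hold can swing the rate wildly.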
What months of operation taught me
A few things you will not find in official documentation.
Weekly score recomputation is enough. Recomputing before every run reshuffles the ranking and fights the cursor's done-list — I nearly processed the same items twice. Freeze the candidate list weekly; do not move the ranking mid-week.
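A weekly freeze can be as simple as keying the saved candidate list by ISO week and recomputing only when the key changes. This is a sketch, assuming a JSON file kept next to the runner state; the file layout and the names week_key, frozen_candidates, and compute are hypothetical.

```python
import datetime
import json
import os


def week_key(day: datetime.date) -> str:
    """ISO year and week number, formatted like '2025-W03'."""
    iso = day.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def frozen_candidates(path: str, day: datetime.date, compute) -> list[str]:
    """Return this week's frozen list, calling compute() once per ISO week."""
    key = week_key(day)
    if os.path.exists(path):
        with open(path) as f:
            saved = json.load(f)
        if saved.get("week") == key:
            return saved["ids"]  # same week: the ranking stays put
    ids = compute()  # new week: rescore and freeze
    with open(path, "w") as f:
        json.dump({"week": key, "ids": ids}, f)
    return ids
```

Here compute would wrap the scorer, for example a function that returns the item ids from top_candidates. Because the list only changes at week boundaries, the runner's done-list and the ranking cannot drift apart mid-week.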
Design for a backfill that never finishes. Making "clear the backlog" a KPI tempts you to lower the threshold just to consume it. Let it go quiet when nothing scores above the floor; the next prompt revision reopens the gap and it resumes on its own. Since adopting that rhythm, wasted spend has essentially disappeared.
Install the edit-detection hash before you need the backfill. Legacy rows without hashes all fall to the safe side (skipped), so the later you add hashing, the smaller your backfill's reach. If you run a generation pipeline today, ship the save_generated hash write ahead of any backfill planning. Your future self will be grateful.
A prompt improvement only becomes an asset-wide improvement once it reaches backward. Start with one line: add the edit-detection hash to today's pipeline. If you are wrestling with the same problem, I hope these notes save you a detour.