Advanced · 2026-07-02

After You Improve the Prompt, How Far Back Do You Regenerate? — Designing a Budget-Bounded Backfill

A prompt improvement only helps future output — thousands of old artifacts stay on the previous generation. This piece covers a budget-bounded backfill: selection scoring, edit-detection hashes, a pre-replacement gate, and a resumable cursor, with working code.

The day I finally landed a good revision of the tagging prompt in my app-metadata pipeline, the new tags were noticeably more specific and matched search behavior better. I felt done. Then it hit me: roughly 4,200 artifacts were still sitting there, generated by the old prompt.

A prompt improvement only applies to what you generate next. Everything already produced stays on the previous generation. The obvious answer — "just regenerate everything" — is one I have tried twice as an indie developer, and both times it cost me more than it saved. This article is the design I ended up with instead: a backfill that runs inside a budget, protects human edits, and refuses to replace good output with worse.

Why full regeneration is not the right default

Start with the cost math. In my case, each item averages about 2,800 input tokens (images included) and around 400 output tokens. At gemini-3.5-flash-class pricing, regenerating all 4,200 items lands somewhere in the single-to-low-double-digit dollar range. Purely on price, "run it overnight" looks fine.
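The arithmetic behind that range is easy to reproduce. The sketch below uses the item count and token averages quoted above; the two per-million-token prices are hypothetical placeholders chosen only to illustrate the calculation, not published Gemini rates, so substitute whatever your billing page shows.

```python
# Back-of-envelope cost of regenerating everything.
# ITEMS and the token averages come from the article; the prices passed in
# below are HYPOTHETICAL placeholders, not real Gemini pricing.
ITEMS = 4_200
AVG_INPUT_TOKENS = 2_800
AVG_OUTPUT_TOKENS = 400


def full_regen_cost(price_in_per_m: float, price_out_per_m: float) -> dict:
    """Return total tokens and dollar cost for regenerating every item once."""
    input_tokens = ITEMS * AVG_INPUT_TOKENS
    output_tokens = ITEMS * AVG_OUTPUT_TOKENS
    dollars = (
        input_tokens / 1_000_000 * price_in_per_m
        + output_tokens / 1_000_000 * price_out_per_m
    )
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "dollars": dollars,
    }


# Placeholder prices in dollars per million tokens (assumption, see above).
estimate = full_regen_cost(price_in_per_m=0.50, price_out_per_m=3.00)
print(estimate)
```

Whatever the real prices are, the token totals do not change: about 11.8 million input tokens and 1.7 million output tokens for one full pass.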

The money was never the real problem.

First, overwrites. Some of those artifacts had been corrected by hand — tags fixed after user reports, descriptions reworded for policy reasons. A blind full regeneration reverts every one of them. I caused exactly this incident once, and because I had no record of which items were hand-edited, the cleanup took far longer than the generation cost.

Second, regressions. A prompt revision raises the average, not every individual result. In my measurements, about 8% of regenerated outputs came back worse than the old output under rule-based checks.

Third, quota contention. Backfill traffic consumes the same project rate limits as your steady-state jobs. The night my backfill overlapped the nightly batch, the production side started seeing 429s.

Concern	Full regeneration	Budget-bounded backfill
Cost	All at once, uncapped	Capped by a daily budget
Human edits	Overwritten unconditionally	Skipped via edit detection
Quality regressions	Slip through (~8% measured)	Rejected by a pre-replacement gate
Impact on steady-state jobs	Competes for rate limits	Separate execution window and pace
Interruption	Unclear state if it dies midway	Cursor resumes next day

Once I laid it out this way, the framing changed: backfill is not a one-off event. It is a small, permanent process that keeps ticking in the background.

Scoring what is worth regenerating

If you are not regenerating everything, you need a rule for what goes first. I combine four weighted factors:

Exposure — how often the artifact is actually viewed or used. Improving unused output can wait
Generation gap — how many prompt versions behind the item is; two or more gets priority
Quality signals — reports, low ratings, search mismatches attached to the old output
Cost — heavy inputs (high-resolution images) go later on a tie

# backfill_scorer.py — rank regeneration candidates
# Problem it solves: deciding "what to redo first" by score, not gut feel
import sqlite3
from dataclasses import dataclass
 
WEIGHTS = {"exposure": 0.4, "gen_gap": 0.3, "quality": 0.2, "cost": 0.1}
 
@dataclass
class Candidate:
    item_id: str
    exposure_norm: float   # 0..1, normalized 30-day usage
    gen_gap: int           # current prompt version minus version at generation
    quality_flags: int     # count of reports / low ratings
    input_tokens: int
 
def score(c: Candidate, max_gap: int = 5) -> float:
    gap = min(c.gen_gap, max_gap) / max_gap
    quality = min(c.quality_flags, 3) / 3
    cost_penalty = min(c.input_tokens / 8000, 1.0)  # heavier input, lower priority
    return (
        WEIGHTS["exposure"] * c.exposure_norm
        + WEIGHTS["gen_gap"] * gap
        + WEIGHTS["quality"] * quality
        - WEIGHTS["cost"] * cost_penalty
    )
 
def top_candidates(db: str, limit: int = 500) -> list[tuple[str, float]]:
    conn = sqlite3.connect(db)
    rows = conn.execute(
        """SELECT item_id, exposure_norm, gen_gap, quality_flags, input_tokens
           FROM artifacts WHERE gen_gap >= 1 AND human_edited = 0"""
    ).fetchall()
    conn.close()
    scored = [(r[0], score(Candidate(*r))) for r in rows]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:limit]

Do not agonize over the weights up front. Putting exposure first is the practical choice — only improvements to artifacts people actually see move perceived quality.

I also set a floor: candidates scoring below 0.25 simply do not get regenerated. If you treat "finish the whole backlog" as the goal, budget keeps flowing into regenerations nobody will notice. A backfill is not something you finish; it is something that goes quiet when nothing above the threshold remains, and wakes up again when the next prompt revision widens the generation gap.
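The floor is not part of top_candidates as listed, so here is a minimal sketch of how it could be applied to the scorer's output. The names SCORE_FLOOR and above_floor are mine, introduced for illustration; only the 0.25 value comes from the text.

```python
SCORE_FLOOR = 0.25  # threshold from the article; the constant name is hypothetical


def above_floor(
    scored: list[tuple[str, float]], floor: float = SCORE_FLOOR
) -> list[tuple[str, float]]:
    """Keep candidates scoring at or above the floor, highest score first."""
    kept = [(item_id, s) for item_id, s in scored if s >= floor]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Example: two of four candidates clear the floor.
scored = [("a", 0.61), ("b", 0.24), ("c", 0.25), ("d", 0.02)]
queue = above_floor(scored)

# An empty queue is the "gone quiet" state: nothing is worth regenerating today.
quiet = above_floor([("x", 0.10), ("y", 0.19)])
```

An empty result is a valid outcome, not an error. The runner simply has nothing to do until scores rise again.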

Protecting human edits with a write-time hash

The most painful part of my full-regeneration incident was losing hand-made corrections. The fix is almost embarrassingly simple: store a hash of the content at the moment you generate it. Before regenerating an item, compare the current content against the stored hash. If they differ, a person touched it — skip and record.

# edit_guard.py — detect human edits
# Problem it solves: backfill reverting hand-corrected artifacts
import hashlib
 
def content_hash(text: str) -> str:
    normalized = " ".join(text.split())  # whitespace drift must not trip detection
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
 
def is_human_edited(current_text: str, stored_hash: str | None) -> bool:
    if stored_hash is None:
        # Legacy rows with no hash are treated as edited (fail safe)
        return True
    return content_hash(current_text) != stored_hash
 
def save_generated(conn, item_id: str, text: str, prompt_ver: int) -> None:
    conn.execute(
        """UPDATE artifacts
           SET body = ?, gen_hash = ?, prompt_version = ?, human_edited = 0
           WHERE item_id = ?""",
        (text, content_hash(text), prompt_ver, item_id),
    )

Two details matter. Normalize before hashing, so whitespace and line-ending drift do not masquerade as human edits. And treat legacy rows without a hash as edited — when you cannot tell, default to protecting the item. After rollout, about 6% of my candidates were skipped by edit detection. That 6% is precisely what my earlier full regenerations had been destroying.
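To see the normalization rule in action, the sketch below repeats content_hash from the listing and compares three versions of the same description. The sample strings are invented for the demonstration.

```python
import hashlib


def content_hash(text: str) -> str:
    # Same normalization as edit_guard.py: collapse every whitespace run
    # (spaces, tabs, CR/LF) to a single space before hashing.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


generated = "Offline maps for hikers.\nWorks without signal."
stored_hash = content_hash(generated)

# Same words, but CRLF line endings, a doubled space, and a trailing space.
crlf_copy = "Offline maps for hikers.\r\nWorks  without signal. "
# A real wording change made by a person.
hand_edit = "Offline maps for hikers.\nWorks without a signal."

whitespace_flagged = content_hash(crlf_copy) != stored_hash
wording_flagged = content_hash(hand_edit) != stored_hash
print(whitespace_flagged, wording_flagged)
```

The trade-off is that a human edit consisting only of whitespace changes is invisible to this check. For tags and short descriptions that seems acceptable, but it is worth knowing.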

Running inside a daily budget, with a cursor that resumes

I cut backfill execution by daily budget — dollars or tokens, either works. When the cap is hit, the run stops; the next day it continues from where it left off. The cursor is what makes "continues" true.

# backfill_runner.py — resumable runner with a spend cap
# Problem it solves: execution control that survives crashes and stops at the cap
# (the cap is checked before each request, so overshoot is at most one item)
import json, os, time
from google import genai
 
DAILY_TOKEN_BUDGET = 600_000     # daily cap (input + output tokens)
STATE_PATH = "backfill_state.json"
 
def load_state() -> dict:
    if os.path.exists(STATE_PATH):
        with open(STATE_PATH) as f:
            return json.load(f)
    return {"date": "", "spent_tokens": 0, "done_ids": []}
 
def run_batch(candidates: list[str], build_request, apply_result) -> None:
    state = load_state()
    today = time.strftime("%Y-%m-%d")
    if state["date"] != today:
        state = {"date": today, "spent_tokens": 0, "done_ids": state["done_ids"]}
 
    client = genai.Client()
    for item_id in candidates:
        if item_id in state["done_ids"]:
            continue  # cursor: skip what is already done
        if state["spent_tokens"] >= DAILY_TOKEN_BUDGET:
            break     # budget reached: stop for today
        req = build_request(item_id)
        resp = client.models.generate_content(
            model="gemini-flash-latest",
            contents=req["contents"],
            config=req["config"],
        )
        usage = resp.usage_metadata
        state["spent_tokens"] += (usage.prompt_token_count or 0) + (
            usage.candidates_token_count or 0
        )
        apply_result(item_id, resp.text)
        state["done_ids"].append(item_id)
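        tmp_path = STATE_PATH + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)  # persist per item, so a crash repeats at most one item
        os.replace(tmp_path, STATE_PATH)  # atomic swap: a crash mid-write cannot corrupt the state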

Charge the budget with measured usage_metadata, not estimates. Pre-run estimates drift badly: image resolution differences alone can double actual token counts. When I managed by estimate, month-end billing kept surprising me; switching to measured consumption brought the error down to a few percent.
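It helps to know roughly how much a cap buys. The sketch below uses the 600,000-token cap from the runner and the per-item averages quoted earlier (2,800 input plus 400 output). It assumes every item costs exactly the average, which real runs will not, so treat the result as a rough planning figure.

```python
import math

DAILY_TOKEN_BUDGET = 600_000       # cap from backfill_runner.py
AVG_TOKENS_PER_ITEM = 2_800 + 400  # article's input + output averages


def items_per_day(budget: int, per_item: int) -> int:
    """Items processed before the cap stops the run.

    The runner checks the cap before each request, so it keeps going until
    spend reaches the cap. That makes the count a ceiling, not a floor.
    """
    return math.ceil(budget / per_item)


def days_to_drain(backlog: int, budget: int, per_item: int) -> int:
    """Days needed to work through a backlog at the average item cost."""
    return math.ceil(backlog / items_per_day(budget, per_item))


per_day = items_per_day(DAILY_TOKEN_BUDGET, AVG_TOKENS_PER_ITEM)
worst_case_days = days_to_drain(4_200, DAILY_TOKEN_BUDGET, AVG_TOKENS_PER_ITEM)
print(per_day, worst_case_days)
```

The 4,200 figure is an upper bound on the backlog. The score floor and edit detection both shrink the real queue, so the actual drain time is shorter.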

If the backfill is not urgent, pushing it through the Batch API lowers the unit price and stops it from competing with daytime traffic for rate limits. My setup settled into "collect today's top candidates, submit one batch at night." Separating the execution window from steady-state jobs eliminated the simultaneous-429 evenings entirely.

Comparing old and new before replacing

Regenerated output is worse than the old output more often than intuition suggests. I run a mechanical check immediately before replacement.

# replace_gate.py — pre-replacement comparison gate
# Problem it solves: "newer must be better" quietly regressing quality
BANNED = {"best ever", "ultimate", "No.1"}
 
def gate(old: dict, new: dict) -> str:
    """Returns 'replace' / 'keep' / 'hold'."""
    # 1) structural checks: tag count and description length within bounds
    if not (3 <= len(new["tags"]) <= 8):
        return "keep"
    if not (40 <= len(new["description"]) <= 160):
        return "keep"
    # 2) banned phrasing
    if any(b in new["description"] for b in BANNED):
        return "keep"
    # 3) clearly thinner information: hold for human review
    if len(set(new["tags"])) < len(set(old["tags"])) - 2:
        return "hold"
    return "replace"
 
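def apply_with_archive(conn, item_id: str, old_body: str, new_body: str) -> None:
    # Archive the old value and replace in one transaction (keeps rollback open).
    # `with conn` (sqlite3) commits on success and rolls back if either statement fails.
    # Caveat: only body changes here. gen_hash and prompt_version also need
    # refreshing (as save_generated does), or the next run's edit detection
    # will mistake the replaced row for a human edit.
    with conn:
        conn.execute(
            "INSERT INTO artifact_history (item_id, body) VALUES (?, ?)",
            (item_id, old_body),
        )
        conn.execute(
            "UPDATE artifacts SET body = ? WHERE item_id = ?", (new_body, item_id)
        )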

The verdict is three-valued on purpose: keep when the new output is clearly worse, hold when it is ambiguous (a human looks at the queue), replace when it passes. And every replacement archives the old value first. When a regression slips through the gate, history lets you restore a single item instead of panicking.
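The restore path deserves to exist before you need it. Here is a minimal sketch, assuming the two-column artifact_history table from the listing and relying on SQLite's implicit rowid to find the most recent archive entry. The function name restore_previous is hypothetical, and a real schema might order by a timestamp column instead.

```python
import sqlite3


def restore_previous(conn: sqlite3.Connection, item_id: str) -> bool:
    """Put the most recently archived body back. Returns False if no history exists."""
    row = conn.execute(
        "SELECT rowid, body FROM artifact_history "
        "WHERE item_id = ? ORDER BY rowid DESC LIMIT 1",
        (item_id,),
    ).fetchone()
    if row is None:
        return False
    history_rowid, old_body = row
    with conn:  # one transaction: both statements apply, or neither does
        conn.execute(
            "UPDATE artifacts SET body = ? WHERE item_id = ?", (old_body, item_id)
        )
        conn.execute("DELETE FROM artifact_history WHERE rowid = ?", (history_rowid,))
    return True


# Demo against an in-memory database with a simplified schema (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifacts (item_id TEXT PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE artifact_history (item_id TEXT, body TEXT)")
conn.execute("INSERT INTO artifacts VALUES ('app-1', 'regressed body')")
conn.execute("INSERT INTO artifact_history VALUES ('app-1', 'good old body')")
conn.commit()

restored = restore_previous(conn, "app-1")
body_after = conn.execute(
    "SELECT body FROM artifacts WHERE item_id = 'app-1'"
).fetchone()[0]
nothing_to_restore = restore_previous(conn, "app-1")
```

A restored body no longer matches gen_hash, so edit detection will skip the item on later runs. For something you just rolled back by hand, that is arguably the safe outcome.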

The hold rate is worth watching as a metric. Mine sits at 2–3% normally, but after one revision it jumped to 15%. The cause was a flaw in the prompt revision itself; I paused the backfill and redid the revision. The gate's hold rate turns out to be a lagging indicator of prompt-revision quality.
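Tracking that metric takes only a few lines. The sketch below computes the hold rate from a run's gate verdicts and flags when to pause. The 10% alarm and the 50-item minimum sample are hypothetical values, picked to sit between the 2–3% baseline and the 15% spike described above; tune both to your own history.

```python
from collections import Counter

HOLD_RATE_ALARM = 0.10  # hypothetical threshold, not from the article
MIN_SAMPLES = 50        # hypothetical: avoid alarming on a handful of items


def hold_rate(verdicts: list[str]) -> float:
    """Fraction of gate verdicts that were 'hold'."""
    if not verdicts:
        return 0.0
    return Counter(verdicts)["hold"] / len(verdicts)


def should_pause(
    verdicts: list[str],
    alarm: float = HOLD_RATE_ALARM,
    min_samples: int = MIN_SAMPLES,
) -> bool:
    """True when enough items were gated and too many of them were held."""
    return len(verdicts) >= min_samples and hold_rate(verdicts) > alarm


normal_run = ["replace"] * 95 + ["keep"] * 3 + ["hold"] * 2
bad_revision = ["replace"] * 80 + ["keep"] * 5 + ["hold"] * 15
print(hold_rate(normal_run), should_pause(normal_run))
print(hold_rate(bad_revision), should_pause(bad_revision))
```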

What months of operation taught me

A few things you will not find in official documentation.

Weekly score recomputation is enough. Recomputing before every run reshuffles the ranking and fights the cursor's done-list — I nearly processed the same items twice. Freeze the candidate list weekly; do not move the ranking mid-week.
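One way to implement the weekly freeze is to cache the ranked ID list in a file keyed by ISO week and recompute only when the week changes. This is a sketch under that assumption; the file layout and the function names are mine, and compute stands in for a call such as top_candidates.

```python
import datetime
import json
import os
from typing import Callable


def iso_week_key(day: datetime.date) -> str:
    """Return a key like '2026-W28' identifying the ISO week of the given day."""
    year, week, _ = day.isocalendar()
    return f"{year}-W{week:02d}"


def frozen_candidates(
    path: str, compute: Callable[[], list[str]], today: datetime.date
) -> list[str]:
    """Return this week's frozen candidate IDs, recomputing once per ISO week."""
    key = iso_week_key(today)
    if os.path.exists(path):
        with open(path) as f:
            frozen = json.load(f)
        if frozen.get("week") == key:
            return frozen["ids"]  # same week: ranking stays put
    ids = compute()  # new week (or first run): rescore and freeze
    with open(path, "w") as f:
        json.dump({"week": key, "ids": ids}, f)
    return ids
```

The runner then iterates over the frozen list every night. The cursor and the ranking stop fighting because the ranking only moves at week boundaries.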

Design for a backfill that never finishes. Making "clear the backlog" a KPI tempts you to lower the threshold just to consume it. Let it go quiet when nothing scores above the floor; the next prompt revision reopens the gap and it resumes on its own. Since adopting that rhythm, wasted spend has essentially disappeared.
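One wrinkle in "resumes on its own": the runner as listed skips any ID already in done_ids, so an item processed under one prompt version would still be skipped after the next revision unless the done-list is reset or made version-aware. A minimal sketch of the second option follows. The key format and both function names are hypothetical, not part of the runner above.

```python
def done_key(item_id: str, prompt_version: int) -> str:
    """Cursor key that records which prompt version an item was processed under."""
    return f"{item_id}@v{prompt_version}"


def pending(
    candidates: list[str], done_keys: set[str], prompt_version: int
) -> list[str]:
    """Candidates not yet processed under the current prompt version."""
    return [c for c in candidates if done_key(c, prompt_version) not in done_keys]


# Two items were finished under prompt version 3.
done = {done_key("app-1", 3), done_key("app-2", 3)}
candidates = ["app-1", "app-2", "app-3"]

still_todo_v3 = pending(candidates, done, prompt_version=3)
reopened_v4 = pending(candidates, done, prompt_version=4)
print(still_todo_v3, reopened_v4)
```

Under version 3 only the untouched item remains. Once version 4 ships, all three are eligible again without anyone clearing the state file.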

Install the edit-detection hash before you need the backfill. Legacy rows without hashes all fall to the safe side (skipped), so the later you add hashing, the smaller your backfill's reach. If you run a generation pipeline today, ship the save_generated hash write ahead of any backfill planning. Your future self will be grateful.

A prompt improvement only becomes an asset-wide improvement once it reaches backward. Start with one line: add the edit-detection hash to today's pipeline. If you are wrestling with the same problem, I hope these notes save you a detour.
