Gemini Lab
Dev Tools / 2026-06-30 / Advanced

Tracing Which Prompt Revision Moved Your Quality — Prompt Versioning for a Gemini Pipeline

Editing prompts in place erases the trail: when quality shifts you can't tell whether the model moved or your wording did. Here's a small system that pins prompts by content hash, stamps every generation with the model ID and revision, and bisects a quality drop down to the exact revision boundary, with copy-paste Python.

Tags: gemini, gemini-dev, prompt-management, production, pipeline

One morning, the Japanese drafts from my own pipeline, which drafts articles with Gemini every day, came out oddly stiff. I had edited a single line of the prompt in place the night before, but the logs couldn't tell me whether the stiffness came from that edit or from gemini-flash-latest quietly resolving to a newer model. Because I had overwritten the prompt directly, nothing in the generation history recorded what had changed or when.

When you run a generation pipeline as an indie developer for long enough, this "I can't trace the cause" feeling quietly compounds. The model moves on its own, and the prompt moves by your hand. Assuming both move, I at least wanted the prompt side to keep an exact record of what changed and when. Here is the smallest system I could build to do that.

Why editing in place erases the cause

Inlining prompts as f-strings and editing them in place has two holes.

The first is missing history. A git diff shows the wording change, but the generation log has nothing tying a given day's output to the wording it came from. Output and prompt text aren't linked, so you can't reconcile them after the fact.

The second is confounding. A -latest alias such as gemini-flash-latest can be repointed to a different underlying model without any change on your side. You also change the prompt yourself. When quality moves while both are unpinned, you cannot separate the two causes, even in principle. Diagnosis starts by pinning one side so it can't move.
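To make the confounding concrete, here is a minimal sketch of the comparison you would want to run between two logged generations. It assumes each log record already carries a resolved model ID and some prompt revision identifier; the `GenerationRecord` class, its field names, and `what_moved` are hypothetical names chosen for illustration, not part of any SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRecord:
    """Hypothetical log record; the field names are illustrative only."""
    model_id: str         # the concrete model that answered, not the -latest alias
    prompt_revision: str  # any identifier that changes whenever the wording changes
    score: float          # whatever quality score the pipeline already computes


def what_moved(before: GenerationRecord, after: GenerationRecord) -> str:
    """Name which side changed between two logged generations."""
    model_changed = before.model_id != after.model_id
    prompt_changed = before.prompt_revision != after.prompt_revision
    if model_changed and prompt_changed:
        return "both"     # confounded: the shift cannot be attributed to one side
    if model_changed:
        return "model"
    if prompt_changed:
        return "prompt"
    return "neither"      # look at sampling noise or the input data instead


good = GenerationRecord("model-a", "rev-1", 0.82)
bad = GenerationRecord("model-a", "rev-2", 0.61)
verdict = what_moved(good, bad)  # only the prompt revision differs here
```

The "both" branch is the situation described above: once both fields differ between a good day and a bad day, the log alone cannot say which one caused the drop.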

I underestimated this for a while and fixed things by gut feeling. After a few detours where a "fixed" prompt introduced a different regression, I finally arrived at the obvious move: treat the wording as a versioned artifact.

A tiny registry that pins prompts by content hash

You don't need a heavy prompt platform. Store each prompt as one file per revision and use its content hash as the revision ID.

# prompt_registry.py
import hashlib
import json
from pathlib import Path

PROMPT_DIR = Path("prompts")


def _content_hash(text: str) -> str:
    # Normalize newline noise before hashing (so CRLF/LF doesn't split the revision)
    normalized = text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def load_prompt(prompt_id: str) -> dict:
    """Read prompts/<prompt_id>.txt and return it with its content hash."""
    path = PROMPT_DIR / f"{prompt_id}.txt"
    if not path.exists():
        raise FileNotFoundError(f"prompt not found: {prompt_id}")
    text = path.read_text(encoding="utf-8")
    return {
        "id": prompt_id,
        "revision": _content_hash(text),
        "text": text,
    }

The key choice is making the revision ID a content hash rather than a running number. A counter invites the "changed the body but forgot to bump the number" bug; a content hash changes the moment a single character changes. Conversely, if you revert to identical wording, the revision returns to its old value, so you never get "same text, treated as a new revision."
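A short check of those properties, as a sketch. It copies the normalization from `_content_hash` into a standalone `content_hash` function so the snippet runs on its own; the two prompt strings are made-up examples, not the wording from my pipeline.

```python
import hashlib


def content_hash(text: str) -> str:
    # Same normalization as _content_hash in prompt_registry.py
    normalized = text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


# Made-up wording, for illustration only
v1 = "Write the draft in natural Japanese.\nAvoid stiff phrasing."
v2 = "Write the draft in natural Japanese.\nAvoid stiff phrasing!"  # one character differs

rev_v1 = content_hash(v1)
# Same wording saved with Windows line endings and a trailing newline
rev_v1_crlf = content_hash(v1.replace("\n", "\r\n") + "\r\n")
rev_v2 = content_hash(v2)
# Reverting to the original wording
rev_reverted = content_hash(v1)

print(rev_v1 == rev_v1_crlf)   # line-ending noise does not create a new revision
print(rev_v1 != rev_v2)        # a one-character edit does
print(rev_reverted == rev_v1)  # reverting restores the old revision ID
```

One design note: truncating the digest to 12 hex characters keeps log lines short while leaving 48 bits of hash, which is ample for telling apart the handful of revisions a single prompt file accumulates.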

Keep the wording out of code, in text files

Place prompts in text files outside your code, like prompts/article_ja_draft.txt. Then git log -- prompts/article_ja_draft.txt reads the revision history of that wording alone, and the rollback below becomes "revert one file."
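With the wording in a file, tying each output to the wording that produced it can be as small as one JSON line per generation. The following is a sketch under assumptions: `log_generation`, the record fields, the `generations.jsonl` file name, and the placeholder `"example-model-id"` are all hypothetical, and no Gemini call is made. The demo runs in a temporary directory so it does not touch a real `prompts/` folder.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def revision_of(text: str) -> str:
    # Same normalization and truncation as _content_hash in prompt_registry.py
    normalized = text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def log_generation(log_path: Path, prompt_file: Path, model_id: str, output: str) -> dict:
    """Append one JSON line tying an output to the prompt wording and model behind it."""
    text = prompt_file.read_text(encoding="utf-8")
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt_id": prompt_file.stem,
        "prompt_revision": revision_of(text),
        "model_id": model_id,          # hypothetical: pass whatever ID your client reports
        "output_chars": len(output),   # illustrative; log whatever you score on
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# Demo: edit the prompt file between two generations and compare the log lines
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    prompt_file = root / "prompts" / "article_ja_draft.txt"
    prompt_file.parent.mkdir(parents=True)
    log_path = root / "generations.jsonl"

    prompt_file.write_text("Write the draft in natural Japanese.\n", encoding="utf-8")
    first = log_generation(log_path, prompt_file, "example-model-id", "draft one")

    prompt_file.write_text("Write the draft in natural, conversational Japanese.\n", encoding="utf-8")
    second = log_generation(log_path, prompt_file, "example-model-id", "draft two")

    logged = [json.loads(line) for line in log_path.read_text(encoding="utf-8").splitlines()]
```

After the second call, the two log lines share a `model_id` but carry different `prompt_revision` values, which is exactly the evidence that was missing on the morning the drafts turned stiff.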

WHAT YOU'LL LEARN
Pin each prompt as a file keyed by content hash and stamp every generation log line with both the model ID and the prompt revision, so you can separate quality drift caused by the model from drift caused by your wording
Get a copy-paste Python bisect helper that walks your score timeline and pins the exact revision boundary where quality dropped
Learn a lightweight rollback workflow that pins a known-good revision and reverts a broken one in a single file operation