Gemini Lab
API / SDK · 2026-06-15 · Advanced

When the Default Model Silently Upgrades: Catching Prompt Regressions in Numbers

Gemini 3.5 Flash is now the default and you can no longer turn it off. Working from the assumption that responses can shift even when you never touch the prompt, this piece shows how to bundle prompt, model, and sampling into one variant and catch regressions with canaries and an LLM judge, in working code.

Tags: gemini, gemini-api, prompt-engineering, canary, llm-as-judge, production

The other day Gemini 3.5 Flash reached general availability, became the default in Enterprise apps, and the toggle to disable it simply disappeared. Reading that, I thought back to an unsettling half-day from about six months earlier. As an indie developer running my own chat app, I had not touched the prompt by a single character, yet one morning the tone of the responses turned oddly stiff for one slice of users. The change log was empty, and it ate my entire morning before I gave up looking for a cause I could prove.

My best guess is that a model checkpoint had been swapped behind the scenes. I could never confirm it. That inability to confirm was the real problem. I had only ever recorded when a prompt changed, who changed it, and why, so when the model moved, I had no ruler in hand to separate cause from coincidence.

Instrument for "it changed without me touching it"

Once 3.5 Flash is the default and cannot be disabled, any automation that calls the API without an explicit model ID is exposed, by policy rather than by accident, to "behavior changes one day." Because the change originates on the platform side, the only place to absorb it is your own design.

There is really only one way to absorb it: snapshot the conditions that produced each response and measure quality continuously, per condition. For open-ended workloads like chat, unit tests tell you nothing beyond "no error thrown." A decay where error stays at zero while quality quietly sinks slips right past them. That is exactly why you need to treat a prompt as an explicit version and run several versions side by side on production traffic, comparing them in numbers.

The point I want to press hardest: do not make the unit of versioning the prompt string. If you version only the prompt while the model and sampling parameters move independently underneath, you can never pin down which factor a measured difference belongs to. Make the unit a variant that bundles prompt plus model ID plus sampling config. That is the spine of this whole piece.
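As a rough illustration of what "variant as the unit" could look like, here is a minimal sketch using only the standard library. The field names, the default sampling values, and the model ID strings are placeholders I made up for the example, not values from the registry described in this article; the only idea taken from the text is that prompt, model ID, and sampling config are versioned together.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Variant:
    """One versioned unit: prompt + model ID + sampling config, bundled together."""
    prompt: str
    model_id: str                  # always pinned explicitly, never left to the platform default
    temperature: float = 0.7       # assumed default, for illustration only
    top_p: float = 0.95            # assumed default, for illustration only
    max_output_tokens: int = 1024  # assumed default, for illustration only

    def fingerprint(self) -> str:
        """Short stable hash over every field that can change a response."""
        canonical = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder model IDs: substitute whatever IDs your account actually exposes.
base = Variant(prompt="You are a friendly assistant.", model_id="example-model-v1")
same_prompt_new_model = Variant(prompt=base.prompt, model_id="example-model-v2")

# The prompt is identical, but the fingerprints differ because the model moved.
print(base.fingerprint(), same_prompt_new_model.fingerprint())
```

Logging the fingerprint with each response means a change in any one of the three factors shows up as a new version, even when the prompt string is untouched.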

The shape of it — four parts, and why shadow matters

There are four parts to build.

The Prompt Registry holds variants as Firestore documents, with a status field controlling whether each is serving, waiting in the wings, or retired. The Traffic Splitter picks a variant deterministically from a user ID and a task key, so the same person always gets the same version and the comparison never breaks mid-stream. The Metrics Collector is a thin wrapper around the API call that always writes one record: which variant, how much latency and how many tokens, and whether it succeeded or failed. The Evaluation Loop samples the accumulated logs, scores them with a judge model, and looks at the gap in mean score between variants.
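Two of these parts are small enough to sketch in plain Python. The version below is one possible shape, under assumptions of my own: `pick_variant`, `call_with_metrics`, the integer weights, the record keys, and the in-memory `sink` list are all hypothetical names, and `call` stands in for whatever function actually hits the API and returns the text plus a token count. Firestore is deliberately left out so the block stays self-contained.

```python
import hashlib
import time
from typing import Callable, Dict, List, Optional, Tuple


def pick_variant(user_id: str, task_key: str, weights: Dict[str, int]) -> str:
    """Map (user_id, task_key) to a variant ID; stable across calls and processes."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # hashlib is used instead of the built-in hash(), which is salted per process.
    digest = hashlib.sha256(f"{user_id}:{task_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for variant_id in sorted(weights):  # sorted so dict ordering cannot change the result
        bucket -= weights[variant_id]
        if bucket < 0:
            return variant_id
    raise ValueError("weights must be non-negative")


def call_with_metrics(
    variant_id: str,
    call: Callable[[], Tuple[str, int]],
    sink: List[dict],
) -> Optional[str]:
    """Run `call` and append exactly one record to `sink`, on success or failure."""
    record = {"variant_id": variant_id, "ok": False, "tokens": 0, "latency_ms": 0.0}
    start = time.perf_counter()
    try:
        text, tokens = call()
        record.update(ok=True, tokens=tokens)
        return text
    except Exception as exc:  # recorded, then swallowed; re-raising is an equally valid choice
        record["error"] = type(exc).__name__
        return None
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000.0
        sink.append(record)


weights = {"v1": 90, "v2": 10}  # hypothetical 90/10 canary split
sink: List[dict] = []
chosen = pick_variant("user-123", "chat", weights)
reply = call_with_metrics(chosen, lambda: ("hello", 7), sink)
```

Because the assignment depends only on the hash of the user ID and task key, changing the weights will move some users between variants; that is worth remembering before adjusting a split in the middle of a comparison.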

The single most useful design choice is splitting status into active and shadow. An active variant goes out to real traffic; a shadow one does not. It stays in the wings and only gets offline scoring on a small sample. Having a gate where you can discard "this version is clearly weaker" before it ever touches a user noticeably reduces production incidents. After I added that gate, I became far bolder about trying new variants — useful when you are a one-person team and every regression lands on you alone.
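The active/shadow gate described above might be expressed roughly as follows. This is a sketch under my own assumptions: `RegistryEntry`, `serving_weights`, and `shadow_jobs` are hypothetical names, the status strings "active", "shadow", and "retired" are my reading of the three states mentioned earlier, and a plain list stands in for the Firestore collection.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RegistryEntry:
    variant_id: str
    status: str      # assumed values: "active", "shadow", or "retired"
    weight: int = 0  # share of real traffic; only meaningful for active entries


def serving_weights(entries: List[RegistryEntry]) -> Dict[str, int]:
    """Only active variants are eligible for real traffic."""
    return {e.variant_id: e.weight for e in entries if e.status == "active" and e.weight > 0}


def shadow_jobs(
    entries: List[RegistryEntry],
    logged_inputs: List[str],
    sample_size: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Pair each shadow variant with a small random sample of past inputs for offline scoring."""
    rng = random.Random(seed)  # seeded so the same sample can be re-scored later
    sample = rng.sample(logged_inputs, min(sample_size, len(logged_inputs)))
    return [(e.variant_id, text) for e in entries if e.status == "shadow" for text in sample]


entries = [
    RegistryEntry("v1", "active", weight=90),
    RegistryEntry("v2", "shadow"),
    RegistryEntry("v0", "retired", weight=10),
]
live = serving_weights(entries)  # shadow and retired entries never reach users
offline = shadow_jobs(entries, ["hi", "help me", "thanks", "bye"], sample_size=2)
```

The design point is that promotion becomes a one-field change: a shadow variant that survives offline scoring is flipped to active and given a weight, and nothing else in the serving path has to know.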

What you'll learn
  • A Firestore registry that bundles prompt, model, and sampling into a single variant, with deterministic hashing for stable assignment
  • A three-stage path (shadow, canary, promotion) that freezes win/lose calls until a minimum sample size is reached
  • An evaluation batch that uses a stronger gemini-3-pro as judge and flags regressions with mean-score deltas and a quick z value
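To make the "mean-score delta and quick z value" idea concrete, here is one way it could be computed from two lists of judge scores. This is my own sketch, not the article's evaluation batch: the function name, the sample floor of 30, and the 1.96 threshold are assumptions chosen for illustration, and the z value is a simple two-sample statistic with unpooled variances.

```python
import math
from statistics import mean, variance
from typing import List


def compare_scores(
    baseline: List[float],
    candidate: List[float],
    min_samples: int = 30,      # assumed floor; below it, no win/lose call is made
    z_threshold: float = 1.96,  # assumed cutoff, roughly a 95% two-sided level
) -> dict:
    """Return the mean-score delta and a z value, or a frozen verdict below the sample floor."""
    if min(len(baseline), len(candidate)) < max(min_samples, 2):
        return {"verdict": "insufficient_data", "delta": None, "z": None}
    delta = mean(candidate) - mean(baseline)
    stderr = math.sqrt(
        variance(baseline) / len(baseline) + variance(candidate) / len(candidate)
    )
    if stderr > 0:
        z = delta / stderr
    else:
        # Both samples are constant: any nonzero gap is treated as decisive.
        z = math.copysign(math.inf, delta) if delta != 0 else 0.0
    if z <= -z_threshold:
        verdict = "regression"
    elif z >= z_threshold:
        verdict = "improvement"
    else:
        verdict = "no_clear_difference"
    return {"verdict": verdict, "delta": delta, "z": z}


# Made-up judge scores on a 1-5 scale, 40 samples per variant.
baseline_scores = [4.0, 5.0] * 20
candidate_scores = [3.0, 4.0] * 20
result = compare_scores(baseline_scores, candidate_scores)
```

Returning "insufficient_data" rather than a tentative verdict is what keeps an early, noisy gap from triggering a promotion or rollback before enough samples exist.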