Gemini Lab
Advanced · 2026-04-24

Safely Migrating Gemini Model Versions with Shadow Traffic — A Production Pattern for Measuring Output Drift

Stop treating Gemini model migrations as a coin flip. This guide walks through a production-ready shadow traffic architecture — duplicate real inputs to the new model, quantify output drift, and cut over progressively. Includes Python and Cloud Tasks code you can ship today.

Tags: gemini, model-migration, shadow-traffic, production, sre, canary


When Google announced that Gemini 2.0 Flash would be deprecated in June 2026, I had three production apps running on it. My first reaction was not "time to migrate" but "I have no confidence the new model will behave the same way on real traffic." Running a few dozen staged test cases never tells you how a model behaves against the long tail of real user inputs. Rolling the dice on a deploy is not a risk I'm willing to take, even for a side project.

Shadow traffic solves this structurally. You send each production request to both the old and new models in parallel, return the old model's response to the user, and quietly collect the new model's response for later comparison. Over a few days of real traffic, you accumulate enough data to ask concrete questions: "On what percentage of inputs do the outputs materially diverge?" and "Which categories of input are most at risk?" Those answers replace gut feel with evidence.
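
To make those two questions concrete, here is a minimal sketch of a per-category drift rate computed over collected response pairs. The record shape `(category, primary, shadow)`, the character-level similarity from `difflib`, and the 0.8 threshold are all assumptions for illustration; a real comparison would probably need a task-specific or embedding-based measure.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def has_drifted(primary: str, shadow: str, threshold: float = 0.8) -> bool:
    """True when character-level similarity falls below the threshold."""
    return SequenceMatcher(None, primary, shadow).ratio() < threshold


def drift_by_category(records, threshold: float = 0.8) -> dict:
    """Map each category to the fraction of its pairs that drifted.

    `records` is an iterable of (category, primary_text, shadow_text) tuples.
    """
    totals = defaultdict(int)
    drifted = defaultdict(int)
    for category, primary, shadow in records:
        totals[category] += 1
        if has_drifted(primary, shadow, threshold):
            drifted[category] += 1
    return {category: drifted[category] / totals[category] for category in totals}
```

Sorting the resulting dictionary by value gives a first, rough ranking of which input categories deserve a closer look before cutover.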

I run several Gemini-powered products, and reinventing this plumbing every time a model is deprecated has gotten old. What follows is the pattern I've distilled across four sites, with the code and the operational details that made it survive contact with real traffic.

Why LLM migrations default to "deploy and hope"

Migrating between LLM versions differs from migrating between API versions in one critical way: the wire-level contract (request/response shapes) and the behavioral contract (what the model actually says) move independently. You can keep every endpoint identical and still break users, because the same prompt can produce a subtly different answer.

One migration I lived through had the old model replying in Japanese and the new model occasionally mixing English phrases into the output. A small system-instruction tweak fixed it, but if the first signal had been customer complaints, the damage would already have been done. Unit tests and regression suites only cover the cases you already thought of, which are precisely the cases least likely to break in production. The long tail of real inputs is exactly what shadow traffic lets you examine before flipping the switch.
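
The language-mixing regression described above is the kind of drift a cheap deterministic check can flag before anyone reads the outputs. A rough sketch, assuming Japanese-language output is expected: it measures the share of ASCII letters among all letters in a response. The function names and the 0.3 threshold are illustrative assumptions, not values taken from that migration.

```python
def ascii_letter_ratio(text: str) -> float:
    """Share of alphabetic characters that are ASCII (0.0 if there are no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters)


def looks_language_mixed(text: str, max_ascii_ratio: float = 0.3) -> bool:
    """Flag responses with more ASCII letters than expected for Japanese output."""
    return ascii_letter_ratio(text) > max_ascii_ratio
```

Legitimate product names and code identifiers will raise the ratio, so the threshold would need tuning against primary-model output first.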

The core shadow traffic architecture

The idea is straightforward. Your frontend (the production path) receives a user request, calls the primary (old) model, returns that response to the user, and — in parallel and asynchronously — fires the same input at the shadow (new) model. The user never sees the shadow output; it exists only to measure drift.

Three invariants must hold, or the whole exercise becomes dangerous.

First, shadow calls must not affect user-visible latency. If you await the shadow, you've effectively serialized two API calls and roughly doubled p95 latency. I made exactly this mistake on my first attempt.

Second, shadow failures must not leak into the user path. An exception from the shadow call should never, under any circumstance, alter what the user receives.

Third, the shadow response must be incapable of being served to a user. In multi-worker setups, it's surprisingly easy to accidentally cross-contaminate responses. Treat that as a correctness issue, not a race condition to chase later.
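
One possible way to make the third invariant structural rather than a matter of discipline is to give shadow output its own type and have the response path refuse anything else. This is a sketch with hypothetical names (`UserResponse`, `ShadowRecord`, `render_for_user`), offered as an optional hardening step rather than something this pattern requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserResponse:
    """Output of the primary model; the only type allowed on the response path."""
    request_id: str
    text: str


@dataclass(frozen=True)
class ShadowRecord:
    """Output of the shadow model; destined for the comparison log only."""
    request_id: str
    text: str


def render_for_user(response: UserResponse) -> str:
    # Runtime guard in addition to the type hint: a ShadowRecord that reaches
    # this point is a bug, so fail loudly instead of serving it.
    if not isinstance(response, UserResponse):
        raise TypeError(f"refusing to serve {type(response).__name__} to a user")
    return response.text
```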

Here is the minimal implementation that honors all three.

# shadow_traffic.py
# Returns the primary response to the user, then dispatches the same input to
# the shadow asynchronously. User latency is unaffected; shadow failures are
# logged and swallowed.
 
import asyncio
import logging
import random

from google import genai

logger = logging.getLogger(__name__)


class ShadowTrafficClient:
    def __init__(
        self,
        primary_model: str = "gemini-2.5-pro",
        shadow_model: str = "gemini-3-pro",
        shadow_sample_rate: float = 0.1,  # shadow 10% of production traffic
    ):
        self.client = genai.Client()
        self.primary_model = primary_model
        self.shadow_model = shadow_model
        self.shadow_sample_rate = shadow_sample_rate
        # The event loop keeps only weak references to tasks, so hold strong
        # references here until each shadow task finishes.
        self._shadow_tasks: set[asyncio.Task] = set()

    async def generate(self, prompt: str, request_id: str) -> str:
        # 1. Call the primary and prepare to return its response
        primary_response = await self._call_model(
            self.primary_model, prompt, request_id, "primary"
        )

        # 2. Sample, then fire-and-forget into the shadow path
        if random.random() < self.shadow_sample_rate:
            task = asyncio.create_task(
                self._shadow_execute(prompt, request_id, primary_response)
            )
            self._shadow_tasks.add(task)
            task.add_done_callback(self._shadow_tasks.discard)

        return primary_response  # user only ever sees the primary
 
    async def _call_model(self, model: str, prompt: str, req_id: str, role: str) -> str:
        try:
            response = await self.client.aio.models.generate_content(
                model=model,
                contents=prompt,
            )
            return response.text
        except Exception as e:
            # Primary failures bubble up; shadow failures are caught below.
            logger.warning(f"[{req_id}] {role} call failed: {e}")
            raise
 
    async def _shadow_execute(self, prompt: str, req_id: str, primary_response: str):
        # Any shadow failure is swallowed to prevent contaminating user experience.
        try:
            shadow_response = await self._call_model(
                self.shadow_model, prompt, req_id, "shadow"
            )
            await self._log_comparison(req_id, prompt, primary_response, shadow_response)
        except Exception as e:
            logger.warning(f"[{req_id}] shadow failed (ignored): {e}")
 
    async def _log_comparison(self, req_id, prompt, primary, shadow):
        # Write to BigQuery / Firestore / S3. Implementation comes later.
        pass

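The operative detail: asyncio.create_task returns immediately, so the user's response time depends solely on the primary. The event loop holds only a weak reference to a task, so keep your own reference until it finishes, or it can be garbage-collected mid-flight. If you ever catch yourself about to type await shadow_task, step away from the keyboard. I promise, I've been there.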
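
The latency and failure-isolation properties can be checked without calling a real model. The sketch below replaces the Gemini calls with hypothetical stand-ins (`fake_primary` and `fake_shadow`, with made-up sleep durations) and times the user-facing path while the shadow is both slow and failing.

```python
import asyncio
import time

# Strong references to in-flight shadow tasks (the loop only keeps weak ones).
background_tasks: set = set()


async def fake_primary(prompt: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for primary model latency
    return f"primary:{prompt}"


async def fake_shadow(prompt: str) -> str:
    await asyncio.sleep(0.5)  # the shadow is slow...
    raise RuntimeError("shadow model unavailable")  # ...and then fails


async def shadow_execute(prompt: str) -> None:
    try:
        await fake_shadow(prompt)
    except Exception:
        pass  # swallowed: shadow failures never reach the user path


async def handle(prompt: str) -> str:
    response = await fake_primary(prompt)
    task = asyncio.create_task(shadow_execute(prompt))  # not awaited
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return response


async def timed_handle(prompt: str):
    start = time.perf_counter()
    response = await handle(prompt)
    elapsed = time.perf_counter() - start
    # Test-only: wait for the shadow so the loop does not close on a pending task.
    await asyncio.gather(*background_tasks)
    return response, elapsed
```

Running `asyncio.run(timed_handle("hi"))` should return the primary response with an elapsed time close to the primary's sleep, not the shadow's, and no exception should surface from the failing shadow.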
