API / SDK · 2026-06-17 · Advanced

Keep Your Flash-to-Pro Routing Threshold Honest with Shadow Re-evaluation

A Flash-generates, Pro-on-low-confidence router starts drifting the moment you hand-pick its threshold. This is a working build of a loop that samples your kept-Flash outputs, scores them against Pro, and recalibrates the threshold from a quality budget.



After I saw that Gemini Enterprise had made 3.5 Flash the default on June 8, with no way to turn it off, I opened up the router in my publishing automation for the first time in a while. It is the familiar two-stage setup: Flash classifies article metadata, and only the shaky outputs get sent to Pro. The routing rule read: if confidence < 0.7, escalate to Pro.

The problem was that I could not remember the last time I had revisited that 0.7. I had set it half a year ago after eyeing a handful of samples and deciding "somewhere around here." Models had been updated since, and the mix of articles I handle had shifted. Only the threshold stayed frozen, and nobody had checked whether it was still reasonable.

This article is a build log for fixing exactly that: a threshold that quietly goes stale while nobody touches it. It is not about wiring up a static router. It assumes you already have one, and it adds a loop that checks after the fact whether each routing decision was correct, then adjusts the threshold accordingly.

A static threshold is already stale the moment you set it

Confidence-based model routers have a weakness that is easy to miss. When we choose the threshold, we judge quality against "the samples we have right now." But production traffic shifts over time. New article categories appear, input lengths change, and Flash itself gets updated so that the meaning of its self-reported confidence drifts.

When that happens, the threshold drifts relative to reality even though its value never moved. You may have been accepting every Flash output that reported confidence 0.75 as "confident," but if the quality in that band has quietly dropped, a frozen threshold cannot notice. The reverse happens too: you keep sending work to Pro that Flash could have handled fine, and you pay for it.
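To make that kind of drift visible as a number, one option is to group graded samples into confidence bands and compare the disagreement rate per band over time. The sketch below assumes you already have (confidence, agreed) pairs, where agreed means the Flash answer matched a Pro answer for the same input. The function name, the band width, and the sample data are all hypothetical and only serve as illustration.

```python
import math
from collections import defaultdict

def disagreement_by_band(records, n_bands=10):
    """Group (confidence, agreed) pairs into equal-width confidence bands.

    Returns {band_lower_edge: (sample_count, disagreement_rate)}.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for confidence, agreed in records:
        # The small epsilon keeps values such as 0.7 in the 0.7 band despite
        # float rounding; confidence == 1.0 is folded into the top band.
        band = min(math.floor(confidence * n_bands + 1e-9), n_bands - 1)
        totals[band] += 1
        if not agreed:
            misses[band] += 1
    return {
        round(band / n_bands, 4): (totals[band], misses[band] / totals[band])
        for band in sorted(totals)
    }

# Hypothetical data: the same 0.7 band, graded in two different months.
january = [(0.75, True), (0.72, True), (0.78, True), (0.74, False)]
june = [(0.75, False), (0.72, True), (0.78, False), (0.74, False)]
print(disagreement_by_band(january))
print(disagreement_by_band(june))
```

In this made-up data the threshold of 0.7 would have kept all eight outputs, yet the band's disagreement rate tripled between the two windows. Real bands need far more than four samples before the rate means anything.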

The nasty part is that both kinds of decay progress in silence. No errors fire. Either the bill creeps up or quality creeps down, and your own logs cannot tell you whether the threshold was right, because the strong model never graded the outputs you let through on the Flash side.
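Closing that gap means having the strong model grade a small sample of the outputs that stayed on the Flash side. Here is a minimal sketch of such a shadow check, under a few assumptions: the routed dict has the keys "answer", "confidence", and "escalated"; strong_fn is any callable that returns the strong model's label for a text; and agreement is judged by a normalized exact match, which only suits short classification labels. The helper names and the 5% default rate are hypothetical.

```python
import hashlib

def in_shadow_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Compare as integers so rate=1.0 selects everything and rate=0.0 nothing.
    return int.from_bytes(digest[:8], "big") < rate * 2**64

def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(label.lower().split())

def shadow_check(request_id, text, routed, strong_fn, rate=0.05):
    """Return a shadow record for a kept-Flash result, or None if not sampled."""
    if routed["escalated"] or not in_shadow_sample(request_id, rate):
        return None
    strong_answer = strong_fn(text)
    return {
        "request_id": request_id,
        "confidence": routed["confidence"],
        "fast_answer": routed["answer"],
        "strong_answer": strong_answer,
        "agreed": normalize(routed["answer"]) == normalize(strong_answer),
    }

# Usage with a stand-in for the strong model (no API call is made here).
def fake_strong(text: str) -> str:
    return "Tutorial"

kept = {"answer": "tutorial ", "confidence": 0.82, "escalated": False}
record = shadow_check("req-001", "some article text", kept, fake_strong, rate=1.0)
print(record["agreed"])
```

Hashing the request ID rather than calling a random generator keeps the sample reproducible: rerunning the job over the same logs grades the same requests. The shadow call should run off the request path so it adds cost but no latency.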

A quick refresher on the minimal confidence router

Before the recalibration part, let me pin down the underlying router in its smallest form. Flash returns an answer and a self-reported confidence via structured output, and you call Pro only when confidence falls below the threshold.

from google import genai
from google.genai import types
from pydantic import BaseModel
 
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
FAST_MODEL = "gemini-3.5-flash"
STRONG_MODEL = "gemini-2.5-pro"  # swap to 3.5 Pro once it is GA
 
class Verdict(BaseModel):
    answer: str
    confidence: float  # self-reported, 0.0 to 1.0
 
def classify_fast(text: str) -> Verdict:
    res = client.models.generate_content(
        model=FAST_MODEL,
        contents=f"Classify the following article and add a confidence from 0 to 1.\n\n{text}",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=Verdict,
            temperature=0,
        ),
    )
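    # res.parsed is None when the reply does not fit the schema; report zero
    # confidence in that case so route() escalates instead of crashing.
    return res.parsed or Verdict(answer="", confidence=0.0)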
 
def classify_strong(text: str) -> str:
    res = client.models.generate_content(
        model=STRONG_MODEL,
        contents=f"Classify the following article.\n\n{text}",
        config=types.GenerateContentConfig(temperature=0),
    )
    # res.text can be None (for example on a blocked response).
    return (res.text or "").strip()
 
def route(text: str, threshold: float) -> dict:
    v = classify_fast(text)
    if v.confidence < threshold:
        return {"answer": classify_strong(text), "model": STRONG_MODEL,
                "confidence": v.confidence, "escalated": True}
    return {"answer": v.answer, "model": FAST_MODEL,
            "confidence": v.confidence, "escalated": False}

This much is what many people already run. The open question is how you choose and maintain that threshold. Self-reported confidence is convenient, but nothing guarantees that the model estimates its own confidence accurately. That is precisely why you need a mechanism to check that confidence from the outside.
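One way to check from the outside is to stop choosing the threshold by eye and derive it from graded samples and a quality budget. The following is a rough sketch, not a definitive implementation. It assumes a list of (confidence, agreed) pairs where agreed records whether Flash matched Pro on the same input; escalated requests can supply such pairs at no extra cost, since both answers already exist for them. The names solve_threshold and propose, the minimum sample count, the deadband, and the step cap are all hypothetical choices.

```python
def solve_threshold(records, budget, min_samples=30):
    """Lowest threshold whose kept set (confidence >= t) stays within budget.

    records: list of (confidence, agreed) pairs.
    budget:  maximum tolerated disagreement rate among kept outputs.
    Returns None when no candidate has enough samples within budget.
    """
    ordered = sorted(records, key=lambda r: r[0], reverse=True)
    best = None
    kept = misses = 0
    i = 0
    while i < len(ordered):
        conf = ordered[i][0]
        # Absorb every record tied at this confidence before evaluating it.
        while i < len(ordered) and ordered[i][0] == conf:
            kept += 1
            misses += 0 if ordered[i][1] else 1
            i += 1
        if kept >= min_samples and misses / kept <= budget:
            best = conf
    return best

def propose(current, solved, deadband=0.02, max_step=0.05):
    """Move toward the solved threshold, ignoring noise and capping the step."""
    if solved is None or abs(solved - current) < deadband:
        return current
    step = max(-max_step, min(max_step, solved - current))
    return round(current + step, 4)

# Hypothetical graded samples: 12 records across three confidence levels.
samples = (
    [(0.9, True)] * 4
    + [(0.8, True)] * 3 + [(0.8, False)]
    + [(0.6, True)] * 2 + [(0.6, False)] * 2
)
solved = solve_threshold(samples, budget=0.15, min_samples=4)
print(solved, propose(0.7, solved))
```

With these samples, keeping everything at or above 0.8 gives 1 miss in 8 (12.5%), which fits a 15% budget, while extending down to 0.6 gives 3 in 12 (25%), which does not, so the solver returns 0.8 and the capped proposal moves from 0.7 to 0.75. The deadband and step cap are there so that a single noisy night cannot swing the threshold.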
