GEMINI LAB
API / SDK, 2026-06-14, Intermediate

Generate With Flash, Escalate to Deep Think Only When Unsure: A Two-Stage Pipeline

With Deep Think opening up on the API, the move is not to route every request through the heavy model but to have Deep Think verify only when Flash's output looks shaky. Here is the cost reasoning and working code.


The first thing I thought when I heard Deep Think was partly opened on the API was: "Routing everything through it raises quality, but the bill stops being realistic." A heavy reasoning model is more accurate, but its per-token price and latency are an order of magnitude away from Flash. As an indie developer running an automation pipeline every day, that gap hits the end-of-month invoice directly.

So the design I landed on generates with cheap Flash and has Deep Think verify only when the output looks shaky. Calling heavy reasoning "only when needed" rather than "always" is how I aimed to raise the floor on accuracy without blowing up cost.

Why not route every request through the heavy model

Real-world requests are not uniform in difficulty. Flash answers most of them correctly, and only a fraction are hard cases that warrant deliberation. Routing all of them through Deep Think means paying the high rate for the easy majority too.

The two-stage idea is "answer first with the cheap model, send only the low-confidence ones to the expensive model." The key is to have Flash itself declare whether it is confident, and use that declaration as the mechanical basis for routing. Only once escalation runs without a human watching does it work as an automated pipeline.
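
The cost argument can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions, not measured pricing: flashCost and deepCost are hypothetical per-request costs in arbitrary units, and escalationRate is the share of requests sent on to Deep Think. Every request pays for Flash, and only the escalated share also pays for Deep Think.

```javascript
// cost-model.mjs (hypothetical helper, not part of the pipeline itself)
// All costs are in arbitrary units per request; plug in your own measured numbers.

// Expected cost of one request in the two-stage setup:
// Flash always runs, Deep Think runs only for the escalated share.
function twoStageCost({ flashCost, deepCost, escalationRate }) {
  return flashCost + escalationRate * deepCost;
}

// Escalation rate at which two-stage costs the same as sending everything
// to Deep Think. Below this rate the two-stage setup is cheaper.
function breakEvenRate({ flashCost, deepCost }) {
  return 1 - flashCost / deepCost;
}

// Illustrative numbers only: Deep Think assumed to cost 10x Flash per request.
const example = { flashCost: 1, deepCost: 10, escalationRate: 0.2 };
console.log(twoStageCost(example));   // cost per request with 20% escalation
console.log(breakEvenRate(example));  // escalation rate where the saving disappears
```

With these made-up numbers the two-stage path costs 3 units per request against 10 for routing everything through Deep Think, and it stays cheaper until roughly 90% of traffic is escalated. The real ratio depends on your own token counts and prices.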

Stage one: have Flash emit an answer and a confidence together

The point is to have Flash's response include not just the "answer" but a structured "confidence" and "reason for doubt." Receiving JSON means the downstream routing can read fields directly instead of scraping free-form text.

// stage1-flash.mjs
import { GoogleGenerativeAI } from "@google/generative-ai";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
 
const flash = genAI.getGenerativeModel({
  model: "gemini-3.5-flash",
  generationConfig: { responseMimeType: "application/json" },
});
 
export async function draftWithConfidence(question) {
  const prompt = `Answer the question and return JSON only.
Shape: {"answer": string, "confidence": number(0-1), "uncertain_reason": string}
confidence is your certainty in your answer. Declare it lower when the question is ambiguous, under-specified, or involves calculation.
Question: ${question}`;
 
  const res = await flash.generateContent(prompt);
  return JSON.parse(res.response.text());
}

Specifying responseMimeType: "application/json" makes Flash less likely to mix in a non-JSON preamble. Self-reported confidence is not a cure-all, but Flash does report reasonably honest low values on structurally hard cases such as "calculation involved" or "premises missing." Use it as a first-pass filter for routing, not as a perfect metric.
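
JSON mode constrains the format, not the values, so it can help to check the fields before routing on them. The following is a minimal sketch, assuming you would rather escalate than trust a malformed draft. normalizeDraft is a hypothetical helper, not part of the SDK; it would wrap the result of JSON.parse in draftWithConfidence.

```javascript
// normalize-draft.mjs (hypothetical helper)
// Turns the parsed Flash JSON into a draft with predictable field types.
function normalizeDraft(raw) {
  const obj = raw && typeof raw === "object" ? raw : {};
  const c = obj.confidence;
  // Only a real number inside [0, 1] is accepted as a confidence.
  const valid = typeof c === "number" && c >= 0 && c <= 1;
  return {
    answer: typeof obj.answer === "string" ? obj.answer : "",
    // Missing, non-numeric, or out-of-range values collapse to 0 so they escalate.
    confidence: valid ? c : 0,
    uncertain_reason: typeof obj.uncertain_reason === "string" ? obj.uncertain_reason : "",
  };
}

console.log(normalizeDraft({ answer: "42", confidence: 0.85, uncertain_reason: "" }));
console.log(normalizeDraft({ answer: "42", confidence: 90 })); // percent instead of 0-1
```

Collapsing bad values to 0 is a design choice that trades a few extra Deep Think calls for never shipping a draft whose confidence could not be read.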

Routing: let confidence and a threshold decide escalation

Whether to call stage two is decided mechanically by a confidence threshold. Since you will tune the threshold in operation, keep it in a named constant.

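// router.mjs
const ESCALATE_BELOW = 0.7;   // below this confidence, send to Deep Think
// substrings in uncertain_reason that force escalation even at high confidence
const RISKY = ["uncertain", "premise", "guess", "calculat", "latest"];

export function needsDeepThink(draft) {
  // a missing or non-numeric confidence counts as low; without this check
  // `undefined < 0.7` is false and a malformed draft would skip verification
  if (typeof draft.confidence !== "number" || Number.isNaN(draft.confidence)) return true;
  if (draft.confidence < ESCALATE_BELOW) return true;
  // even at high confidence, escalate if the reason field shows a red flag
  const reason = String(draft.uncertain_reason || "").toLowerCase();
  return RISKY.some(w => reason.includes(w));
}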

Pairing the threshold with keywords in uncertain_reason, rather than relying on the threshold alone, catches "overconfident wrong answers" where confidence comes out high. In my runs, a plain 0.7 threshold escalated about 20% of the total, and questions involving calculation or fresh information concentrated there.
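
When tuning, it helps to know which of the two rules fired for a given request. The sketch below is a hypothetical variant of the router that returns a label instead of a boolean; escalationReason and its label strings are my own naming, and the threshold and keyword list simply mirror the values used in router.mjs.

```javascript
// escalation-reason.mjs (hypothetical variant of the router, for logging)
const ESCALATE_BELOW = 0.7;
const RISKY = ["uncertain", "premise", "guess", "calculat", "latest"];

// Returns null when the draft can ship as-is, otherwise a short label
// saying which rule fired. A missing or non-numeric confidence is
// treated as low confidence.
function escalationReason(draft) {
  if (typeof draft.confidence !== "number" || !(draft.confidence >= ESCALATE_BELOW)) {
    return "low-confidence";
  }
  const reason = String(draft.uncertain_reason || "").toLowerCase();
  const hit = RISKY.find(w => reason.includes(w));
  return hit ? `keyword:${hit}` : null;
}

console.log(escalationReason({ confidence: 0.5, uncertain_reason: "" }));
console.log(escalationReason({ confidence: 0.95, uncertain_reason: "Requires calculation" }));
console.log(escalationReason({ confidence: 0.95, uncertain_reason: "" }));
```

Logging the label alongside each request shows whether escalations come mostly from the threshold or from the keyword list, which tells you which of the two to adjust.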

Stage two: ask Deep Think only to verify

What matters in stage two is having Deep Think verify whether Flash's answer is right, not re-answer from scratch. Narrowing it to a verification task keeps the output short, saves tokens, and yields a rationale for the decision.

// stage2-deepthink.mjs
import { GoogleGenerativeAI } from "@google/generative-ai";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
 
const deep = genAI.getGenerativeModel({ model: "gemini-3-deep-think" });
 
export async function verify(question, draft) {
  const prompt = `Question: ${question}
First answer: ${draft.answer}
Verify whether this first answer is correct and return JSON.
Shape: {"verdict":"ok"|"fix","final":string,"why":string}
If there is an error, put the corrected answer in final.`;
 
  const res = await deep.generateContent(prompt);
  return JSON.parse(res.response.text());
}
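
The stage-two listing calls JSON.parse on the raw text without requesting JSON mode, so a reply wrapped in a Markdown fence or preceded by a sentence would throw. Below is a minimal sketch of a more tolerant parser, assuming the reply contains exactly one JSON object. parseVerdict is a hypothetical helper, and returning null on failure is my own convention.

```javascript
// parse-verdict.mjs (hypothetical helper)
// Takes the span from the first "{" to the last "}" in the model text,
// which tolerates Markdown fences or a sentence before the JSON, then
// checks the fields the pipeline relies on. Returns null if anything is off.
function parseVerdict(text) {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  let parsed;
  try {
    parsed = JSON.parse(text.slice(start, end + 1));
  } catch {
    return null;
  }
  if (parsed.verdict !== "ok" && parsed.verdict !== "fix") return null;
  // A "fix" verdict is only usable if it carries the corrected answer.
  if (parsed.verdict === "fix" && typeof parsed.final !== "string") return null;
  return parsed;
}

console.log(parseVerdict('Sure: {"verdict":"ok","final":"","why":"matches"}'));
console.log(parseVerdict("I could not verify this."));
```

What to do with a null result is a policy decision: retrying once, or returning the Flash draft marked as unverified, are both reasonable options.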

Wiring the two: only the doubtful ones touch the expensive model

Finally, wire it together. Flash answers first, and only the ones routing flags get passed to Deep Think.

// pipeline.mjs
import { draftWithConfidence } from "./stage1-flash.mjs";
import { needsDeepThink } from "./router.mjs";
import { verify } from "./stage2-deepthink.mjs";
 
export async function answer(question) {
  const draft = await draftWithConfidence(question);
  if (!needsDeepThink(draft)) {
    return { answer: draft.answer, path: "flash-only" };
  }
  const checked = await verify(question, draft);
  return {
    answer: checked.verdict === "fix" ? checked.final : draft.answer,
    path: "escalated",
    note: checked.why,
  };
}

Keeping path in the result is so you can later tally "what share got escalated." If that ratio is higher than expected, the threshold is probably set too high and you are paying for verification you do not need; if it is lower than expected, wrong answers may be slipping through unverified. The cost-quality balance can be monitored with this one number.
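
Tallying that share takes only a few lines. This is a sketch assuming you collect the objects returned by answer() into an array; escalationRate is a hypothetical helper name.

```javascript
// path-stats.mjs (hypothetical helper)
// `results` is an array of objects shaped like the return value of answer():
// each has a `path` of "flash-only" or "escalated".
function escalationRate(results) {
  if (results.length === 0) return 0; // avoid dividing by zero on an empty log
  const escalated = results.filter(r => r.path === "escalated").length;
  return escalated / results.length;
}

const sample = [
  { answer: "a", path: "flash-only" },
  { answer: "b", path: "escalated", note: "recomputed the total" },
  { answer: "c", path: "flash-only" },
  { answer: "d", path: "flash-only" },
];
console.log(escalationRate(sample)); // share of requests that reached Deep Think
```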

How to set the threshold, and field notes

The surest way to set the initial threshold was to prepare 20–30 harder questions, hand-label the correct answers, and decide while watching both the escalation rate and the accuracy. Tuning straight on production traffic prolongs a state where you sacrifice either cost or quality.
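
One way to read such a labeled set is to sweep candidate thresholds and look at both numbers side by side. The sketch below assumes each test question has been run through Flash once and recorded as { confidence, flashCorrect }; the data is made up for illustration, sweepThresholds is a hypothetical helper, and it models only the threshold rule, not the keyword check.

```javascript
// threshold-sweep.mjs (hypothetical helper)
// `labeled` holds one entry per test question: Flash's self-reported confidence
// and whether its answer matched the hand-labeled correct answer.
function sweepThresholds(labeled, thresholds) {
  if (labeled.length === 0) return [];
  return thresholds.map(threshold => {
    // Drafts at or above the threshold ship without verification.
    const kept = labeled.filter(x => x.confidence >= threshold);
    // Wrong answers among them are the ones that slip through.
    const missedWrong = kept.filter(x => !x.flashCorrect).length;
    return {
      threshold,
      escalationRate: (labeled.length - kept.length) / labeled.length,
      missedWrong,
    };
  });
}

// Made-up sample data, for illustration only.
const labeled = [
  { confidence: 0.95, flashCorrect: true },
  { confidence: 0.9, flashCorrect: true },
  { confidence: 0.8, flashCorrect: false },
  { confidence: 0.6, flashCorrect: false },
  { confidence: 0.4, flashCorrect: true },
];
console.table(sweepThresholds(labeled, [0.5, 0.7, 0.9]));
```

Raising the threshold lowers missedWrong and raises escalationRate; the threshold to pick is the lowest one whose missedWrong count you can accept.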

What paid off in operation was logging Deep Think's verification results and later analyzing where Flash tends to err. If verdict: "fix" skews toward a particular category, it is cheaper in the end to reinforce the stage-one prompt there, or send that category straight to stage two. The two-stage setup is not fixed; I treat it as a base for continuously improving routing accuracy.
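
That analysis can be as simple as grouping the logged verdicts. This is a sketch under one assumption worth stating: the category field is hypothetical, a label you would attach to each request yourself, since the pipeline as written does not produce one. fixRateByCategory is likewise my own helper name.

```javascript
// fix-by-category.mjs (hypothetical helper)
// `logs` is an array of { category, verdict } records, one per escalated request.
function fixRateByCategory(logs) {
  const stats = {};
  for (const { category, verdict } of logs) {
    if (!stats[category]) stats[category] = { total: 0, fixed: 0 };
    stats[category].total += 1;
    if (verdict === "fix") stats[category].fixed += 1;
  }
  // Share of escalated requests in each category where Deep Think corrected Flash.
  for (const s of Object.values(stats)) s.fixRate = s.fixed / s.total;
  return stats;
}

// Made-up log entries, for illustration only.
const logs = [
  { category: "math", verdict: "fix" },
  { category: "math", verdict: "fix" },
  { category: "math", verdict: "ok" },
  { category: "dates", verdict: "ok" },
];
console.log(fixRateByCategory(logs));
```

A category with a high fixRate is a candidate for a stronger stage-one prompt or for skipping Flash entirely.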

Start with the minimal setup: a confidence threshold of 0.7 and the red-flag keywords only. Before routing everything through the heavy model, measure how much this two-stage approach saves, and you will get a feel for using Deep Think exactly when it counts. I hope it helps your implementation.
