Generate With Flash, Escalate to Deep Think Only When Unsure: A Two-Stage Pipeline

My first thought on hearing that Deep Think had been partly opened up on the API was: "Routing everything through it would raise quality, but the bill would stop being realistic." A heavy reasoning model is more accurate, but its per-token price and latency are roughly an order of magnitude above Flash's. For an indie developer running an automation pipeline every day, that gap lands directly on the end-of-month invoice.

So the design I landed on generates with cheap Flash and has Deep Think verify only when the output looks shaky. Calling heavy reasoning "only when needed" rather than "always" is how I aimed to raise the floor on accuracy without blowing up cost.

Why not route every request through the heavy model

Real-world requests are not uniform in difficulty. Most return a perfectly correct answer from Flash, and only a fraction are hard cases that warrant deliberation. Routing all of them through Deep Think means paying the high rate for the easy 90% too.

The two-stage idea is "answer first with the cheap model, send only the low-confidence ones to the expensive model." The key is to have Flash itself declare whether it is confident, and use that declaration as the mechanical basis for routing. Only once escalation runs without a human watching does it work as an automated pipeline.
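To put rough numbers on that argument, here is a back-of-the-envelope sketch. The prices are relative units I made up for illustration (Flash = 1, Deep Think = 10 per request), and the 20% escalation rate is an assumption, so substitute your own measured figures.

```javascript
// Rough cost model. All prices are hypothetical relative units, not real pricing.
function twoStageCost({ requests, flashCost, deepCost, escalationRate }) {
  // Every request pays for Flash; only the escalated share also pays for Deep Think.
  return requests * (flashCost + escalationRate * deepCost);
}

function allDeepCost({ requests, deepCost }) {
  // Baseline: every request goes straight to the heavy model.
  return requests * deepCost;
}

const costParams = { requests: 1000, flashCost: 1, deepCost: 10, escalationRate: 0.2 };
console.log("two-stage:", twoStageCost(costParams));
console.log("all Deep Think:", allDeepCost(costParams));
```

With these placeholder numbers the two-stage bill comes to about 3,000 units against 10,000 for routing everything through Deep Think. The saving shrinks as the escalation rate rises, and once escalationRate * deepCost exceeds deepCost - flashCost the two-stage setup costs more than the baseline.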

Stage one: have Flash emit an answer and a confidence together

The point is to have Flash's response include not just the "answer" but a structured "confidence" and "reason for doubt." Receiving JSON lets the downstream routing be written without any string parsing.

// stage1-flash.mjs
import { GoogleGenerativeAI } from "@google/generative-ai";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
 
const flash = genAI.getGenerativeModel({
  model: "gemini-3.5-flash",
  generationConfig: { responseMimeType: "application/json" },
});
 
export async function draftWithConfidence(question) {
  const prompt = `Answer the question and return JSON only.
Shape: {"answer": string, "confidence": number(0-1), "uncertain_reason": string}
confidence is your certainty in your answer. Declare it lower when the question is ambiguous, under-specified, or involves calculation.
Question: ${question}`;
 
  const res = await flash.generateContent(prompt);
  // JSON mode makes valid JSON likely but not guaranteed; JSON.parse throws on malformed output.
  return JSON.parse(res.response.text());
}

Specifying responseMimeType: "application/json" makes Flash less likely to mix in a non-JSON preamble. Self-reported confidence is not a cure-all, but Flash does tend to report reasonably honest low values for structurally hard cases such as "calculation involved" or "premises missing." Use it as a first-pass filter for routing, not as a perfect metric.
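Because the model's output is only probably well-formed, it may be worth normalizing it before routing. The following is a minimal sketch, not part of the SDK: parseDraft is a hypothetical helper name, and treating a broken reply as confidence 0 (so that it gets escalated) is my own assumption about the safest default.

```javascript
// Hypothetical helper: turn the raw stage-one text into a draft with a safe shape.
function parseDraft(rawText) {
  let data;
  try {
    data = JSON.parse(rawText);
  } catch {
    // Unparseable output: report zero confidence so the router escalates it.
    return { answer: "", confidence: 0, uncertain_reason: "invalid JSON from model" };
  }
  const c = data?.confidence;
  return {
    answer: typeof data?.answer === "string" ? data.answer : "",
    // Missing or non-numeric confidence becomes 0; numeric values are clamped to [0, 1].
    confidence: Number.isFinite(c) ? Math.min(1, Math.max(0, c)) : 0,
    uncertain_reason: typeof data?.uncertain_reason === "string" ? data.uncertain_reason : "",
  };
}
```

In draftWithConfidence this would replace the bare JSON.parse call, so that a malformed reply ends up in stage two instead of throwing.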

Routing: let confidence and a threshold decide escalation

Whether to call stage two is decided mechanically by a confidence threshold. Keep the threshold in a constant, since you will want to tune it in operation.

// router.mjs
const ESCALATE_BELOW = 0.7;   // below this confidence, send to Deep Think
 
export function needsDeepThink(draft) {
  // written as a negated >= so that a missing or NaN confidence also escalates
  if (!(draft.confidence >= ESCALATE_BELOW)) return true;
  // even at high confidence, escalate if the reason field shows a red flag
  const risky = ["uncertain", "premise", "guess", "calculat", "latest"];
  return risky.some(w => (draft.uncertain_reason || "").toLowerCase().includes(w));
}

The threshold is paired with keywords in uncertain_reason to catch "overconfident wrong answers," where confidence comes out high even though the answer is wrong. In my runs, a plain 0.7 threshold escalated about 20% of the total, and questions involving calculation or fresh information were concentrated there.
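When tuning, it helps to know which of the two rules sent a draft to stage two. Below is a sketch of a router variant that reports the trigger alongside the decision. explainRouting and the trigger labels are hypothetical names of mine; the threshold and keyword list are the ones from the router.

```javascript
const ESCALATE_BELOW = 0.7;
const RISKY = ["uncertain", "premise", "guess", "calculat", "latest"];

// Hypothetical variant of the router that also reports which rule fired.
function explainRouting(draft) {
  if (!(draft.confidence >= ESCALATE_BELOW)) {
    // Also true when confidence is missing or NaN.
    return { escalate: true, trigger: "low-confidence" };
  }
  const reason = (draft.uncertain_reason || "").toLowerCase();
  const hit = RISKY.find(w => reason.includes(w));
  return hit
    ? { escalate: true, trigger: `keyword:${hit}` }
    : { escalate: false, trigger: null };
}

console.log(explainRouting({ confidence: 0.95, uncertain_reason: "Requires a calculation" }));
```

Note that the keywords are matched as plain substrings, so "calculat" covers both "calculation" and "calculated", but a keyword can also match inside an unrelated word. If the keyword triggers dominate your logs, that is worth checking first.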

Stage two: ask Deep Think only to verify

What matters in stage two is having Deep Think verify whether Flash's answer is right, not re-answer from scratch. Narrowing it to a verification task keeps the output short, saves tokens, and yields a rationale for the decision.

// stage2-deepthink.mjs
import { GoogleGenerativeAI } from "@google/generative-ai";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
 
const deep = genAI.getGenerativeModel({ model: "gemini-3-deep-think" });
 
export async function verify(question, draft) {
  const prompt = `Question: ${question}
First answer: ${draft.answer}
Verify whether this first answer is correct and return JSON.
Shape: {"verdict":"ok"|"fix","final":string,"why":string}
If there is an error, put the corrected answer in final.`;
 
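  const res = await deep.generateContent(prompt);
  // No JSON mode is configured for this model here, so this parse throws if the reply is not bare JSON.
  return JSON.parse(res.response.text());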
}
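The verifier's reply deserves the same caution as the draft. Here is a minimal sketch of a fallback, assuming that an unusable verification should leave Flash's answer in place and be marked as such. resolveVerdict and the "unverified" label are hypothetical and not part of the shape the prompt asks for.

```javascript
// Hypothetical helper: combine the raw stage-two text with the draft into a safe result.
function resolveVerdict(rawText, draft) {
  let checked;
  try {
    checked = JSON.parse(rawText);
  } catch {
    // The verifier reply could not be parsed: keep the draft and flag it.
    return { answer: draft.answer, verdict: "unverified", why: "verifier reply was not valid JSON" };
  }
  // Accept a fix only if it actually carries a non-empty corrected answer.
  const usableFix =
    checked?.verdict === "fix" &&
    typeof checked.final === "string" &&
    checked.final.trim() !== "";
  return {
    answer: usableFix ? checked.final : draft.answer,
    verdict: usableFix ? "fix" : checked?.verdict === "ok" ? "ok" : "unverified",
    why: typeof checked?.why === "string" ? checked.why : "",
  };
}
```

Logging the "unverified" cases separately keeps them from being counted as confirmed answers when you later review how often stage two changes the result.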

Wiring the two: only the doubtful ones touch the expensive model

Finally, wire it together. Flash answers first, and only the drafts the router flags are passed to Deep Think.

// pipeline.mjs
import { draftWithConfidence } from "./stage1-flash.mjs";
import { needsDeepThink } from "./router.mjs";
import { verify } from "./stage2-deepthink.mjs";
 
export async function answer(question) {
  const draft = await draftWithConfidence(question);
  if (!needsDeepThink(draft)) {
    return { answer: draft.answer, path: "flash-only" };
  }
  const checked = await verify(question, draft);
  return {
    answer: checked.verdict === "fix" ? checked.final : draft.answer,
    path: "escalated",
    note: checked.why,
  };
}

path is kept in the result so you can later tally what share got escalated. If that ratio is higher than expected, the threshold is probably set too high and too many drafts count as low-confidence; if it is very low, you may be letting wrong answers through. This one number gives a quick read on the cost-quality balance.
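The tally itself is a few lines. This sketch assumes you have collected the objects returned by answer() into an array; escalationRate is a hypothetical helper name.

```javascript
// Share of results that went through stage two, based on the path field.
function escalationRate(results) {
  if (results.length === 0) return 0; // avoid dividing by zero on an empty log
  const escalated = results.filter(r => r.path === "escalated").length;
  return escalated / results.length;
}

// Made-up sample: four Flash-only results and one escalated result.
const sampleResults = [
  { path: "flash-only" },
  { path: "flash-only" },
  { path: "escalated" },
  { path: "flash-only" },
  { path: "flash-only" },
];
console.log(escalationRate(sampleResults));
```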

How to set the threshold, and field notes

The surest way to set the initial threshold was to prepare 20–30 harder questions, hand-label the correct answers, and decide while watching both the escalation rate and the accuracy. Tuning directly on production traffic means a longer stretch in which you are sacrificing either cost or quality.
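One way to run that comparison is to sweep candidate thresholds over the labeled set. The sketch below assumes each labeled item records Flash's self-reported confidence and whether its answer matched your hand label. The flashCorrect field, the sweepThresholds name, and the sample data are all made up for illustration, and the keyword rule is left out to keep the sweep simple.

```javascript
// For each candidate threshold, report how much gets escalated and how many
// wrong Flash answers would slip through without verification.
function sweepThresholds(labeled, thresholds) {
  const total = labeled.length || 1; // guard against an empty set
  return thresholds.map(t => {
    const escalated = labeled.filter(d => d.confidence < t).length;
    const missedErrors = labeled.filter(d => d.confidence >= t && !d.flashCorrect).length;
    return { threshold: t, escalationRate: escalated / total, missedErrors };
  });
}

// Made-up labeled items: confidence from Flash, correctness from hand labels.
const labeledSet = [
  { confidence: 0.95, flashCorrect: true },
  { confidence: 0.9, flashCorrect: true },
  { confidence: 0.8, flashCorrect: false },
  { confidence: 0.6, flashCorrect: false },
  { confidence: 0.4, flashCorrect: true },
];
console.table(sweepThresholds(labeledSet, [0.5, 0.7, 0.85]));
```

Raising the threshold trades a higher escalation rate for fewer missed errors. The row where missedErrors stops falling while escalationRate keeps climbing is a reasonable place to stop.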

What paid off in operation was logging Deep Think's verification results and later analyzing where Flash tends to err. If verdict: "fix" skews toward a particular category, it is cheaper in the end to reinforce the stage-one prompt there, or send that category straight to stage two. The two-stage setup is not fixed; I treat it as a base for continuously improving routing accuracy.
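The log analysis can be sketched as a simple group-by. This assumes each log entry carries a category label you assign yourself (the pipeline as written does not produce one) together with the verdict from stage two. fixRateByCategory and the sample entries are hypothetical.

```javascript
// Group verification logs by category and rank categories by how often
// Deep Think had to correct Flash.
function fixRateByCategory(logs) {
  const stats = new Map();
  for (const { category, verdict } of logs) {
    const s = stats.get(category) ?? { total: 0, fixes: 0 };
    s.total += 1;
    if (verdict === "fix") s.fixes += 1;
    stats.set(category, s);
  }
  return [...stats]
    .map(([category, s]) => ({ category, ...s, fixRate: s.fixes / s.total }))
    .sort((a, b) => b.fixRate - a.fixRate); // worst category first
}

// Made-up log entries for illustration.
const verifyLogs = [
  { category: "math", verdict: "fix" },
  { category: "math", verdict: "fix" },
  { category: "math", verdict: "ok" },
  { category: "lookup", verdict: "ok" },
];
console.table(fixRateByCategory(verifyLogs));
```

A category that sits at the top of this list run after run is the candidate for a stronger stage-one prompt or for skipping Flash altogether.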

Start with the minimal setup: a confidence threshold of 0.7 and the red-flag keywords only. Before routing everything through the heavy model, measure how much this two-stage approach saves, and you will get a feel for using Deep Think exactly when it counts. I hope it helps your implementation.