Dev Tools · 2026-06-20 · Intermediate

Routing Gemini by Pipeline Stage: Draft on Flash, Finish on the Top Tier

A record of reworking which Gemini model handles which stage of an automation pipeline, prompted by the general availability of Gemini 3.5 Flash and the rollout of 3.1 Flash-Lite. Includes a small router that splits work into draft, classify, and finalize stages, how the cost picture changes, and the guardrails I settled on.


When you run several blogs on autopilot, you call Gemini many times a day. Right after Gemini 3.5 Flash reached general availability (GA) and 3.1 Flash-Lite started spreading across products, I took the chance to audit which model I was assigning to which stage of my pipeline. What I noticed was that the naive setup—"just send everything to the smartest model"—carried more waste than I had assumed.

Here I walk through splitting a single job into three stages (draft, classify, and finalize) and assigning a different tier to each, together with the actual router code. For indie developer automation, dividing the work by role has proved more dependable than sending everything to the newest top model. That is my honest takeaway from running it.

Why I stopped using one model for everything

At first I handed everything to a single top tier without a second thought. Quality was stable, but two friction points surfaced as I kept the pipeline running.

The first was cost. I was sending not only article generation but also light judgments ("which category is this?", "give me five tag candidates") to the top tier, so calls billed at the highest per-token price made up most of the day's volume. Those judgments are accurate enough on a lighter tier such as Flash or Flash-Lite.

The second was latency. When even short tasks like classification incur the top tier's reasoning time, the downstream stages wait for it. Handing the lighter work to a faster model raises the throughput of the whole pipeline.

Conversely, the one thing I never want to compromise is the final polish—the quality of the text readers actually read. Dropping that stage to a lighter model to save money breaks quality exactly where it matters most. Since each stage asks for something different, it follows naturally that the models assigned to them should differ too.

Splitting the work into three stages

Here is the assignment I settled on. It is only one example, but it works as a starting point for your own split.

Stage | What it needs | Assigned tier | Why
Draft / idea generation | Speed and volume; rough is fine | Flash-Lite | It gets rewritten later, so run it cheap and fast
Classify / tag / format | Stable judgment in a fixed shape | Flash | Good at following structured output, modest price
Finalize / final edit | Quality and consistency of prose | Top tier (Pro class) | What readers read; never compromise here

The key is to decide up front which stages you may economize on and which you may not. Draft and classify can run on cheaper models, but judging the finalize stage by unit price alone leads to regret. The savings come mostly from the first two stages, so you can let the last one prioritize quality without worry.

A router that picks the model per stage

It helps to insert a small router that takes a stage name and returns the right model name and generation settings. The aim is to keep model names out of the scattered parts of your code and gather them in one place. When models swap out at GA or deprecation, this is the only spot you touch.

// stageRouter.js — assign a model and settings per stage
const STAGE_CONFIG = {
  draft:    { model: "gemini-3.1-flash-lite", temperature: 0.9, maxOutputTokens: 1200 },
  classify: { model: "gemini-3.5-flash",      temperature: 0.0, maxOutputTokens: 256  },
  finalize: { model: "gemini-3.5-pro",        temperature: 0.4, maxOutputTokens: 4000 },
};
 
// Rough unit prices (USD per million tokens; confirm actual values on the pricing page)
const PRICE_PER_MTOK = {
  "gemini-3.1-flash-lite": { in: 0.10, out: 0.40 },
  "gemini-3.5-flash":      { in: 0.30, out: 2.50 },
  "gemini-3.5-pro":        { in: 1.25, out: 10.0 },
};
 
export function pickStage(stage) {
  const cfg = STAGE_CONFIG[stage];
  if (!cfg) throw new Error(`Undefined stage: ${stage}`);
  return cfg;
}
 
export function estimateCost(stage, inTokens, outTokens) {
  const { model } = pickStage(stage);
  const p = PRICE_PER_MTOK[model];
  return (inTokens / 1e6) * p.in + (outTokens / 1e6) * p.out;
}

The call site only has to name the stage.

import { GoogleGenAI } from "@google/genai";
import { pickStage, estimateCost } from "./stageRouter.js";
 
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
 
async function runStage(stage, prompt) {
  const { model, temperature, maxOutputTokens } = pickStage(stage);
  const res = await ai.models.generateContent({
    model,
    contents: prompt,
    config: { temperature, maxOutputTokens },
  });
  const usage = res.usageMetadata ?? {};
  const cost = estimateCost(stage, usage.promptTokenCount ?? 0, usage.candidatesTokenCount ?? 0);
  console.log(`[${stage}] ${model} est. $${cost.toFixed(5)}`);
  return res.text;
}

With this in place, the code that writes the body no longer thinks about model names. You can express only the intent of the stage—runStage("draft", ...)—which makes the flow easy to follow when you reread it later.
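As one way to picture the call site, here is a minimal sketch that chains the three stages. The function name `runPipeline`, the prompt wording, and the stub replies are all hypothetical. `runStage` is passed in as an argument so the flow can be tried without calling the API.

```javascript
// pipeline.js: chain the three stages (illustrative sketch; names and prompts are assumptions)
// `runStage(stage, prompt)` is injected so a stub can stand in for the real API call.
async function runPipeline(runStage, topic) {
  // 1. Draft: cheap and fast, rough output is fine
  const draft = await runStage("draft", `Write a rough draft about: ${topic}`);

  // 2. Classify: ask for a fixed JSON shape so the reply can be parsed
  const raw = await runStage(
    "classify",
    `Return only JSON like {"category": string, "tags": string[]} for this draft:\n${draft}`
  );
  let meta;
  try {
    meta = JSON.parse(raw);
  } catch {
    // A malformed reply should not block the article; use empty metadata instead
    meta = { category: "uncategorized", tags: [] };
  }

  // 3. Finalize: the text readers actually read
  const article = await runStage("finalize", `Polish this draft for publication:\n${draft}`);
  return { article, meta };
}

// Usage with a stub in place of the real runStage (replies are made up)
const stubRunStage = async (stage, prompt) => {
  if (stage === "classify") return '{"category":"Dev Tools","tags":["routing"]}';
  return `[${stage}] ${prompt.slice(0, 20)}`;
};

runPipeline(stubRunStage, "model routing").then(({ article, meta }) => {
  console.log(meta.category, article);
});
```

In real use you would pass the `runStage` function from the call site above instead of the stub. Only the stage names appear in the flow; model names stay inside the router.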

How the cost picture changes

The first thing splitting by stage buys you is visibility into the cost breakdown. If you log estimateCost, biases show up clearly as numbers: "classification is 70% of the day's call count but under 10% of the spend."

Imagine a pipeline that runs 40 drafts, 120 classifications, and 40 finalizations a day. Classification happens often, but with short outputs and Flash's modest price, its contribution to the total stays small. Finalization, even at a lower count, becomes the center of the bill thanks to long outputs and the top tier's unit price. Once you can see this structure, you judge "where cutting helps and where it doesn't" from numbers rather than intuition.
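To make that structure concrete, here is a small worked sketch using the call counts above and the rough unit prices from the router. The per-call token counts are assumptions picked for illustration, not measurements, so substitute the numbers from your own logs.

```javascript
// Daily cost breakdown per stage (illustrative; token counts per call are assumptions)
const STAGES = {
  draft:    { price: { in: 0.10, out: 0.40 }, calls: 40,  inTokens: 500,  outTokens: 1200 },
  classify: { price: { in: 0.30, out: 2.50 }, calls: 120, inTokens: 800,  outTokens: 100 },
  finalize: { price: { in: 1.25, out: 10.0 }, calls: 40,  inTokens: 2000, outTokens: 3000 },
};

// Compute each stage's daily cost plus its share of calls and of spend
function dailyBreakdown(stages) {
  const rows = Object.entries(stages).map(([stage, s]) => {
    const perCall = (s.inTokens / 1e6) * s.price.in + (s.outTokens / 1e6) * s.price.out;
    return { stage, calls: s.calls, cost: perCall * s.calls };
  });
  const totalCalls = rows.reduce((n, r) => n + r.calls, 0);
  const totalCost = rows.reduce((n, r) => n + r.cost, 0);
  return rows.map((r) => ({
    ...r,
    callShare: r.calls / totalCalls,
    costShare: r.cost / totalCost,
  }));
}

for (const r of dailyBreakdown(STAGES)) {
  console.log(
    `${r.stage.padEnd(8)} calls ${(r.callShare * 100).toFixed(0)}%  ` +
    `cost $${r.cost.toFixed(4)} (${(r.costShare * 100).toFixed(0)}%)`
  );
}
```

Under these assumed token counts, classification is 60% of the calls but only about 4% of the daily spend of roughly $1.38, while finalization is 20% of the calls and about 94% of the spend.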

In my case, compared with the days when I sent every light judgment to the top tier, I cut the API cost noticeably without lowering output quality. The only stages I moved to cheaper tiers were draft and classify; I left the finalize stage that readers actually read completely untouched.

Two guardrails I settled on after running it

After a while, I added a couple of minimal guardrails to the router.

The first is to keep the fallback direction "up," not "down." When the finalize model is temporarily unresponsive, automatically dropping to a cheaper lower tier quietly breaks quality at the very stage you most want to protect. So instead of dropping on a finalize failure, I retry after a short wait, and if it still fails, I stop and log it. This encodes, in the router itself, the rule I set at the start: never sacrifice finalize quality for cost.
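A minimal sketch of that retry-then-stop behavior might look like the following. The function name `runFinalizeGuarded`, the retry count, and the wait time are assumptions, not values taken from my pipeline.

```javascript
// Guardrail sketch: finalize never drops to a cheaper tier (names and defaults are assumptions)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `callFinalize` is any async function that calls the finalize model,
// for example: () => runStage("finalize", prompt)
async function runFinalizeGuarded(callFinalize, { retries = 2, waitMs = 5000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await callFinalize(); // always the finalize model, never a lower tier
    } catch (err) {
      lastError = err;
      console.warn(`[finalize] attempt ${attempt + 1} failed: ${err.message}`);
      if (attempt < retries) await sleep(waitMs); // short wait before the next try
    }
  }
  // Stop and log rather than publishing text from a cheaper model
  console.error("[finalize] giving up; article held for a later run");
  throw lastError;
}
```

The point of the design is that there is no branch that swaps the model. The only outcomes are a result from the finalize tier or a logged failure.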

The second is to always write deprecation dates into the router's comments. Gemini swaps models quickly, and some—like the image preview models—have fixed shutdown dates. With the assignments gathered in one place, you only swap that one spot before the date arrives. When you run several sites in parallel, having "a single place to swap" is what keeps operations calm.
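One variation goes a step beyond comments and stores the date as data, so a startup check can warn before it arrives. The sketch below assumes a hypothetical `retireBy` field, and the date in it is a placeholder rather than a real shutdown date. Take actual dates from the official deprecation notices.

```javascript
// Sketch: keep shutdown dates next to the model names (the date below is a placeholder, not real)
const STAGE_CONFIG = {
  draft:    { model: "gemini-3.1-flash-lite", retireBy: null },         // no date recorded (assumed)
  classify: { model: "gemini-3.5-flash",      retireBy: null },         // no date recorded (assumed)
  finalize: { model: "gemini-3.5-pro",        retireBy: "2027-01-15" }, // placeholder date
};

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the stages whose model retires within `warnDays` of `now`
function stagesNearRetirement(config, now = new Date(), warnDays = 30) {
  return Object.entries(config)
    .filter(([, cfg]) => cfg.retireBy !== null)
    .map(([stage, cfg]) => ({
      stage,
      model: cfg.model,
      daysLeft: Math.ceil((new Date(cfg.retireBy) - now) / DAY_MS),
    }))
    .filter((row) => row.daysLeft <= warnDays);
}

// Run once at startup so the warning lands in the same log as the cost lines
for (const row of stagesNearRetirement(STAGE_CONFIG)) {
  console.warn(`[router] ${row.stage} uses ${row.model}, ${row.daysLeft} days until shutdown`);
}
```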

Stage-based routing is not a special mechanism. But simply putting "what each stage asks for" into words once, and leaving it in code as a router, makes both cost decisions and model swaps far steadier to handle. Start by finding the highest-call-count step in your own pipeline and asking whether it truly needs the top tier.
