When you run several blogs on autopilot, you call Gemini many times a day. Right after Gemini 3.5 Flash reached general availability (GA) and 3.1 Flash-Lite started spreading across products, I took the chance to audit which model I was assigning to which stage of my pipeline. What I noticed was that the naive setup—"just send everything to the smartest model"—carried more waste than I had assumed.
Here I walk through splitting a single job into three stages (draft, classify, and finalize) and assigning a different tier to each, together with the actual router code. My own takeaway is that for indie developer automation, dividing the work by role and letting it run unattended is more dependable than reaching for the newest top model for everything.
Why I stopped using one model for everything
At first I handed everything to a single top tier without a second thought. Quality was stable, but two friction points surfaced as I kept the pipeline running.
The first was cost. I was sending not only article generation but also light judgments ("which category is this?", "give me five tag candidates") to the top tier, so work billed at the highest per-token price made up most of the day's calls. Those judgments run accurately enough on a lighter Flash-class model.
The second was latency. When even short tasks like classification incur the top tier's reasoning time, the downstream stages wait for it. Handing the lighter work to a faster model raises the throughput of the whole pipeline.
Conversely, the one thing I never want to compromise is the final polish—the quality of the text readers actually read. Dropping that stage to a lighter model to save money breaks quality exactly where it matters most. Since each stage asks for something different, it follows naturally that the models assigned to them should differ too.
Splitting the work into three stages
Here is the assignment I settled on. It is only one example, but it is a useful starting point for the way of thinking.
| Stage | What it needs | Assigned tier | Why |
|---|---|---|---|
| Draft / idea generation | Speed and volume; rough is fine | Flash-Lite | It gets rewritten later, so run it cheap and fast |
| Classify / tag / format | Stable judgment in a fixed shape | Flash | Good at following structured output, modest price |
| Finalize / final edit | Quality and consistency of prose | Top tier (Pro class) | What readers read; never compromise here |
The key is to decide up front which stages you may economize on and which you may not. Draft and classify can run on cheaper models, but judging the finalize stage by unit price alone leads to regret. The savings come mostly from the first two stages, so you can let the last one prioritize quality without worry.
A router that picks the model per stage
It helps to insert a small router that takes a stage name and returns the right model name and generation settings. The aim is to keep model names out of the scattered parts of your code and gather them in one place. When models swap out at GA or deprecation, this is the only spot you touch.
```javascript
// stageRouter.js: assign a model and settings per stage
const STAGE_CONFIG = {
  draft:    { model: "gemini-3.1-flash-lite", temperature: 0.9, maxOutputTokens: 1200 },
  classify: { model: "gemini-3.5-flash",      temperature: 0.0, maxOutputTokens: 256 },
  finalize: { model: "gemini-3.5-pro",        temperature: 0.4, maxOutputTokens: 4000 },
};

// Rough unit prices (USD per million tokens; confirm actual values on the pricing page)
const PRICE_PER_MTOK = {
  "gemini-3.1-flash-lite": { in: 0.10, out: 0.40 },
  "gemini-3.5-flash":      { in: 0.30, out: 2.50 },
  "gemini-3.5-pro":        { in: 1.25, out: 10.0 },
};

export function pickStage(stage) {
  const cfg = STAGE_CONFIG[stage];
  if (!cfg) throw new Error(`Undefined stage: ${stage}`);
  return cfg;
}

export function estimateCost(stage, inTokens, outTokens) {
  const { model } = pickStage(stage);
  const p = PRICE_PER_MTOK[model];
  return (inTokens / 1e6) * p.in + (outTokens / 1e6) * p.out;
}
```
The call site only has to name the stage.
```javascript
import { GoogleGenAI } from "@google/genai";
import { pickStage, estimateCost } from "./stageRouter.js";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function runStage(stage, prompt) {
  const { model, temperature, maxOutputTokens } = pickStage(stage);
  const res = await ai.models.generateContent({
    model,
    contents: prompt,
    config: { temperature, maxOutputTokens },
  });
  const usage = res.usageMetadata ?? {};
  const cost = estimateCost(stage, usage.promptTokenCount ?? 0, usage.candidatesTokenCount ?? 0);
  console.log(`[${stage}] ${model} est. $${cost.toFixed(5)}`);
  return res.text;
}
```
With this in place, the code that writes the body no longer thinks about model names. You express only the intent of the stage, as in runStage("draft", ...), which makes the flow easy to follow when you reread it later.
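To make the "name only the stage" idea concrete, here is a minimal sketch of the three stages chained together. The `runPipeline` name, the prompt wording, and the choice to pass the stage runner in as a parameter are assumptions for illustration, not part of the pipeline described here. In practice you would pass in a function shaped like the runStage above.

```javascript
// Hypothetical sketch: chain the three stages through any runStage-like function.
// `runStageFn(stage, prompt)` is assumed to return a Promise that resolves to text.
async function runPipeline(runStageFn, topic) {
  // Draft: cheap and fast; rough output is fine because it gets rewritten.
  const draft = await runStageFn("draft", `Write a rough draft about: ${topic}`);

  // Classify: short, fixed-shape output (here, a comma-separated tag list).
  const tagText = await runStageFn(
    "classify",
    `Return up to five comma-separated tags for this text:\n${draft}`
  );
  const tags = tagText
    .split(",")
    .map((t) => t.trim())
    .filter(Boolean);

  // Finalize: the only stage readers see, so it runs on the top tier.
  const article = await runStageFn(
    "finalize",
    `Polish this draft for publication:\n${draft}`
  );

  return { article, tags };
}
```

Passing the runner in as an argument also lets you exercise the flow with a stub, without spending tokens.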
How the cost picture changes
The first thing splitting by stage buys you is visibility into the cost breakdown. If you log estimateCost, biases show up clearly as numbers: "classification is 70% of the day's call count but under 10% of the spend."
Imagine a pipeline that runs 40 drafts, 120 classifications, and 40 finalizations a day. Classification happens often, but with short outputs and Flash's modest price, its contribution to the total stays small. Finalization, even at a lower count, becomes the center of the bill thanks to long outputs and the top tier's unit price. Once you can see this structure, you judge "where cutting helps and where it doesn't" from numbers rather than intuition.
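The arithmetic behind that picture can be sketched in a few lines. The call counts come from the example above and the unit prices are the rough figures from the router. The per-call token sizes are invented purely for illustration, so substitute numbers from your own logs.

```javascript
// Rough unit prices copied from the router (USD per million tokens; confirm on the pricing page).
const PRICES = {
  draft:    { in: 0.10, out: 0.40 }, // Flash-Lite
  classify: { in: 0.30, out: 2.50 }, // Flash
  finalize: { in: 1.25, out: 10.0 }, // Pro class
};

// ASSUMED daily workload: call counts from the text, token sizes made up for illustration.
const WORKLOAD = {
  draft:    { calls: 40,  inTokens: 500,  outTokens: 1000 },
  classify: { calls: 120, inTokens: 800,  outTokens: 50 },
  finalize: { calls: 40,  inTokens: 1500, outTokens: 3000 },
};

// Compute each stage's daily cost plus its share of calls and of spend.
function dailyBreakdown(prices, workload) {
  const rows = Object.entries(workload).map(([stage, w]) => {
    const p = prices[stage];
    const perCall = (w.inTokens / 1e6) * p.in + (w.outTokens / 1e6) * p.out;
    return { stage, calls: w.calls, cost: perCall * w.calls };
  });
  const totalCalls = rows.reduce((sum, r) => sum + r.calls, 0);
  const totalCost = rows.reduce((sum, r) => sum + r.cost, 0);
  return rows.map((r) => ({
    ...r,
    callShare: r.calls / totalCalls,
    costShare: r.cost / totalCost,
  }));
}

for (const r of dailyBreakdown(PRICES, WORKLOAD)) {
  console.log(
    `${r.stage}: ${(r.callShare * 100).toFixed(0)}% of calls, ` +
      `${(r.costShare * 100).toFixed(1)}% of spend ($${r.cost.toFixed(4)})`
  );
}
```

Under these assumed token sizes, classification is 60% of the calls but only a few percent of the spend, while finalization carries almost all of the bill. The exact split will move with your real token counts, which is why logging them matters.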
In my case, compared with the days when I sent every light judgment to the top tier, API cost dropped noticeably with no loss in output quality. The only stages I moved to cheaper tiers were draft and classify; the finalize stage that readers actually read stayed completely untouched.
Two guardrails I settled on after running it
After a while, I added a couple of minimal guardrails to the router.
The first is that a fallback may go "up" but never "down." When the finalize model is temporarily unresponsive, automatically dropping to a cheaper lower tier quietly breaks quality at the very stage you most want to protect. So on a finalize failure I do not switch models: I retry the same model after a short wait, and if it still fails, I stop the run and log it. This encodes, in the router itself, the rule I set at the start: never sacrifice finalize quality for cost.
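A minimal sketch of that retry-then-stop behavior might look like the following. The helper name `runWithRetryThenStop`, the default of three attempts, and the two-second wait are assumptions for illustration, not values from my pipeline. `callFn` stands in for whatever async function performs the finalize call.

```javascript
// Hypothetical guardrail: retry the same model, then stop. Never swap to a cheaper tier.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runWithRetryThenStop(
  stage,
  callFn, // async function that performs the model call for this stage
  { maxAttempts = 3, waitMs = 2000, log = console.error } = {}
) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callFn();
    } catch (err) {
      lastError = err;
      log(`[${stage}] attempt ${attempt}/${maxAttempts} failed: ${err.message}`);
      // Wait before the next attempt, but not after the last one.
      if (attempt < maxAttempts) await sleep(waitMs);
    }
  }
  // No downward fallback: surface the failure so the run halts and shows up in logs.
  throw new Error(
    `[${stage}] gave up after ${maxAttempts} attempts: ${lastError?.message}`
  );
}
```

A call site could wrap the finalize call as `runWithRetryThenStop("finalize", () => runStage("finalize", prompt))` and let the thrown error end the run.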
The second is to always write deprecation dates into the router's comments. Gemini swaps models quickly, and some—like the image preview models—have fixed shutdown dates. With the assignments gathered in one place, you only swap that one spot before the date arrives. When you run several sites in parallel, having "a single place to swap" is what keeps operations calm.
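One variation on writing the dates in comments is to store them as data next to each assignment, so the pipeline can warn you itself. The sketch below assumes a hypothetical `retiresOn` field and a 30-day warning window. The date shown is a placeholder, not a real shutdown date; take real ones from the official deprecation notices.

```javascript
// Hypothetical sketch: keep the shutdown date next to the model assignment.
// Dates here are PLACEHOLDERS, not real deprecation dates; null means none announced.
const STAGE_MODELS = {
  draft:    { model: "gemini-3.1-flash-lite", retiresOn: null },
  classify: { model: "gemini-3.5-flash",      retiresOn: "2099-01-31" }, // placeholder
  finalize: { model: "gemini-3.5-pro",        retiresOn: null },
};

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the stages whose model retires within `warnDays` of `now`, or already has.
function findExpiringStages(stageModels, now = new Date(), warnDays = 30) {
  return Object.entries(stageModels)
    .filter(([, cfg]) => cfg.retiresOn !== null)
    .map(([stage, cfg]) => ({
      stage,
      model: cfg.model,
      daysLeft: Math.ceil(
        (new Date(cfg.retiresOn).getTime() - now.getTime()) / DAY_MS
      ),
    }))
    .filter((entry) => entry.daysLeft <= warnDays);
}

// Example: log a warning at pipeline start for anything close to shutdown.
for (const e of findExpiringStages(STAGE_MODELS)) {
  console.warn(`[${e.stage}] ${e.model} retires in ${e.daysLeft} day(s)`);
}
```

Running this check once at the start of each run keeps the "single place to swap" honest, because a forgotten date turns into a visible warning instead of a surprise outage.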
Stage-based routing is not a special mechanism. But simply putting "what each stage asks for" into words once, and leaving it in code as a router, makes both cost decisions and model swaps far steadier to handle. Start by finding the highest-call-count step in your own pipeline and asking whether it truly needs the top tier.