API / SDK · 2026-06-14 · Advanced

How a Deep Think Verification Step Tripled My API Bill, and How thinking_level Got It Back

After wiring API-accessible Gemini 3 Deep Think into my output-verification step, my projected monthly cost jumped roughly 3x. Here is the implementation record of capping it with thinking_level and a cost guardrail, then settling on a two-stage design with Flash.



The week after I added Gemini 3 Deep Think to my verification step, my projected API bill was about three times its usual size.

As an indie developer, I automate article generation for four technical blogs (the Dolice Labs sites), and I run a pre-publish step that mechanically checks each draft for overstated claims and broken code examples. I used to have Flash do the grading. When the June 2026 update opened up partial API access to Deep Think, I simply swapped it in, hoping for sharper judgments.

The accuracy did improve. The cost increase, however, went well beyond what I expected. Deep Think runs a long internal reasoning pass before it answers, and those reasoning tokens hit your bill separately from input and output. Verification runs dozens of times a day, so the per-call difference compounds quickly.

This is the record of how I reined that in with thinking_level and a cost guardrail, and eventually landed on a two-stage design with Flash. The goal was to treat Deep Think as a smart-but-expensive tool and draw a clear line around where it actually earns its cost.

Why Deep Think verification gets expensive

For an ordinary model call, you can estimate cost roughly as "input tokens + output tokens." But a deep reasoning model like Deep Think unfolds a long chain of thought before writing its final answer. That thinking is computation, and it is billable.

Verification turned out to be a poor fit for this. A question like "does this article overstate anything?" looks, to Deep Think, like a problem worth pondering, so it digs in on its own. The output I need is just a short "OK" or "needs revision," yet the reasoning leading up to it balloons.

In other words, Deep Think's verification cost is not something you can infer from the short output. The hidden reasoning tokens are the main act, and unless you control them, your per-call price never stabilizes.
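To make that concrete, here is a small arithmetic sketch. The per-1K rates are placeholders, the token counts are made up for illustration, and it assumes thinking tokens are billed at the output rate; check the real price sheet before relying on any of it.

```python
# Placeholder per-1K-token rates, for illustration only (not real prices).
INPUT_RATE_PER_1K = 0.0003
OUTPUT_RATE_PER_1K = 0.0025  # assumption: thinking tokens are billed at this rate too


def call_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimate one call's cost in USD, counting thinking tokens at the output rate."""
    billable_output = output_tokens + thinking_tokens
    return (
        (input_tokens / 1000) * INPUT_RATE_PER_1K
        + (billable_output / 1000) * OUTPUT_RATE_PER_1K
    )


# Same hypothetical 2,000-token article and 60-token verdict in both cases.
naive = call_cost(input_tokens=2000, output_tokens=60)
actual = call_cost(input_tokens=2000, output_tokens=60, thinking_tokens=4000)

print(f"visible-only estimate: ${naive:.5f}")
print(f"with thinking tokens:  ${actual:.5f}")
print(f"underestimate factor:  {actual / naive:.1f}x")
```

With these invented numbers the visible-only estimate is off by more than 10x, which is the kind of gap a short verdict hides.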

Capping reasoning depth with thinking_level

The first thing that helped was putting a ceiling on the depth of reasoning itself. The Gemini 3 family lets you set how hard the model thinks via thinking_level. For a task like verification, where the correct answer is short and the criteria are clear, you do not need maximum deliberation.

from google import genai
from google.genai import types
 
# Reads GEMINI_API_KEY from the environment
client = genai.Client()
 
def verify_article(article_text: str) -> str:
    """Verifies an article and returns a short result containing JUDGE: OK / JUDGE: REVISE."""
    prompt = (
        "You are a copy editor doing fact-checking for technical articles. "
        "Decide whether the following article contains overstatement, broken code "
        "examples, or clear factual errors. Write 'JUDGE: OK' on the first line if "
        "it is fine, or 'JUDGE: REVISE' if it needs changes, then give the reason "
        "in at most three lines.\n\n---\n" + article_text
    )
 
    response = client.models.generate_content(
        model="gemini-3-deep-think",
        contents=prompt,
        config=types.GenerateContentConfig(
            # Verification does not need deep deliberation, so pin it to low
            thinking_config=types.ThinkingConfig(thinking_level="low"),
            # Keep the output short to avoid wasted long-form text
            max_output_tokens=200,
        ),
    )
    # response.text can be None when no text part comes back; keep the str contract
    return response.text or ""
 
print(verify_article("(article body here)"))
# Example expected output:
#   JUDGE: REVISE
#   The third code example calls client.generate(); it should be client.models.generate_content().

Leaving thinking_level at high was the single biggest cause of the cost blow-up. Dropping it to low barely changed verification accuracy, yet sharply reduced the thinking tokens per call. A design or math problem that genuinely warrants deep reasoning is nothing like a task that just returns a short verdict; the amount of thinking they need is completely different.

In this situation I prefer to make low the default and reserve heavier reasoning for specific "gray" cases, which I'll cover below.
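One way to check whether a thinking_level change is doing anything is to look at how much of each call's output-side usage was thinking. The sketch below is a hypothetical helper, not part of the SDK. It assumes usage objects expose candidates_token_count and thoughts_token_count, and it uses made-up stand-in values instead of real API responses.

```python
from types import SimpleNamespace


def thinking_share(usage) -> float:
    """Fraction of output-side tokens (visible output + thoughts) spent on thinking."""
    visible = getattr(usage, "candidates_token_count", None) or 0
    thoughts = getattr(usage, "thoughts_token_count", None) or 0
    total = visible + thoughts
    return thoughts / total if total else 0.0


# Stand-ins for response.usage_metadata; the token counts are invented.
usage_high = SimpleNamespace(
    prompt_token_count=2000, candidates_token_count=60, thoughts_token_count=4000
)
usage_low = SimpleNamespace(
    prompt_token_count=2000, candidates_token_count=60, thoughts_token_count=400
)

print(f"high: {thinking_share(usage_high):.0%} of output-side tokens were thinking")
print(f"low:  {thinking_share(usage_low):.0%} of output-side tokens were thinking")
```

Logging this ratio per call, alongside the verdict, makes it easier to see whether a lower setting actually shrinks the hidden portion for your own prompts.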


Trimming the context you hand to verification

The next win was making the input itself thinner. My first implementation passed the full writing guidelines and excerpts from past articles alongside the draft, on the assumption that more material would yield a more accurate judgment.

The opposite was true. The longer the context, the more Deep Think reads it, reasons about its relevance, and extends its thinking. All the judgment actually needed was the article body and a few lines of compressed criteria.

# Bad and heavy: shipping the full guidelines (thousands of tokens) every time
# (commented out so this snippet runs without the guideline file present)
# RULES_FULL = open("writing_guidelines_full.md", encoding="utf-8").read()

# Good and light: compress only the points that matter for judging
RULES_COMPACT = (
    "Check for: (1) exaggerated numbers or achievements "
    "(2) code examples that are broken or syntactically invalid "
    "(3) assertions that contradict the official spec"
)
 
def build_prompt(article_text: str) -> str:
    return f"{RULES_COMPACT}\n\nJudge the following article.\n---\n{article_text}"

Instead of handing over the rulebook verbatim, summarize just the points that drive the verdict into a few lines first. That alone lowered both the input tokens and the thinking tokens they drag along. In a verification context, I've found that narrowing the decision criteria beats adding information, for both accuracy and cost.
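A quick way to see the size difference before sending anything is a back-of-the-envelope comparison like the one below. It is only a sketch: the 4-characters-per-token figure is a crude heuristic rather than a tokenizer, build_prompt_with is a hypothetical variant of build_prompt, and the padded string merely stands in for a long guideline file.

```python
def rough_tokens(text: str) -> int:
    """Crude size estimate: roughly 4 characters per token. Not a real tokenizer."""
    return len(text) // 4


def build_prompt_with(rules: str, article_text: str) -> str:
    """Same prompt shape as build_prompt(), but with the rules passed in."""
    return f"{rules}\n\nJudge the following article.\n---\n{article_text}"


# Stand-ins: a padded string plays the role of the full guideline file.
full_rules_stub = "Guideline paragraph about tone, sourcing, and formatting. " * 200
compact_rules_stub = "Check for: exaggerated claims, broken code, contradictions with the spec"
article = "(article body here)"

full_size = rough_tokens(build_prompt_with(full_rules_stub, article))
compact_size = rough_tokens(build_prompt_with(compact_rules_stub, article))

print(f"full rules:    ~{full_size} tokens per call")
print(f"compact rules: ~{compact_size} tokens per call")
```

The input saving is only the visible part; the point of the section above is that the thinking tokens tend to shrink along with it.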

Implementing a cost-ceiling guardrail

thinking_level and a leaner input brought the per-call price down, but I still wanted a structural ceiling for when an unusually long article arrives or the call volume spikes. Verification runs automatically, so noticing a runaway at billing time is far too late.

So I added a guardrail that reads the tokens consumed per verification from usage_metadata and watches the running total against a daily budget.

from datetime import date

# Daily verification budget (USD). Past this, stop using Deep Think.
DAILY_BUDGET_USD = 2.0

# Approximate rates (replace with the real price sheet).
# Reasoning tokens are approximated at the output rate.
PRICE_PER_1K_INPUT = 0.0003
PRICE_PER_1K_OUTPUT = 0.0025

# build_prompt() does not specify an output format, so request the verdict line here
VERDICT_FORMAT = (
    "\n\nWrite 'JUDGE: OK' or 'JUDGE: REVISE' on the first line, "
    "then give the reason in at most three lines."
)

_spent_today = 0.0
_spent_date = date.today()

def estimate_cost(usage) -> float:
    if usage is None:
        return 0.0
    in_tok = usage.prompt_token_count or 0
    # Sum thinking tokens and output tokens, approximated at the output rate
    out_tok = (usage.candidates_token_count or 0) + (usage.thoughts_token_count or 0)
    return (in_tok / 1000) * PRICE_PER_1K_INPUT + (out_tok / 1000) * PRICE_PER_1K_OUTPUT

def verify_with_guardrail(article_text: str):
    global _spent_today, _spent_date
    # Reset the running total when the calendar day changes
    if date.today() != _spent_date:
        _spent_today, _spent_date = 0.0, date.today()
    if _spent_today >= DAILY_BUDGET_USD:
        # Over budget: skip Deep Think and fall back to cheaper Flash
        return flash_fallback(article_text), "fallback"

    response = client.models.generate_content(
        model="gemini-3-deep-think",
        contents=build_prompt(article_text) + VERDICT_FORMAT,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="low"),
            max_output_tokens=200,
        ),
    )
    _spent_today += estimate_cost(response.usage_metadata)
    return response.text, "deep-think"

def flash_fallback(article_text: str):
    r = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=build_prompt(article_text) + VERDICT_FORMAT,
        config=types.GenerateContentConfig(max_output_tokens=200),
    )
    return r.text

The key point here is to always include thoughts_token_count in the estimate. If you budget by output tokens alone, you'll badly underestimate Deep Think's real consumption and your guardrail won't do its job. Count the hidden tokens explicitly. That was the crux of getting Deep Think into production.
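A module-level running total is lost whenever the process restarts, which matters if verification runs from a scheduler. One possible way around that is to keep the total in a small file keyed by date. The DailyLedger class below is a hypothetical helper sketched under that assumption; it is not safe for concurrent writers and is not something the SDK provides.

```python
import json
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional


class DailyLedger:
    """Tracks estimated spend per calendar day in a small JSON file."""

    def __init__(self, path: Path, daily_budget_usd: float):
        self.path = path
        self.daily_budget_usd = daily_budget_usd

    def _load(self) -> dict:
        # A missing file simply means nothing has been spent yet.
        if self.path.exists():
            return json.loads(self.path.read_text(encoding="utf-8"))
        return {}

    def spent(self, day: Optional[date] = None) -> float:
        key = (day or date.today()).isoformat()
        return self._load().get(key, 0.0)

    def add(self, cost_usd: float, day: Optional[date] = None) -> float:
        """Add a cost to the given day's total and return the new total."""
        key = (day or date.today()).isoformat()
        data = self._load()
        data[key] = data.get(key, 0.0) + cost_usd
        self.path.write_text(json.dumps(data), encoding="utf-8")
        return data[key]

    def over_budget(self, day: Optional[date] = None) -> bool:
        return self.spent(day) >= self.daily_budget_usd


# Usage with a throwaway file and explicit dates so the example is repeatable.
ledger = DailyLedger(Path(tempfile.mkdtemp()) / "spend.json", daily_budget_usd=2.0)
ledger.add(1.5, day=date(2026, 6, 14))
ledger.add(0.6, day=date(2026, 6, 14))
print(ledger.over_budget(day=date(2026, 6, 14)))  # the day's total has passed the budget
print(ledger.over_budget(day=date(2026, 6, 15)))  # a new day starts from zero
```

Because totals are keyed by ISO date, the daily reset falls out of the data layout instead of needing a separate timer.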

Going two-stage: Flash first, Deep Think only for the gray cases

The biggest win, in the end, was to stop sending everything through Deep Think. Most of what I verify is straightforward enough that Flash can confidently return "OK." Deep Think only earned its keep on the gray cases where Flash couldn't commit to either "OK" or "REVISE."

So I had Flash return a verdict together with a confidence score, and only escalated to Deep Think when confidence was low.

import json
 
def first_pass_flash(article_text: str) -> dict:
    """First-pass judgment with Flash, returning verdict and confidence (0.0-1.0) as JSON."""
    prompt = (
        build_prompt(article_text)
        + "\n\nReturn the result as JSON. "
          '{"judge": "OK|REVISE", "confidence": 0.0-1.0}'
    )
    r = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            max_output_tokens=120,
        ),
    )
    return json.loads(r.text)
 
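def verify_two_stage(article_text: str):
    first = first_pass_flash(article_text)
    # High confidence: accept Flash's verdict, never call Deep Think
    # (a missing confidence or verdict is treated as a gray case)
    if first.get("confidence", 0.0) >= 0.8 and first.get("judge") in ("OK", "REVISE"):
        return first["judge"], "flash"
    # Escalate only the gray cases to Deep Think
    result, source = verify_with_guardrail(article_text)
    # An empty or unparseable reply counts as REVISE, the safe default
    judge = "OK" if "JUDGE: OK" in (result or "") else "REVISE"
    return judge, source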

With this in place, the share of drafts routed to Deep Think settled at around 20%. Use the expensive tool only on the genuinely hard cases where accuracy matters, and clear the rest fast with the cheap one. It sounds obvious, yet I'd initially reasoned "I want more accuracy, so run everything through Deep Think," and it took a small detour to see it.
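The routing decision itself can be isolated and tested without calling any model. The sketch below assumes the same JSON shape and 0.8 cutoff as the code above; needs_escalation is a hypothetical helper that treats anything malformed as a gray case instead of trusting it.

```python
import json

ESCALATION_THRESHOLD = 0.8  # same cutoff as the two-stage code above


def needs_escalation(raw_json, threshold: float = ESCALATION_THRESHOLD) -> bool:
    """Return True when the first-pass reply should go to the expensive model.

    Bad JSON, an unknown verdict, or a missing or out-of-range confidence
    all count as gray cases.
    """
    try:
        data = json.loads(raw_json)
    except (json.JSONDecodeError, TypeError):
        return True
    if not isinstance(data, dict) or data.get("judge") not in ("OK", "REVISE"):
        return True
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return True
    if not 0.0 <= confidence <= 1.0:
        return True
    return confidence < threshold


print(needs_escalation('{"judge": "OK", "confidence": 0.95}'))
print(needs_escalation('{"judge": "REVISE", "confidence": 0.55}'))
print(needs_escalation("not json at all"))
```

Erring toward escalation on malformed replies costs a little more, but it keeps a broken first pass from silently approving a draft.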

The two-stage idea itself is also covered in a separate piece, Generate With Flash, Escalate to Deep Think Only When Unsure. For deciding which workloads to move to a newer model first, Where to Adopt Gemini 3.5 Flash GA First is a useful companion.

Three pitfalls I hit

1. Forgetting to set thinking_level, so it ran at the default depth

If you call without passing thinking_config, the model thinks at its default depth. For a light task like verification, that tends to be excessive. Unless you set low explicitly, it "thinks where it doesn't need to," so suspect a missing setting first.

2. Leaving thinking tokens out of the cost estimate

For a while I computed daily cost from output tokens alone, ignoring thoughts_token_count in usage_metadata. That makes the guardrail too loose and diverges from the actual bill. If you use Deep Think, always add thinking tokens to the estimate.

3. Binding Deep Think too tightly to JSON output

Before going two-stage, I also required a strict JSON schema from Deep Think, but its deep reasoning trace collided with schema formatting and occasionally returned broken JSON. It was more stable to let Deep Think return a short verdict in plain text and leave machine-readable formatting to Flash.
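If the expensive model returns plain text, the parsing burden moves to your side. A substring check such as "JUDGE: OK" in result also matches when the phrase appears inside the reason, so a stricter first-line parse may be worth having. parse_verdict below is a hypothetical helper sketched for that purpose, with REVISE as the assumed safe default.

```python
import re
from typing import Optional

# Only accept the verdict at the start of a line, in either letter case.
_VERDICT_RE = re.compile(r"^\s*JUDGE:\s*(OK|REVISE)\b", re.IGNORECASE)


def parse_verdict(text: Optional[str]) -> str:
    """Read 'OK' or 'REVISE' from the first non-empty line; default to 'REVISE'."""
    for line in (text or "").splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        match = _VERDICT_RE.match(line)
        return match.group(1).upper() if match else "REVISE"
    return "REVISE"


print(parse_verdict("JUDGE: OK\nNo overstated claims found."))
print(parse_verdict("The draft claims JUDGE: OK but the code is broken."))
print(parse_verdict(None))
```

Defaulting to REVISE means an empty or garbled reply sends the draft back for a human look instead of letting it through.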

The numbers that actually moved

After these changes, the average cost per verification dropped to roughly a third. Three things contributed: lowering thinking_level to low, trimming the input context, and the two-stage design narrowing Deep Think calls to about 20% of the total. Accuracy also went up compared with the Flash-only setup, since the pipeline now catches the "plausible but wrong" code examples that Flash alone had been missing.
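The effect of the escalation rate on average cost is simple arithmetic: every draft pays for Flash, and only a fraction also pays for Deep Think. The sketch below uses invented per-call costs purely to show the shape of the calculation; they are not measurements from this pipeline.

```python
def blended_cost(flash_cost: float, deep_cost: float, escalation_rate: float) -> float:
    """Expected cost per verification when every draft hits Flash and a fraction escalates."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return flash_cost + escalation_rate * deep_cost


# Hypothetical per-call costs in USD, for illustration only.
FLASH_COST = 0.0004
DEEP_COST = 0.0040

all_deep = DEEP_COST
two_stage = blended_cost(FLASH_COST, DEEP_COST, escalation_rate=0.2)

print(f"all Deep Think: ${all_deep:.4f} per verification")
print(f"two-stage:      ${two_stage:.4f} per verification")
print(f"ratio:          {two_stage / all_deep:.2f}")
```

Plugging in your own measured per-call costs shows how far the escalation rate can drift before the two-stage design stops paying for itself.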

More valuable than the numbers was that the cost ceiling became predictable. With a daily budget and a guardrail, the bill no longer runs away as verification volume grows. Being able to run the automation without worry was the real payoff.

If you're about to fold Deep Think into verification, start by setting thinking_level="low" explicitly and writing one small guardrail that includes thoughts_token_count in its estimate. Just nailing that down changes how readable your costs become.
