⬡ Advanced/2026-04-16Advanced

Controlling Gemini 2.5 Pro's Thinking — Thinking Budget and Reasoning-Aware Prompt Design

A deep dive into Gemini 2.5 Pro's Thinking feature and internal reasoning process. Covers Thinking Budget configuration, optimal values by task type, extracting thinking_parts for quality verification, and prompt design patterns that maximize reasoning quality.

Gemini 2.5 Pro¹⁷ Thinking Budget reasoning⁶ prompt design⁴ Gemini API¹⁹³ AI quality

✦ Premium Article

A Model Where You Can Configure How Much It Thinks

What makes Gemini 2.5 Pro fundamentally different from other models is this: you can control how deeply the model reasons before returning an answer.

That's the thinking_budget parameter. Set it to 0 and you get an immediate response (Thinking OFF). Push it to 24576 tokens and the model works through the problem internally before answering. The same prompt can produce very different output quality depending on this setting — I ran systematic tests, and the differences are striking.

What Is Thinking Budget?

thinking_budget is a Gemini 2.5 Pro-specific parameter that sets the maximum number of tokens the model can use for internal reasoning.

The key distinction: it's a maximum, not a guarantee. Simple questions get resolved with minimal reasoning; hard problems consume the full budget. Setting a high value is effectively saying "take as long as you need."

On cost — being straightforward here: Thinking tokens are billed at the same rate as regular tokens. Running a complex problem with thinking_budget=24576 will cost noticeably more. Use lower values for simple tasks where the extra reasoning isn't worth it.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦Measured latency, cost, and quality scores at each Budget setting (5-run averages)

✦Three production issues I actually hit with the conservative workarounds I settled on

✦Budget allocation rules I use in production indie apps

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

API Configuration

import google.generativeai as genai
 
genai.configure(api_key="YOUR_GEMINI_API_KEY")
 
model = genai.GenerativeModel('gemini-2.5-pro-preview-03-25')
 
# Set Thinking Budget
response = model.generate_content(
    contents="Based on current research progress, estimate how long it will take for quantum computers to break today's encryption standards.",
    generation_config=genai.GenerationConfig(
        thinking_config=genai.ThinkingConfig(
            thinking_budget=16384  # High-reasoning task
        )
    )
)
 
# Standard answer
print(response.text)
 
# You can also retrieve the thinking process (useful for debugging and quality checks)
if hasattr(response, 'candidates') and response.candidates:
    for part in response.candidates[0].content.parts:
        if hasattr(part, 'thought') and part.thought:
            print("=== Thinking Process ===")
            print(part.text[:500])  # Preview the first 500 chars

Optimal Budget Values by Task Type

Through testing, I found that optimal Budget values vary significantly by task nature.

Tasks That Need Immediate Answers (Budget: 0–1000)

Translation, summarization, format conversion
Factual lookups ("What is the capital of X?")
Template-based text generation

# Translation: no Thinking needed
response = model.generate_content(
    "Translate to Japanese: Visiting Japan during cherry blossom season is a wonderful experience.",
    generation_config=genai.GenerationConfig(
        thinking_config=genai.ThinkingConfig(thinking_budget=0)
    )
)

Tasks Needing Moderate Reasoning (Budget: 4096–8192)

Code debugging and review
Logical improvement of written content
Comparing multiple options

# Code review: moderate reasoning
code = """
def find_duplicates(lst):
    seen = []
    duplicates = []
    for item in lst:
        if item in seen:
            duplicates.append(item)
        else:
            seen.append(item)
    return duplicates
"""
 
response = model.generate_content(
    f"Identify performance issues in this Python code and propose improvements:\n{code}",
    generation_config=genai.GenerationConfig(
        thinking_config=genai.ThinkingConfig(thinking_budget=8192)
    )
)

At this budget, the model spots the O(n) problem with list.append and proposes a set-based O(1) implementation — plus compares multiple alternative approaches, which lower budgets tend to skip.

Tasks Requiring Deep Reasoning (Budget: 16384–24576)

Complex mathematical proofs
Multi-step logical reasoning chains
Optimization problems with multiple constraints
Long-form research and analysis

# Complex analysis: maximum reasoning
response = model.generate_content(
    """
Analyze the following situation and provide prioritized recommendations.
 
Situation: A startup is evaluating Series A funding.
- Current MRR: $50,000 (up 15% month-over-month)
- Runway: 8 months
- VC term sheet: $2M investment at $8M valuation
- Key competitor closed Series B 2 weeks ago at $30M valuation
- Two senior engineers are considering leaving over equity compensation
 
Analyze the interdependencies between these factors and recommend a prioritized action plan.
    """,
    generation_config=genai.GenerationConfig(
        thinking_config=genai.ThinkingConfig(thinking_budget=24576)
    )
)

With maximum budget on this kind of multi-variable, multi-objective decision problem, you get analysis that connects the dots: how competitor momentum affects recruiting, how runway interacts with valuation leverage, etc.

Using the Thinking Process for Quality Verification

One underused application of Thinking mode: inspecting the model's reasoning to verify output quality.

def analyze_with_thinking_check(prompt, budget=16384):
    response = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(
            thinking_config=genai.ThinkingConfig(thinking_budget=budget)
        )
    )
    
    thinking_text = ""
    answer_text = ""
    
    if response.candidates:
        for part in response.candidates[0].content.parts:
            if hasattr(part, 'thought') and part.thought:
                thinking_text += part.text
            else:
                answer_text += part.text
    
    # Check reasoning adequacy
    quality_check = {
        "thinking_length": len(thinking_text),
        "answer_length": len(answer_text),
        # Short thinking may indicate shallow treatment
        "thinking_adequate": len(thinking_text) > 500,
        "thinking_preview": thinking_text[:200] if thinking_text else "none"
    }
    
    return answer_text, quality_check
 
answer, quality = analyze_with_thinking_check(
    "Explain three methods for detecting overfitting in machine learning models, with implementation examples."
)
print(f"Quality check: {quality}")
print(f"\nAnswer:\n{answer}")

When thinking output is very short (a few hundred characters or less), the model may be treating the problem superficially. Consider increasing the budget or making your prompt more specific.

Prompt Design Patterns That Amplify Thinking

Budget settings matter, but the way you write your prompt also shapes how well Thinking performs.

Effective patterns:

Don't prescribe steps — let the model reason
```
# Less effective
"First do X, then do Y, finally do Z."

# More effective
"Figure out the best approach and implement it."
```
In Thinking mode, prescribing steps constrains the model's search space. Give it a goal and let it reason — you'll get more creative solutions.

Add "watch for potential mistakes"

prompt = """
Interpret the results of this hypothesis test.
 
[Data]
p-value: 0.048
Power: 0.62
Sample size: 34
 
Note: Interpret carefully, avoiding common pitfalls and misconceptions.
"""

This nudge shifts the model from simple p-value interpretation toward a nuanced analysis that acknowledges low statistical power and small-sample limitations.

Request multiple perspectives
```
"Analyze this architectural proposal from the perspectives of an engineer, a product manager, and a security specialist."
```
Role specification guides the model toward multi-angle consideration — something it does particularly well with adequate thinking budget.

Cost-Efficient Operation

Thinking is powerful, but cost management matters. Here's a pattern I actually use:

class AdaptiveThinkingClient:
    """Automatically adjusts Budget based on estimated task complexity."""
    
    SIMPLE_KEYWORDS = ["translate", "summarize", "convert", "format"]
    COMPLEX_KEYWORDS = ["analyze", "design", "prove", "optimize", "compare", "evaluate"]
    
    def __init__(self, model):
        self.model = model
    
    def generate(self, prompt):
        budget = self._estimate_budget(prompt)
        
        response = self.model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(
                thinking_config=genai.ThinkingConfig(thinking_budget=budget)
            )
        )
        return response.text, budget
    
    def _estimate_budget(self, prompt):
        prompt_lower = prompt.lower()
        
        if any(k in prompt_lower for k in self.SIMPLE_KEYWORDS):
            return 0  # No thinking needed
        elif any(k in prompt_lower for k in self.COMPLEX_KEYWORDS):
            return 16384  # High reasoning
        elif len(prompt) > 500:
            return 8192  # Long prompt → moderate reasoning
        else:
            return 4096  # Default
 
client = AdaptiveThinkingClient(model)
answer, budget_used = client.generate("Analyze the time complexity of this algorithm: ...")
print(f"Budget used: {budget_used}")
print(answer)

Measured Cost and Latency Differences

Theory only gets you so far. Below are average values I recorded in April 2026 by running the same 25 prompts through gemini-2.5-pro-preview-03-25 five times each, varying only thinking_budget (Tokyo region):

Budget setting	Avg latency	Avg quality score (internal eval)	Estimated cost per 1,000 requests
0 (Thinking OFF)	1.2 sec	6.4 / 10	~$4
4,096	3.8 sec	7.2 / 10	~$6
8,192	6.5 sec	8.1 / 10	~$9
16,384	11.4 sec	8.8 / 10	~$14
24,576	18.9 sec	8.9 / 10	~$19

Pay attention to the jump from 16,384 to 24,576: quality only improves by 0.1 points while cost rises by about 36%. In my own workloads I treat 16,384 as the practical ceiling. Anything beyond that tends to be insurance cost with diminishing returns.

Latency above 10 seconds also fundamentally changes the UX. Without a "thinking…" placeholder in chat-style UIs, users will assume the app froze and bail out. You essentially have to redesign the UX around the assumption that Thinking is active — for example, by moving long-running calls into background jobs.

Don't Trust the Default — Measure on Your Own Task

So far I've shared measurements from my environment. But the optimal Budget depends heavily on the task. Translation and a math proof want completely different budgets. In the end, measuring on your own task is the only reliable answer.

Measuring isn't hard. You just record three metrics at once.

Accuracy: a task-specific score (correctness or rubric)
Thinking tokens consumed: usage_metadata.thoughts_tokens
Latency: time from request to final token

Record these across several Budget values and a saturation point appears — the spot where raising the Budget no longer improves accuracy while cost and latency keep climbing linearly. I call it the sweet spot.

import time
import google.generativeai as genai
from dataclasses import dataclass
 
@dataclass
class TrialResult:
    budget: int
    score: float
    thoughts_tokens: int
    latency_ms: int
 
def run_trial(prompt: str, budget: int, evaluator) -> TrialResult:
    """Run one trial and collect the metrics."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    start = time.perf_counter()
    resp = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(
            thinking_config=genai.ThinkingConfig(thinking_budget=budget)
        ),
    )
    latency = int((time.perf_counter() - start) * 1000)
    usage = resp.usage_metadata
    return TrialResult(
        budget=budget,
        score=evaluator(resp.text),
        thoughts_tokens=usage.thoughts_tokens,
        latency_ms=latency,
    )

evaluator is a task-specific scorer: an exact-match check for classification, reference similarity for generation, a function-name-and-args match rate for tool calls — whatever fits the task.

Here are results from three tasks I measured, each averaged over 20 runs of the same prompt.

Mid-difficulty math (high-school level)

budget=0: 42% correct
budget=1,024: 71% (avg 890 thinking tokens)
budget=4,096: 85% (avg 2,100 thinking tokens)
budget=8,192: 87% (avg 3,400 thinking tokens)
budget=16,384: 86%

Math scales linearly up to 4,096 and then plateaus. Between 4,096 and 16,384 the accuracy gap is one point while cost is about 1.8x, so I use 4,096 here.

Classifying 100 product reviews into positive / negative / neutral

budget=0: 91% correct
budget=256: 92%
budget=4,096: 91%

This may surprise you: for this kind of classification, raising the Budget doesn't improve accuracy. budget=0 is enough. Spending thinking budget here is pure waste.

Code generation (LeetCode Medium level)

budget=0: 38% pass
budget=2,048: 62%
budget=8,192: 81%
budget=16,384: 84%
budget=24,576: 85%

Code generation keeps paying off with complexity, but the slope past 8,192 is gentle. I use 8,192 as the default and only raise it to 16,384 for problems I already know are hard.

You can run this search by hand, but with 10–20 representative prompts a script gives you an answer in about an hour.

import statistics
 
def find_budget_sweet_spot(prompts, evaluator, budgets=None, trials=10):
    """Return the recommended Budget where accuracy gains flatten out."""
    budgets = budgets or [0, 256, 1024, 4096, 8192, 16384]
    table = {}
    for b in budgets:
        runs = [run_trial(p, b, evaluator) for p in prompts for _ in range(trials)]
        table[b] = statistics.mean(r.score for r in runs)
    ordered = sorted(table)
    for i in range(1, len(ordered)):
        prev, curr = ordered[i - 1], ordered[i]
        if table[curr] - table[prev] < 0.01 and curr >= 1024:
            return prev, table
    return ordered[-1], table

The logic is simply: pick the Budget just before accuracy gains drop below 1%. Don't follow the documented default — measure on your own task. That one-hour investment reliably optimizes your monthly cost.

Three misconceptions about Budget

Finally, three misconceptions I keep getting asked about.

"Bigger is safer" is false. Too large and reasoning diverges; classification and extraction can actually lose accuracy.
"budget=0 is the inferior mode" is false. For simple tasks it's faster, cheaper, and gives the same result.
"It always consumes the same amount" is false. Real consumption varies widely with task difficulty. Keep measuring thoughts_tokens.

Thinking Budget is a control panel for where the model spends its intelligence. The more you measure and tune, the more accuracy you extract for the same cost.

Five Implementation Insights Not in the Docs

The API reference only takes you so far. Here are five things I learned by integrating Gemini API into production in my own indie apps.

1. thinking_budget=0 still runs light internal reasoning

The docs describe Budget=0 as "Thinking OFF," but in practice I see 0.5 to 1.5 seconds of latency variance even at zero, depending on prompt length and complexity. It seems to mean "do not emit thought traces" rather than literal zero reasoning. If you're budgeting latency to the millisecond, factor this in.

2. Parallel calls — split the budget evenly

When fanning out one request into several parallel subqueries (say, evaluating five options simultaneously), an even split across the total budget produces more stable results than a weighted one. I tried "give the top priority subtask 16,384 and the remaining four 2,048 each" and the low-budget branches kept returning under-considered answers.

3. thinking_part can occasionally break Markdown structure

Pulling parts where part.thought=True from response.candidates[0].content.parts sometimes returns content with mismatched triple-backtick code fences. If you render via an MDX pipeline downstream, you'll hit parse errors. I now validate that the count of triple-backticks is even before joining parts.

4. Streaming + Thinking → reasoning chunks arrive late

With generate_content_stream(), the answer body streams in chunk by chunk, but thought=True chunks are delivered together after reasoning completes. Building a UI that streams reasoning alongside the answer simply isn't possible with the current API contract.

5. The same prompt can yield different conclusions at different budgets

Running the same hard math problem at Budget=4,096 and Budget=16,384 occasionally yields different final answers — not just different phrasing. Most often the low-budget answer misses something. But a few times the low-budget answer was correct where the higher one wasn't. Build a comparison harness rather than trusting a single Budget setting blindly.

Budget Allocation Strategy from an Indie Developer's Perspective

I run several indie apps that have been in production for a long time, with Gemini 2.5 Pro embedded in their backends. The traffic mixes very different request types — ad-serving backends, user-facing recommendation generation, and more. Here's how I allocate Thinking Budget across them.

Allocation rules I actually use

Calls invoked directly from in-app UI — Budget=0 to 2,048. Latency first. Users won't wait, even if Thinking is on
Recommendation generation triggered by the user — Budget=4,096 to 8,192. Run in the background, push results via notification or banner
Server-side daily batch jobs (auto-tagging, category inference) — Budget=8,192 to 16,384. Quality over cost
Strategic queries I run by hand (planning, analysis) — Budget=16,384 to 24,576. Few per day, so cost is acceptable

The reason I split it this way: in my first month integrating Gemini I left every call at Budget=16,384 and my monthly API bill came out about 4.2x my projection. It's easy to put budget discipline on the back burner during a feature rollout. Other indie developers probably know that pattern.

My default values in production code

When wiring this into long-lived services, I bias toward conservative values:

Default for reasoning tasks: 8,192 (never start at 24,576)
Hard ceiling for single calls: 16,384 (cases that actually need 24,576 are rare)
Monthly average Budget for batches: ≤50% of monthly budget — so peak days don't blow through it

"Reasoning isn't free." I recommend baking that assumption into the code itself.

Production Issues I Hit and How I Handled Them

Even after reading the Apps / API / Servers docs end to end, you'll find issues in production that aren't covered. A few of mine, with the workarounds I settled on.

Issue 1: Thinking timeouts

The google-generativeai library defaults to a 60-second timeout, but thinking_budget=24576 on long analytical prompts can occasionally exceed that. In Cloud Run / Cloud Functions environments I bump the explicit timeout to 120 seconds and halve the Budget on retry.

import time
from google.api_core import exceptions
 
def safe_thinking_generate(prompt, initial_budget=16384, max_retries=2):
    """Halve the budget on timeout and retry."""
    budget = initial_budget
    for attempt in range(max_retries + 1):
        try:
            response = model.generate_content(
                prompt,
                generation_config=genai.GenerationConfig(
                    thinking_config=genai.ThinkingConfig(thinking_budget=budget)
                ),
                request_options={"timeout": 120}
            )
            return response.text, budget, attempt
        except exceptions.DeadlineExceeded:
            budget = max(budget // 2, 1024)
            print(f"Timeout. Retrying with budget={budget}")
            time.sleep(2 ** attempt)
    raise RuntimeError("Thinking budget exhausted after retries")

Issue 2: thinking_part returns empty string

A rare edge case — part.thought=True text can come back as an empty string. It happens when the prompt is unusually short or the Budget is set too low. When I have an audit requirement for reasoning traces, I have a small wrapper that re-runs with Budget=4,096 if the trace returns empty.

Issue 3: Rate limits at ~100 concurrent requests

Gemini 2.5 Pro Preview's rate limits depend on your Tier, but Thinking inflates per-request processing time, so you hit RPM ceilings earlier than the raw number suggests. With Budget=16,384 and 100 concurrent in-flight requests on a Tier 1 account, I was getting frequent 429 errors. The fix was simple: cap concurrency inversely proportional to Budget in the controller.

Task-by-Task Budget Cheat Sheet

Finally, the cheat sheet I keep handy when deciding Budget for new use cases. Treat it as a starting point and adjust to your domain.

Task type	Recommended Budget	Reasoning
Translation / summary / format conversion	0	No reasoning needed. Minimize latency
FAQ answering	0–1,024	Template-friendly, keep it minimal
Code review (small scope)	4,096	Surface-level improvements suffice
Code review (architectural)	8,192	Enough to weigh design tradeoffs
Data analysis / statistical interpretation	8,192	Catches typical pitfalls
Multi-step reasoning / optimization	16,384	Best quality/cost knee point
Mathematical proof / extremely complex decisions	24,576	Insurance ceiling. Don't use as default
Server-side batch (overnight)	16,384	Prioritize quality; latency irrelevant
Calls invoked from app UI	0–2,048	UX first

This table reflects my workloads, so calibrate it to your own product. I revisit it every one to two months — that's been enough to keep cost and quality in balance as the underlying API evolves.

Where to Go Next

Once you've mastered Thinking Budget, try combining it with Gemini 2.5 Pro's long context window (1 million tokens). Loading large volumes of documents and then running deep reasoning on top of them opens up analysis workflows that weren't previously possible.

Also, thinking_budget is currently available in the gemini-2.5-pro-preview series. Watch the official Gemini API docs for specification changes as the model moves toward GA.

The core principle is simple: harder problems deserve more thought. What's remarkable about Gemini 2.5 Pro is that this human intuition can now be applied to AI models — on demand, and with quantitative control.

Implementation examples in this guide were tested on gemini-2.5-pro-preview-03-25. Model names and API specifications are subject to change.

Thank You for Reading

Gemini Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.