A deep dive into Gemini 2.5 Pro's Thinking feature and internal reasoning process. Covers Thinking Budget configuration, optimal values by task type, extracting thinking_parts for quality verification, and prompt design patterns that maximize reasoning quality.
A Model Where You Can Configure How Much It Thinks
What makes Gemini 2.5 Pro fundamentally different from other models is this: you can control how deeply the model reasons before returning an answer.
That's the thinking_budget parameter. Set it to 0 and you get an immediate response (Thinking OFF). Push it to 24576 tokens and the model works through the problem internally before answering. The same prompt can produce very different output quality depending on this setting — I ran systematic tests, and the differences are striking.
What Is Thinking Budget?
thinking_budget is a parameter of the Gemini 2.5 model family that sets the maximum number of tokens the model can use for internal reasoning.
The key distinction: it's a maximum, not a guarantee. Simple questions get resolved with minimal reasoning; hard problems can consume the full budget. Setting a high value is effectively saying "take as long as you need."
On cost, to be straightforward: thinking tokens are billed as output tokens, so running a complex problem with thinking_budget=24576 can cost noticeably more than the same call with a low budget. Use lower values for simple tasks where the extra reasoning isn't worth it.
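To make the cost point concrete, here is a minimal sketch of the arithmetic. The helper name `estimate_cost` and both per-million-token prices are placeholders chosen for illustration, not quoted rates; substitute the figures from the current pricing page before relying on the numbers.

```python
def estimate_cost(input_tokens, output_tokens, thought_tokens,
                  input_price_per_m, output_price_per_m):
    """Estimate the cost of one request in dollars.

    Assumes thinking tokens are billed at the output-token rate.
    Prices are per one million tokens and are supplied by the caller.
    """
    billed_output = output_tokens + thought_tokens
    return (input_tokens * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000


# Placeholder prices (assumptions for illustration only).
IN_PRICE, OUT_PRICE = 1.0, 10.0

no_thinking = estimate_cost(500, 800, 0, IN_PRICE, OUT_PRICE)
full_budget = estimate_cost(500, 800, 24576, IN_PRICE, OUT_PRICE)
print(f"without thinking:  ${no_thinking:.4f}")
print(f"budget fully used: ${full_budget:.4f}")
```

Because the budget is only a ceiling, the second figure is a worst case; real requests usually land somewhere between the two.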
API Configuration
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel('gemini-2.5-pro-preview-03-25')

# Set Thinking Budget
response = model.generate_content(
    contents="Based on current research progress, estimate how long it will take for quantum computers to break today's encryption standards.",
    generation_config=genai.GenerationConfig(
        thinking_config=genai.ThinkingConfig(
            thinking_budget=16384  # High-reasoning task
        )
    )
)

# Standard answer
print(response.text)

# You can also retrieve the thinking process (useful for debugging and quality checks)
if hasattr(response, 'candidates') and response.candidates:
    for part in response.candidates[0].content.parts:
        if hasattr(part, 'thought') and part.thought:
            print("=== Thinking Process ===")
            print(part.text[:500])  # Preview the first 500 chars
Optimal Budget Values by Task Type
Through testing, I found that optimal Budget values vary significantly by task nature.
Tasks That Need Immediate Answers (Budget: 0–1000)
Translation, summarization, format conversion
Factual lookups ("What is the capital of X?")
Template-based text generation
# Translation: no Thinking needed
response = model.generate_content(
    "Translate to Japanese: Visiting Japan during cherry blossom season is a wonderful experience.",
    generation_config=genai.GenerationConfig(
        thinking_config=genai.ThinkingConfig(thinking_budget=0)
    )
)
# Code review: moderate reasoning
code = """
def find_duplicates(lst):
    seen = []
    duplicates = []
    for item in lst:
        if item in seen:
            duplicates.append(item)
        else:
            seen.append(item)
    return duplicates
"""

response = model.generate_content(
    f"Identify performance issues in this Python code and propose improvements:\n{code}",
    generation_config=genai.GenerationConfig(
        thinking_config=genai.ThinkingConfig(thinking_budget=8192)
    )
)
At this budget, the model spots that the `item in seen` check is an O(n) list scan, which makes the whole function O(n²), and proposes a set-based version with O(1) lookups. It also compares multiple alternative approaches, which lower budgets tend to skip.
Tasks Requiring Deep Reasoning (Budget: 16384–24576)
Complex mathematical proofs
Multi-step logical reasoning chains
Optimization problems with multiple constraints
Long-form research and analysis
# Complex analysis: maximum reasoning
response = model.generate_content(
    """Analyze the following situation and provide prioritized recommendations.

Situation: A startup is evaluating Series A funding.
- Current MRR: $50,000 (up 15% month-over-month)
- Runway: 8 months
- VC term sheet: $2M investment at $8M valuation
- Key competitor closed Series B 2 weeks ago at $30M valuation
- Two senior engineers are considering leaving over equity compensation

Analyze the interdependencies between these factors and recommend a prioritized action plan.
""",
    generation_config=genai.GenerationConfig(
        thinking_config=genai.ThinkingConfig(thinking_budget=24576)
    )
)
With maximum budget on this kind of multi-variable, multi-objective decision problem, you get analysis that connects the dots: how competitor momentum affects recruiting, how runway interacts with valuation leverage, etc.
Using the Thinking Process for Quality Verification
One underused application of Thinking mode: inspecting the model's reasoning to verify output quality.
def analyze_with_thinking_check(prompt, budget=16384):
    response = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(
            thinking_config=genai.ThinkingConfig(thinking_budget=budget)
        )
    )

    thinking_text = ""
    answer_text = ""

    if response.candidates:
        for part in response.candidates[0].content.parts:
            if hasattr(part, 'thought') and part.thought:
                thinking_text += part.text
            else:
                answer_text += part.text

    # Check reasoning adequacy
    quality_check = {
        "thinking_length": len(thinking_text),
        "answer_length": len(answer_text),
        # Short thinking may indicate shallow treatment
        "thinking_adequate": len(thinking_text) > 500,
        "thinking_preview": thinking_text[:200] if thinking_text else "none"
    }

    return answer_text, quality_check


answer, quality = analyze_with_thinking_check(
    "Explain three methods for detecting overfitting in machine learning models, with implementation examples."
)
print(f"Quality check: {quality}")
print(f"\nAnswer:\n{answer}")
When thinking output is very short (a few hundred characters or less), the model may be treating the problem superficially. Consider increasing the budget or making your prompt more specific.
Prompt Design Patterns That Amplify Thinking
Budget settings matter, but the way you write your prompt also shapes how well Thinking performs.
Effective patterns:
Don't prescribe steps — let the model reason
# Less effective
"First do X, then do Y, finally do Z."
# More effective
"Figure out the best approach and implement it."
In Thinking mode, prescribing steps constrains the model's search space. Give it a goal and let it reason — you'll get more creative solutions.
Add "watch for potential mistakes"
prompt = """Interpret the results of this hypothesis test.

[Data]
p-value: 0.048
Power: 0.62
Sample size: 34

Note: Interpret carefully, avoiding common pitfalls and misconceptions."""
This nudge shifts the model from simple p-value interpretation toward a nuanced analysis that acknowledges low statistical power and small-sample limitations.
Request multiple perspectives
"Analyze this architectural proposal from the perspectives of an engineer, a product manager, and a security specialist."
Role specification guides the model toward multi-angle consideration — something it does particularly well with adequate thinking budget.
Cost-Efficient Operation
Thinking is powerful, but cost management matters. Here's a pattern I actually use:
class AdaptiveThinkingClient:
    """Automatically adjusts Budget based on estimated task complexity."""

    SIMPLE_KEYWORDS = ["translate", "summarize", "convert", "format"]
    COMPLEX_KEYWORDS = ["analyze", "design", "prove", "optimize", "compare", "evaluate"]

    def __init__(self, model):
        self.model = model

    def generate(self, prompt):
        budget = self._estimate_budget(prompt)
        response = self.model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(
                thinking_config=genai.ThinkingConfig(thinking_budget=budget)
            )
        )
        return response.text, budget

    def _estimate_budget(self, prompt):
        prompt_lower = prompt.lower()
        if any(k in prompt_lower for k in self.SIMPLE_KEYWORDS):
            return 0  # No thinking needed
        elif any(k in prompt_lower for k in self.COMPLEX_KEYWORDS):
            return 16384  # High reasoning
        elif len(prompt) > 500:
            return 8192  # Long prompt → moderate reasoning
        else:
            return 4096  # Default


client = AdaptiveThinkingClient(model)
answer, budget_used = client.generate("Analyze the time complexity of this algorithm: ...")
print(f"Budget used: {budget_used}")
print(answer)
Measured Cost and Latency Differences
Theory only gets you so far. Below are average values I recorded in April 2026 by running the same 25 prompts through gemini-2.5-pro-preview-03-25 five times each, varying only thinking_budget (Tokyo region):
Pay attention to the jump from 16,384 to 24,576: quality only improves by 0.1 points while cost rises by about 36%. In my own workloads I treat 16,384 as the practical ceiling. Anything beyond that tends to be insurance cost with diminishing returns.
Latency above 10 seconds also fundamentally changes the UX. Without a "thinking…" placeholder in chat-style UIs, users will assume the app froze and bail out. You essentially have to redesign the UX around the assumption that Thinking is active — for example, by moving long-running calls into background jobs.
Five Implementation Insights Not in the Docs
The API reference only takes you so far. Here are five things I learned by integrating the Gemini API into production while running an indie app business with 50M+ downloads.
1. thinking_budget=0 still runs light internal reasoning
The docs describe Budget=0 as "Thinking OFF," but in practice I see 0.5 to 1.5 seconds of latency variance even at zero, depending on prompt length and complexity. It seems to mean "do not emit thought traces" rather than literal zero reasoning. If you're budgeting latency to the millisecond, factor this in.
2. Parallel calls — split the budget evenly
When fanning out one request into several parallel subqueries (say, evaluating five options simultaneously), an even split across the total budget produces more stable results than a weighted one. I tried "give the top priority subtask 16,384 and the remaining four 2,048 each" and the low-budget branches kept returning under-considered answers.
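As a rough sketch of the even split described here: the function name and the `floor` parameter are assumptions added for illustration, not part of the setup described above.

```python
def split_budget_evenly(total_budget, num_subqueries, floor=1024):
    """Divide a total thinking budget equally across parallel subqueries.

    `floor` is a hypothetical minimum per branch; if the even share falls
    below it, the floor is used and the sum may exceed `total_budget`.
    """
    if num_subqueries <= 0:
        raise ValueError("num_subqueries must be positive")
    share = total_budget // num_subqueries
    return [max(share, floor)] * num_subqueries


# The weighted split from the text spends 16,384 + 4 x 2,048 = 24,576 in total.
# Spreading the same total evenly gives every branch the same headroom.
print(split_budget_evenly(24576, 5))
```

Integer division drops the remainder, so the even split never spends more than the total unless the floor kicks in.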
3. thinking_part can occasionally break Markdown structure
Pulling parts where part.thought=True from response.candidates[0].content.parts sometimes returns content with mismatched triple-backtick code fences. If you render via an MDX pipeline downstream, you'll hit parse errors. I now validate that the count of triple-backticks is even before joining parts.
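The even-count check could look something like the sketch below. Both function names are hypothetical, and closing a dangling fence (rather than rejecting the trace) is just one possible way to handle a mismatch.

```python
FENCE = "`" * 3  # triple backtick, built this way so the marker never appears literally


def has_balanced_fences(text):
    """Return True when the number of triple-backtick markers is even."""
    return text.count(FENCE) % 2 == 0


def join_thought_parts(parts):
    """Join thought texts, closing a dangling code fence if one is left open."""
    joined = "\n".join(parts)
    if not has_balanced_fences(joined):
        joined += "\n" + FENCE
    return joined


broken = [FENCE + "python", "x = 1"]
print(has_balanced_fences("\n".join(broken)))       # unbalanced input
print(has_balanced_fences(join_thought_parts(broken)))  # repaired output
```

A plain substring count is deliberately simple; it does not try to understand longer fences or indented code, so treat it as a cheap guard in front of the real parser.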
4. Streaming + Thinking → reasoning chunks arrive late
With generate_content_stream(), the answer body streams in chunk by chunk, but thought=True chunks are delivered together after reasoning completes. Building a UI that streams reasoning alongside the answer simply isn't possible with the current API contract.
5. The same prompt can yield different conclusions at different budgets
Running the same hard math problem at Budget=4,096 and Budget=16,384 occasionally yields different final answers — not just different phrasing. Most often the low-budget answer misses something. But a few times the low-budget answer was correct where the higher one wasn't. Build a comparison harness rather than trusting a single Budget setting blindly.
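A comparison harness along these lines can be quite small. In this sketch, `generate(prompt, budget)` is a caller-supplied function (an assumption here) that wraps the actual API call and returns the final answer as a string; a stand-in is used so the example runs offline.

```python
from collections import Counter


def compare_across_budgets(generate, prompt, budgets):
    """Run `prompt` at each budget and report whether the answers agree.

    `generate(prompt, budget)` is a hypothetical callable returning the
    model's final answer as a string.
    """
    answers = {b: generate(prompt, b).strip() for b in budgets}
    counts = Counter(answers.values())
    majority, votes = counts.most_common(1)[0]
    return {
        "answers": answers,
        "agree": len(counts) == 1,
        "majority_answer": majority,
        "majority_votes": votes,
    }


# Stand-in for a real API call: the low budget "misses something".
def fake_answer(prompt, budget):
    return "41" if budget < 8192 else "42"


report = compare_across_budgets(fake_answer, "hard problem", [4096, 8192, 16384])
print(report["agree"], report["majority_answer"], report["majority_votes"])
```

Exact string comparison only suits tasks with a short canonical answer; for free-form output the comparison step would need a task-specific equivalence check.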
Budget Allocation Strategy from an Indie Developer's Perspective
I've been shipping iPhone and Android apps as an indie developer (Masaki Hirokawa / Dolice) since 2014. My wallpaper and ambient apps have crossed 50M downloads in aggregate and still serve tens of millions of AdMob impressions per month. Here's how I allocate Thinking Budget across those production apps.
Allocation rules I actually use
Calls invoked directly from in-app UI — Budget=0 to 2,048. Latency first. Users won't wait, even if Thinking is on
Recommendation generation triggered by the user — Budget=4,096 to 8,192. Run in the background, push results via notification or banner
Server-side daily batch jobs (auto-tagging, category inference) — Budget=8,192 to 16,384. Quality over cost
Strategic queries I run by hand (planning, analysis) — Budget=16,384 to 24,576. Few per day, so cost is acceptable
The reason I split it this way: in my first month integrating Gemini I left every call at Budget=16,384 and my monthly API bill came out about 4.2x my projection. Cost discipline has been with me since I first taught myself programming in 1997, but it's easy to lose track during a feature rollout. Other indie developers probably know that pattern.
My default values in production code
When wiring this into long-lived services, I bias toward conservative values:
Default for reasoning tasks: 8,192 (never start at 24,576)
Hard ceiling for single calls: 16,384 (cases that actually need 24,576 are rare)
Batch jobs: plan for average spend at or below 50% of the monthly cost budget, so peak days don't blow through it
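One way to encode the first two defaults is a small guard that every call passes through. This is a sketch: the function name and the `allow_max` opt-in flag are assumptions, while the numeric values come from the list above.

```python
DEFAULT_BUDGET = 8192   # default for reasoning tasks
HARD_CEILING = 16384    # ceiling for any single call
ABSOLUTE_MAX = 24576    # only reachable with an explicit opt-in


def resolve_budget(requested=None, allow_max=False):
    """Return the thinking budget to send, applying the default and ceiling.

    `allow_max` is a hypothetical opt-in for the rare call that really
    needs the full 24,576.
    """
    if requested is None:
        return DEFAULT_BUDGET
    if requested < 0:
        raise ValueError("budget must be non-negative")
    ceiling = ABSOLUTE_MAX if allow_max else HARD_CEILING
    return min(requested, ceiling)


print(resolve_budget())                       # falls back to the default
print(resolve_budget(24576))                  # clamped to the ceiling
print(resolve_budget(24576, allow_max=True))  # explicit opt-in
```

Making the maximum an explicit opt-in means an accidental high value in a config file gets clamped instead of silently billed.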
"Reasoning isn't free." I recommend baking that assumption into the code itself.
Production Issues I Hit and How I Handled Them
Even after reading the Apps / API / Servers docs end to end, you'll find issues in production that aren't covered. A few of mine, with the workarounds I settled on.
Issue 1: Thinking timeouts
The google-generativeai library defaults to a 60-second timeout, but thinking_budget=24576 on long analytical prompts can occasionally exceed that. In Cloud Run / Cloud Functions environments I bump the explicit timeout to 120 seconds and halve the Budget on retry.
import time
from google.api_core import exceptions


def safe_thinking_generate(prompt, initial_budget=16384, max_retries=2):
    """Halve the budget on timeout and retry."""
    budget = initial_budget
    for attempt in range(max_retries + 1):
        try:
            response = model.generate_content(
                prompt,
                generation_config=genai.GenerationConfig(
                    thinking_config=genai.ThinkingConfig(thinking_budget=budget)
                ),
                request_options={"timeout": 120}
            )
            return response.text, budget, attempt
        except exceptions.DeadlineExceeded:
            if attempt == max_retries:
                break  # No retries left, so skip the backoff sleep
            budget = max(budget // 2, 1024)
            print(f"Timeout. Retrying with budget={budget}")
            time.sleep(2 ** attempt)
    raise RuntimeError("Thinking budget exhausted after retries")
Issue 2: thinking_part returns empty string
A rare edge case — part.thought=True text can come back as an empty string. It happens when the prompt is unusually short or the Budget is set too low. When I have an audit requirement for reasoning traces, I have a small wrapper that re-runs with Budget=4,096 if the trace returns empty.
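The wrapper can be sketched roughly as follows. Here `generate(prompt, budget)` is a caller-supplied function (hypothetical) returning an `(answer_text, thinking_text)` tuple, and the choice to skip the re-run when the original budget was already at or above the fallback is an assumption of this sketch.

```python
def generate_with_trace(generate, prompt, budget, fallback_budget=4096):
    """Return (answer, thinking, budget_used), retrying once on an empty trace.

    `generate(prompt, budget)` is a hypothetical callable that returns an
    (answer_text, thinking_text) tuple.
    """
    answer, thinking = generate(prompt, budget)
    if thinking.strip() or budget >= fallback_budget:
        return answer, thinking, budget
    # Trace came back empty at a low budget: re-run once at the fallback.
    answer, thinking = generate(prompt, fallback_budget)
    return answer, thinking, fallback_budget


# Stand-in for the real call: low budgets produce no trace.
def fake_generate_trace(prompt, budget):
    thinking = "" if budget < 4096 else "step 1, then step 2"
    return "answer", thinking


print(generate_with_trace(fake_generate_trace, "Hi", 512))
```

The single retry keeps the worst-case cost bounded at two calls per request.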
Issue 3: Rate limits at ~100 concurrent requests
Gemini 2.5 Pro Preview's rate limits depend on your Tier, but Thinking inflates per-request processing time, so you hit RPM ceilings earlier than the raw number suggests. With Budget=16,384 and 100 concurrent in-flight requests on a Tier 1 account, I was getting frequent 429 errors. The fix was simple: cap concurrency inversely proportional to Budget in the controller.
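A minimal sketch of that inverse scaling is shown below. The function name and every default (100 concurrent requests at a 1,024 baseline) are illustrative assumptions, not measured limits; the returned value would typically size a semaphore or worker pool.

```python
def max_concurrency(budget, base_concurrency=100, base_budget=1024, floor=1):
    """Scale allowed in-flight requests inversely with the thinking budget.

    At or below `base_budget` the full `base_concurrency` is allowed; above
    it the cap shrinks in proportion. All defaults are illustrative.
    """
    if budget <= base_budget:
        return base_concurrency
    return max(floor, (base_concurrency * base_budget) // budget)


for b in (0, 1024, 4096, 16384):
    print(f"budget={b}: up to {max_concurrency(b)} concurrent requests")
```

The right baseline depends on your tier and typical prompt length, so it is worth tuning against the 429 rate you actually observe.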
Task-by-Task Budget Cheat Sheet
Finally, the cheat sheet I keep handy when deciding Budget for new use cases. Treat it as a starting point and adjust to your domain.
This table reflects my workloads, so calibrate it to your own product. I revisit it every one to two months — that's been enough to keep cost and quality in balance as the underlying API evolves.
Where to Go Next
Once you've mastered Thinking Budget, try combining it with Gemini 2.5 Pro's long context window (1 million tokens). Loading large volumes of documents and then running deep reasoning on top of them opens up analysis workflows that weren't previously possible.
Also, thinking_budget is currently available in the gemini-2.5-pro-preview series. Watch the official Gemini API docs for specification changes as the model moves toward GA.
The core principle is simple: harder problems deserve more thought. What's remarkable about Gemini 2.5 Pro is that this human intuition can now be applied to AI models — on demand, and with quantitative control.
Implementation examples in this guide were tested on gemini-2.5-pro-preview-03-25. Model names and API specifications are subject to change.
What Thinking Budget really controls
thinking_budget is an upper bound on tokens spent on the internal chain of thought before a response. You configure it with thinking_config={"thinking_budget": N}. Setting N to 0 disables thinking entirely; setting it high allows longer reasoning.
The key insight: the value is an upper bound, not a target. The model will stop thinking when it has enough, so the actual tokens consumed, reported as usage_metadata.thoughts_token_count, can be far below the budget. Asking for 8192 doesn't force 8192; a simple task might burn only 300.
This is why "just set it high" is only half right. When the budget is too generous, the model sometimes over-explores and produces worse answers than at a medium budget. "I raised the budget and accuracy dropped" is a real phenomenon, especially on structured or classification tasks.
The measurement rig — three metrics, always together
To find the right value for your task, record these three numbers together across multiple budgets:
Accuracy — whatever task-specific score applies
Thought tokens consumed — from usage_metadata.thoughts_token_count
Latency — time from request to final token
Plot them and you'll spot the knee of the curve: the budget beyond which accuracy barely moves while cost and latency keep rising.
import time
import google.generativeai as genai
from dataclasses import dataclass


@dataclass
class TrialResult:
    budget: int
    score: float
    thoughts_tokens: int
    latency_ms: int
    output_tokens: int


def run_trial(prompt: str, budget: int, evaluator) -> TrialResult:
    """Run one trial and collect the three metrics we care about."""
    model = genai.GenerativeModel(
        "gemini-2.5-pro",
        generation_config={"thinking_config": {"thinking_budget": budget}},
    )
    start = time.perf_counter()
    resp = model.generate_content(prompt)
    latency = int((time.perf_counter() - start) * 1000)

    score = evaluator(resp.text)
    usage = resp.usage_metadata

    return TrialResult(
        budget=budget,
        score=score,
        # Thought tokens actually consumed, not the budget that was allowed
        thoughts_tokens=usage.thoughts_token_count,
        latency_ms=latency,
        output_tokens=usage.candidates_token_count,
    )
The evaluator is task-specific. For classification, it's an exact-match check. For generation, it's BLEU or a reference-based similarity. For tool calling, it's whether the function name and arguments match. Pick whichever scoring rule makes sense for the job you actually ship.
What I measured, and what surprised me
Below are averages of 20 runs per budget on three tasks I run a lot.
Task A: mid-difficulty math (high-school level)
budget=0: 42% accuracy, 0 thought tokens
budget=1024: 71% accuracy, 890 avg thought tokens
budget=4096: 85% accuracy, 2,100 avg thought tokens
budget=8192: 87% accuracy, 3,400 avg thought tokens
budget=16384: 86% accuracy, 3,800 avg thought tokens
Accuracy climbs steeply up to 4096, then plateaus. Going from 4096 to 16384 adds one percentage point while average thought tokens rise about 1.8x (2,100 to 3,800). I use 4096 in production.
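The plateau is easier to see as a marginal gain: accuracy points earned per 1,000 additional thought tokens. The sketch below uses only the Task A figures listed above; the helper name is made up for illustration.

```python
# (budget, accuracy %, average thought tokens) from the Task A results.
TASK_A = [
    (0, 42, 0),
    (1024, 71, 890),
    (4096, 85, 2100),
    (8192, 87, 3400),
    (16384, 86, 3800),
]


def marginal_gains(rows):
    """Return (budget, accuracy points gained per 1,000 extra thought tokens).

    Assumes average thought tokens strictly increase from row to row.
    """
    gains = []
    for (_, prev_acc, prev_tok), (budget, acc, tok) in zip(rows, rows[1:]):
        extra = tok - prev_tok
        gains.append((budget, round((acc - prev_acc) / extra * 1000, 1)))
    return gains


for budget, gain in marginal_gains(TASK_A):
    print(f"budget={budget}: {gain:+.1f} points per 1k thought tokens")
```

The return per thousand tokens falls from about 33 points at 1024 to under 2 points at 8192, and turns negative at 16384.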
This one surprises most people. Classification does not benefit from a bigger budget at all; budget=0 is within a percentage point of the maximum. Paying for thinking tokens here is a net-negative decision.
Task C: code generation at LeetCode Medium difficulty
budget=0: 38% passed
budget=2048: 62% passed
budget=8192: 81% passed
budget=16384: 84% passed
budget=24576: 85% passed
Code scales further than math, but the slope softens after 8192. I use 8192 as the default and bump to 16384 only when I know the problem is hard.
A script that finds your sweet spot automatically
Once you've built the measurement rig, you may as well automate the sweep. This small helper runs increasing budgets and returns the first one where the marginal accuracy gain drops below 1%.
import statistics


def find_budget_sweet_spot(prompts, evaluator, budgets=None, trials_per_budget=10):
    """Return the recommended budget and full results table for a task."""
    if budgets is None:
        budgets = [0, 256, 1024, 4096, 8192, 16384]

    results = {}
    for b in budgets:
        trials = []
        for prompt in prompts:
            for _ in range(trials_per_budget):
                trials.append(run_trial(prompt, b, evaluator))
        results[b] = {
            "avg_score": statistics.mean(t.score for t in trials),
            "avg_thoughts": statistics.mean(t.thoughts_tokens for t in trials),
            "avg_latency": statistics.mean(t.latency_ms for t in trials),
        }

    # First point where marginal accuracy gain < 1% is the sweet spot
    sorted_b = sorted(results.keys())
    for i in range(1, len(sorted_b)):
        prev, curr = sorted_b[i-1], sorted_b[i]
        delta = results[curr]["avg_score"] - results[prev]["avg_score"]
        if delta < 0.01 and curr >= 1024:
            return prev, results

    return sorted_b[-1], results
Ten to twenty representative prompts and an hour of your time is all you need. The important shift is to treat the docs' suggestion as a starting point, not a final answer.
Three production habits that compounded
Habit 1: per-task budget routing. My app has four request types — classify, summarize, generate, reason — and each gets its own budget. Fixing everything at the maximum cost me about 50% more per month than necessary.
Habit 2: difficulty estimation up front. Even inside "generate," some prompts are trivial and some are hard. I run a tiny Gemini Flash call first to estimate difficulty, then pick a budget accordingly. The overhead is a few hundred tokens and easily pays for itself on long-tail hard requests.
Habit 3: route 10% of traffic to budget=0 as an A/B. For simpler tasks I keep a small slice of traffic at budget=0 as a control. If accuracy holds for a week, I flip that task to budget=0 permanently. This gradually prunes unneeded thinking from my pipeline.
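For the control slice in Habit 3, hashing a stable request or user ID is one simple way to get a repeatable split of roughly 10%. This is a sketch under that assumption; the function names and the ID format are hypothetical.

```python
import hashlib


def in_control_group(request_id, control_percent=10):
    """Deterministically assign about `control_percent`% of requests to control.

    Hashing the ID (a hypothetical stable identifier) keeps the assignment
    repeatable across retries and processes.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < control_percent


def budget_for(request_id, task_budget):
    """Use budget=0 for the control slice, the task's normal budget otherwise."""
    return 0 if in_control_group(request_id) else task_budget


sample = [f"req-{i}" for i in range(10_000)]
share = sum(in_control_group(r) for r in sample) / len(sample)
print(f"control share: {share:.1%}")
```

Logging the assigned group next to each accuracy score is what makes the week-long comparison possible afterwards.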
Three misconceptions I keep hearing
"Higher is always safer." It's not. Over-large budgets can cause the model to over-explore and return worse answers on structured tasks
"Budget=0 is the broken mode." For simple classification or extraction, budget=0 is often faster, cheaper, and equally accurate
"The budget is what you pay each time." You pay for what was consumed, not what you allowed. Watch usage_metadata.thoughts_token_count in production
Thinking Budget is less a power dial and more a routing decision: where do you spend the model's mental effort? The more carefully you route, the more accuracy you can squeeze out of the same bill.
What to try today
Run ten of your existing prompts at three budgets (0, 1024, 8192) and spend an hour comparing accuracy against cost. In my experience that hour is enough to cut a meaningful slice off your monthly API bill, and to build a calibration habit you'll use on every new task going forward.