●API — The Gemini API now processes over 16 billion tokens per minute, roughly on par with OpenAI●ENTERPRISE — Gemini Enterprise passes 8 million paid seats across more than 2,800 companies●AGENT — Claude Opus 4.8 arrives on Gemini Enterprise Agent Platform, expanding multi-vendor choices●SPEECH — gemini-3.1-flash-tts-preview adds streaming speech generation via streamGenerateContent●DATA — Crossbeam data stores can now connect to Gemini Enterprise in public preview●MODEL — Gemini 3.5 Flash GA and Gemma 4 round out options for agentic and lightweight workloads●API — The Gemini API now processes over 16 billion tokens per minute, roughly on par with OpenAI●ENTERPRISE — Gemini Enterprise passes 8 million paid seats across more than 2,800 companies●AGENT — Claude Opus 4.8 arrives on Gemini Enterprise Agent Platform, expanding multi-vendor choices●SPEECH — gemini-3.1-flash-tts-preview adds streaming speech generation via streamGenerateContent●DATA — Crossbeam data stores can now connect to Gemini Enterprise in public preview●MODEL — Gemini 3.5 Flash GA and Gemma 4 round out options for agentic and lightweight workloads
Your Tool Results Are Quietly Eating the Conversation — Handle Passing to Keep Gemini Function Calling Contexts Lean
Tool results linger in Function Calling history and compound your input tokens every turn. Two implementations — a token-budgeted compactor and handle passing — cut my measured input by roughly 8x, with the pitfalls I hit along the way.
When I pointed an agent at 312 app reviews, everything slowed down from turn three onward. Responses that came back in two seconds early in the loop were taking over ten by the end, and at month's close, this one job's line on the invoice looked oddly inflated.
As an indie developer, I run a nightly agent that classifies and aggregates reviews for my apps using Gemini's Function Calling. The naive implementation — stuff whatever fetch_reviews returns straight into the functionResponse — worked fine while review counts were small. Then a busy month arrived, and the behavior flipped. The culprit wasn't generation output at all: the huge JSON my tool returned was sitting in the conversation history and getting re-sent as input on every subsequent turn.
In a Function Calling loop, you append the model's functionCall and your functionResponse to contents and send the whole thing again. So a tool result weighing 48,000 tokens gets billed five more times across a six-turn loop. If you only watch output tokens, this input-side compounding is easy to miss.
Measure Input Tokens per Turn — the Culprit Is Residency
Before fixing anything, put numbers on it. On each turn, run the full contents you're about to send through count_tokens.
# turn_meter.py — chart per-turn input token growth in your agent loop# Solves: turning "the later turns feel slow" into a numberfrom google import genaiclient = genai.Client() # API key from GEMINI_API_KEY env varMODEL = "gemini-flash-latest"def log_turn_tokens(turn: int, contents: list) -> int: """Record token count of the full contents just before sending""" result = client.models.count_tokens(model=MODEL, contents=contents) print(f"turn={turn} input_tokens={result.total_tokens}") return result.total_tokens
Counting is free-tier friendly and it's one extra line in the loop. My review-analysis agent (a six-turn run that fetches reviews from two stores, classifies, and aggregates) looked like this before any fixes:
Turn
What's in context
Input tokens (measured)
1
Instructions only
1,240
2
+ fetch_reviews result (store A, 312 rows)
49,800
3
+ intermediate classification output
52,300
4
+ fetch_reviews result (store B, 287 rows)
101,900
5
+ intermediate aggregation output
104,500
6
Final report generation
105,100
One run consumed about 410,000 input tokens in total, and roughly 80% of that was re-sent review JSON. I was paying most of the job's latency and cost for repetition that contributed nothing to output quality.
Three Ways to Hand Over a Tool Result — Full, Compacted, or by Handle
How a tool result reaches the model is a design choice, even though it rarely gets treated as one. I've settled on this three-way framing:
Approach
What the model sees
Good fit
Risk
Full payload
The entire result JSON
Small results (roughly under 2,000 tokens)
History residency compounds input
Budgeted compaction
A slimmed version keeping priority fields
Only some fields matter to the task
A dropped field turns out to be needed later
Handle passing
Row count, schema, digest + a reference ID
Large results, detail lookups are rare
Each detail lookup costs an extra turn
The deciding question is simple: does the model genuinely need to read the full payload? Data like reviews — processed once, never re-read — has no business living in the history.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦You'll be able to diagnose why your agent gets slower and pricier with every turn, using countTokens to chart per-turn input growth instead of guessing
✦You can drop in two copy-paste implementations — a token-budgeted compactor and handle passing — that cut a review-analysis agent's input tokens by roughly 8x in my measurements
✦You'll sidestep the traps of context compaction ahead of time: breaking thought signatures, over-compaction re-fetch loops, and parallel-call response ordering
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
Implementation 1: a Token-Budgeted Compactor — Stop Sending Everything First
The first move is a layer that shrinks results to a budget before they go into functionResponse.
# compactor.py — shrink tool results to a token budget before functionResponse# Solves: compounding input growth from re-sending fields nobody usesimport jsonfrom google import genaiclient = genai.Client()MODEL = "gemini-flash-latest"def count(text: str) -> int: return client.models.count_tokens(model=MODEL, contents=text).total_tokensdef compact_tool_result(rows: list[dict], keep_fields: list[str], budget_tokens: int = 6000) -> dict: """Keep only priority fields; trim rows if still over budget. Always tell the model that trimming happened.""" slim = [{k: r.get(k) for k in keep_fields} for r in rows] kept = len(slim) while kept > 1 and count(json.dumps(slim[:kept], ensure_ascii=False)) > budget_tokens: kept = int(kept * 0.8) # shave 20% at a time until under budget return { "rows": slim[:kept], "total_rows": len(rows), "returned_rows": kept, "truncated": kept < len(rows), "note": "If truncated=true and you need all rows, call fetch_more", }# Usage: keep only text and rating for reviews# compacted = compact_tool_result(reviews, keep_fields=["text", "rating"])# → 312 rows / 48,200 tokens fits within 6,000 tokens
Two things matter here. First, always disclose the trim via the truncated flag and counts. Trim silently and the model will aggregate under the false belief that "there are only 287 reviews," producing a quietly wrong report. Second, cut fields before rows. Dropping rows first biases the sample; fields irrelevant to the task can go safely.
Implementation 2: Handle Passing — Keep the Payload Local, Give the Model a Table of Contents
When compaction isn't enough, flip the model of ownership: store the full result locally, and return only a count, a schema, a short digest, and a reference ID. The model pulls details through a second tool only when it actually needs them.
# handle_passing.py — pass tool results by reference# Solves: keeping huge payloads out of history, fetching fragments on demandimport uuidclass ResultStore: """A store scoped to one run. Cross-run references fail loudly.""" def __init__(self): self._data: dict[str, list[dict]] = {} def put(self, rows: list[dict]) -> str: handle = f"res_{uuid.uuid4().hex[:8]}" self._data[handle] = rows return handle def slice(self, handle: str, offset: int = 0, limit: int = 40) -> dict: if handle not in self._data: return {"error": f"handle {handle} does not exist in this run"} rows = self._data[handle][offset:offset + limit] return {"rows": rows, "offset": offset, "total": len(self._data[handle])}store = ResultStore()def fetch_reviews(store_id: str) -> dict: """Only this return value reaches the model. Full data goes to ResultStore.""" rows = load_reviews_from_db(store_id) # your data access here handle = store.put(rows) return { "handle": handle, "total_rows": len(rows), "schema": ["text", "rating", "date", "lang"], "digest": summarize_locally(rows), # light stats only, e.g. counts per star "note": "Call fetch_rows(handle, offset, limit) if you need raw rows", }def fetch_rows(handle: str, offset: int, limit: int) -> dict: """The second tool, called only when the model asks for detail""" return store.slice(handle, offset, min(limit, 40))
Capping limit (40 rows here) is the operational trick. Without a cap, the model will politely ask for everything, and handle passing degenerates back into full-payload passing.
Tasks that genuinely need to touch every row — classification, say — shouldn't ride the conversation loop at all. Feed the handle's contents to a local batch process (small per-row calls or the Batch API) instead. The agent conversation keeps the judgment; the work happens outside the history.
Folding Old Turns — in an Order That Doesn't Break Thought Signatures
If you're retrofitting a running agent, cleaning up the history itself also pays off: before assembling the next request, replace the body of spent functionResponse parts with a stub like {"status": "processed", "handle": "res_xxxx"}.
One caution for Gemini 3-family models: thought signatures, used to validate multi-turn integrity, must not be deleted or altered. Only fold the functionResponse bodies you authored; leave model-produced parts, signatures included, untouched. I've written up the details in keeping thought signatures intact across multi-turn Function Calling.
And if what you actually want to compress is a long user conversation, that's a different problem with different constraints — covered in compressing Gemini chat history with rolling summaries. This article is strictly about the residency of machine-made payloads.
Measured Results: Input Tokens Down ~8x, p95 from 34s to 9s
After two weeks of running the review agent with handle passing plus stub folding:
Metric
Before
After
Total input tokens per run
~414,000
~52,000
p95 latency, full run
34s
9s
Monthly cost of this job (Flash-tier, measured then)
~¥1,340
~¥180
Extra fetch_rows calls
—
avg 1.3 per run
To be honest, it isn't all upside. Those fetch_rows calls add an average of 1.3 extra turns per run, so the loop is longer. But eliminating the re-sends dominates: total time and cost both dropped sharply. The richer your local digest, the fewer detail fetches you'll see — I treat that as a dial to tune in production rather than something to get right up front.
Pitfalls — Compact Too Hard and the Model Comes Back for More
Three I actually hit:
Over-tightening the budget creates a re-fetch loop. At a 2,000-token budget, the model called fetch_more four times in a row, and turn count went up instead of down. Start the budget at "all rows of the fields the task needs," and only tighten while measuring.
Vague handle lifetimes. My store wasn't persisted across process restarts, so a retried run dereferenced a stale handle and got nothing. Making ResultStore fail loudly ("does not exist in this run") let the model recover naturally by calling fetch_reviews again.
Breaking correspondence under parallel function calls. When multiple functionCall parts arrive in one turn, your functionResponse parts must match them in order and name. My stub-folding once disturbed that pairing and earned an INVALID_ARGUMENT. Fold bodies only; never touch part ordering.
Before debating designs, add the one-line log_turn_tokens to your existing loop and look at a single run's trajectory. If input jumps in a staircase from turn two onward, the compaction and handle passing here will apply directly. If it doesn't, your loop is still healthy — and confirming that takes five minutes, which is the safest possible place to start.
Share
Thank You for Reading
Gemini Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.