GEMINI LAB
API / SDK · 2026-06-24 · Advanced

When Gemini API Quietly Dies on the Edge from Subrequest Limits — Field Notes on Budgeting What's Left

Running the Gemini API on Cloudflare Workers is calm until traffic rises or a tool chain deepens, and then it fails on the subrequest limit. Here are the instrumentation patterns I use to measure per-request consumption and treat it as a budget, drawn from the sites I run as an indie developer.

Tags: gemini-api, cloudflare-workers, edge-runtime, subrequest, observability


A Gemini API endpoint that had been stable in production for weeks started returning "Too many subrequests," but only during the busier windows of the day. The logs showed the usual single call to Gemini, yet the Workers subrequest count had climbed past 50. After I moved the sites I run as an indie developer onto Cloudflare Workers, this invisible-until-busy ceiling caught me twice. The first time cost me half a day; the second time, instrumentation caught it in five minutes. The difference was whether I treated subrequests as "something to fix when it errors" or as "a budget each request carries."

This is a working note on turning that budget mindset into code. Memorizing the threshold number matters far less than being able to measure how many subrequests your own request actually spends — that capability keeps paying off long after the incident.

Why it never reproduces locally

Cloudflare Workers caps the total number of outbound connections a single request can issue — fetch, the Cache API, KV, D1, R2, reaching a Durable Object, and so on. At the time of writing that is 50 on Free and 1,000 on Workers Standard. Your call to Gemini counts as one of them, naturally.
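One way to make the ceiling behave like a budget is to keep the limit as configuration and compute what is left. The following is a minimal sketch, assuming the two figures quoted above. The names Plan, SUBREQUEST_LIMIT, and headroom are hypothetical and are not part of any Cloudflare API, and the numbers should be checked against the current limits page before relying on them.

```typescript
// Hypothetical names throughout. The limits are the figures quoted in the
// text at the time of writing; treat them as configuration, not constants.
type Plan = "free" | "standard";

const SUBREQUEST_LIMIT: Record<Plan, number> = {
  free: 50,
  standard: 1000,
};

interface Headroom {
  limit: number;      // ceiling for the plan
  spent: number;      // subrequests already issued by this request
  remaining: number;  // never negative
  exhausted: boolean; // true once nothing is left
}

// Report how much of the per-request budget is still available.
function headroom(plan: Plan, spent: number): Headroom {
  const limit = SUBREQUEST_LIMIT[plan];
  const remaining = Math.max(0, limit - spent);
  return { limit, spent, remaining, exhausted: remaining === 0 };
}
```

A request on the Free plan that has already spent 25 has 25 left, which is the number worth logging, rather than waiting for the error.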

The trap is that connections you didn't explicitly write are counted too. Run Next.js on Workers and a cache-miss ASSETS fetch becomes a subrequest. Ship access logs from middleware and that's one per request. You think you're calling Gemini once, but the framework and surrounding work can silently stack 20 to 30 connections behind it.

Your local dev server has no such ceiling, so wrangler dev won't reproduce it. To observe production-equivalent behavior you need wrangler dev --remote; to watch live traffic you need wrangler tail. Diagnosis starts by getting those in place.

Count what's left before you swap models on a hunch

Seeing the error, the first instinct is to switch to a lighter model. But the subrequest ceiling has nothing to do with model weight. The first move is to learn, as a number, how many connections one request actually spends. Open wrangler tail, push real traffic through, and watch outbound connections per user action. That's where you finally notice the gap: "my code only fetches once, but I'm seeing 25."
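To get that number out of your own code, one option is to route outbound calls through a counting wrapper and log the total at the end of the request. This is a sketch of my own under stated assumptions, not the wrapper from the paid part of this article: createCountingFetch and FetchLike are hypothetical names, and absolute URLs are assumed. It only sees calls that actually go through it, so binding calls (KV, D1, R2) and anything a framework fetches internally will still be missing from the count.

```typescript
// Hypothetical helper: counts fetch calls made through it, grouped by host.
type FetchLike = (input: string | URL | Request, init?: RequestInit) => Promise<Response>;

interface CountingFetch {
  fetch: FetchLike;
  spent(): number;                  // total calls so far
  byHost(): Record<string, number>; // calls per destination host
}

function createCountingFetch(inner: FetchLike = fetch): CountingFetch {
  let total = 0;
  const hosts = new Map<string, number>();

  const counted: FetchLike = (input, init) => {
    // Assumes absolute URLs; a relative string would make new URL() throw.
    const url =
      typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
    const host = new URL(url).host;
    total += 1;
    hosts.set(host, (hosts.get(host) ?? 0) + 1);
    return inner(input, init);
  };

  return {
    fetch: counted,
    spent: () => total,
    byHost: () => Object.fromEntries(hosts),
  };
}
```

Create one counter per incoming request, hand counter.fetch to anything that accepts a custom fetch, and finish with something like console.log(JSON.stringify({ spent: counter.spent(), byHost: counter.byHost() })). The console.log output then shows up in wrangler tail next to the request it belongs to.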

In my experience the sources come down to three. The first is Function Calling chain depth: each time the model calls a tool, takes the result, and calls another, a new round trip is added. A depth-5 chain, counting the external APIs the tools hit, easily reaches double digits. The second is SDK auto-retry: @google/genai quietly retries on 429 or 503, so three attempts can be spent inside the one request the user sees. The third is the small, easy-to-miss per-request sends: logs, metrics, config fetches.
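The three sources can be put into rough arithmetic. The sketch below is a back-of-the-envelope model, not a measurement, and it rests on assumptions of mine: a chain of depth d makes d + 1 model calls (one per tool result plus the final answer), retries apply only to model calls, and every tool makes the same number of external calls. ChainShape and worstCaseSubrequests are hypothetical names.

```typescript
// Hypothetical worst-case estimate for one user-facing request.
interface ChainShape {
  chainDepth: number;      // tool calls made in sequence
  fetchesPerTool: number;  // external calls each tool makes
  maxAttempts: number;     // attempts per model call, including the first
  perRequestSends: number; // logs, metrics, config fetches
}

function worstCaseSubrequests(s: ChainShape): number {
  // Assumption: one model call per tool result, plus the final answer.
  const modelCalls = s.chainDepth + 1;
  return (
    modelCalls * s.maxAttempts +          // every model call retried to the limit
    s.chainDepth * s.fetchesPerTool +     // what the tools themselves fetch
    s.perRequestSends                     // the small sends around the request
  );
}
```

With a depth of 5, one fetch per tool, and no retries or extra sends, the estimate is 11. Allow three attempts per model call and three small sends, and it becomes 26, more than half of a 50-subrequest budget before any framework overhead is counted.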


WHAT YOU'LL LEARN
A BudgetedFetch wrapper that measures and surfaces the subrequests a single request actually spends
A diagnostic routine that isolates what tool chains, SDK auto-retries, and logging consume beneath the surface
A design for degrading gracefully before you hit the ceiling, using budget allocation and waitUntil batching
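For the last item, here is a rough sketch of what budget allocation and waitUntil batching could look like. It is my own illustration under assumptions, not the article's implementation: canAffordStep, createLogBatch, Sender, and WaitUntilContext are hypothetical names, and the reserve is simply whatever you decide to hold back for the final model call and one log flush. Sends made inside waitUntil are assumed to count against the same request's limit, so the saving comes from collapsing many sends into one.

```typescript
// Sketch only: every name here is hypothetical.
type Sender = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<unknown>;

// Minimal shape of the Workers execution context used here.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

// True if another step fits while keeping `reserve` subrequests untouched.
function canAffordStep(limit: number, spent: number, stepCost: number, reserve: number): boolean {
  return spent + stepCost + reserve <= limit;
}

// Collect log entries during the request and send them as one POST.
function createLogBatch(endpoint: string, send: Sender) {
  let entries: unknown[] = [];
  return {
    add(entry: unknown): void {
      entries.push(entry);
    },
    // Returns how many subrequests the flush spends: 0 or 1.
    flush(ctx: WaitUntilContext): number {
      if (entries.length === 0) return 0;
      const body = JSON.stringify(entries);
      entries = [];
      ctx.waitUntil(
        send(endpoint, {
          method: "POST",
          headers: { "content-type": "application/json" },
          body,
        }).catch(() => undefined), // a failed log send should not fail the request
      );
      return 1;
    },
  };
}
```

In a Worker, send would typically be (url, init) => fetch(url, init) and ctx the handler's execution context. When canAffordStep returns false, the tool loop can stop and answer with what it already has instead of running into the ceiling mid-chain.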
