Your Night Batch Is Causing the Morning 429s — Priority Admission Control for a Shared Gemini Quota
When bulk jobs and interactive features share one project's RPM/TPM, the bulk lane wins by default. A priority token bucket design with measurements: 429 rate 3.2% down to 0.03%.
Around 8 a.m., our user-facing generation feature started throwing bursts of 429s — and only that feature. I noticed the pattern three days after I began running a nightly job that regenerates localized app descriptions in bulk. The job was scheduled for 2 a.m., but as the item count grew, its tail crept into the morning, right into the hours when real users show up. As an indie developer I tend to run several features inside a single Google Cloud project, and this shape of failure — features fighting each other for one quota — is something you will hit sooner or later if you do the same.
Gemini API rate limits (RPM and TPM) apply per model, per project. As long as everything calls from the same project, the chat feature a user is actively waiting on and the batch job nobody is waiting on draw from the same bucket. This article records how I turned that contract into a design that protects the interactive lane, with the measurements before and after.
First, Figure Out Who Actually Emptied the Bucket
429 discussions usually jump straight to retry strategy, but retries assume the quota will recover. When another feature is continuously consuming it, retries just lengthen the queue — and retry amplification makes the consumption worse. I covered how to classify retryable versus non-retryable 429s in my piece on 429 retry design by root cause; this time we start one step earlier: attribution.
The first thing I did was force every Gemini call through a thin wrapper that requires a feature tag.
```python
import time
import threading
from collections import defaultdict, deque

from google import genai

client = genai.Client()


class TaggedGeminiClient:
    """Force a feature tag on every call and record per-minute usage."""

    def __init__(self, client: genai.Client):
        self._client = client
        self._lock = threading.Lock()
        # feature -> deque[(epoch_minute, requests, tokens)]
        self._usage = defaultdict(lambda: deque(maxlen=180))

    def generate(self, *, feature: str, model: str, contents, config=None):
        resp = self._client.models.generate_content(
            model=model, contents=contents, config=config
        )
        used = resp.usage_metadata
        total = (used.prompt_token_count or 0) + (used.candidates_token_count or 0)
        minute = int(time.time() // 60)
        with self._lock:
            bucket = self._usage[feature]
            if bucket and bucket[-1][0] == minute:
                m, r, t = bucket[-1]
                bucket[-1] = (m, r + 1, t + total)
            else:
                bucket.append((minute, 1, total))
        return resp

    def snapshot(self, last_minutes: int = 60):
        """Per-feature RPM / TPM over the last N minutes."""
        cutoff = int(time.time() // 60) - last_minutes
        out = {}
        with self._lock:
            for feature, buckets in self._usage.items():
                rows = [b for b in buckets if b[0] >= cutoff]
                if rows:
                    out[feature] = {
                        "avg_rpm": sum(r for _, r, _ in rows) / len(rows),
                        "peak_rpm": max(r for _, r, _ in rows),
                        "peak_tpm": max(t for _, _, t in rows),
                    }
        return out
```
Because it aggregates usage_metadata, you get real token counts for free. After a week of data, the picture was unambiguous: during the 7–9 a.m. peak, the bulk regeneration job owned 82% of RPM. The interactive 429s weren't caused by interactive growth at all — the batch tail had simply reached the morning.
| Feature | Call pattern | Peak RPM share | Tolerable delay |
| --- | --- | --- | --- |
| Interactive generation (user-facing) | Morning/evening spikes | 14% | Seconds (users feel it) |
| Bulk description regeneration | Starts at night, runs for hours | 82% | Hours (nobody is waiting) |
| Notification draft generation | Sporadic | 4% | Minutes |
The "tolerable delay" column is the entire design. The problem was never total capacity; it was that workloads with wildly different delay tolerance were drawing from one bucket at equal priority.
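As a rough illustration of how the share column can be derived, here is a minimal sketch. The request counts are hypothetical values chosen to reproduce the percentages in the table, not my raw data, and `rpm_share` is a helper name invented for this example.

```python
def rpm_share(requests_by_feature: dict[str, int]) -> dict[str, float]:
    """Fraction of all requests in the chosen window that each feature made."""
    total = sum(requests_by_feature.values())
    if total == 0:
        return {name: 0.0 for name in requests_by_feature}
    return {name: count / total for name, count in requests_by_feature.items()}


# Hypothetical request counts for one peak window (illustrative only).
peak_window = {"interactive": 140, "bulk_regen": 820, "notification": 40}
shares = rpm_share(peak_window)

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14} {share:.0%}")
```

Sorting by share puts the heaviest consumer first, which is usually the only row you need to look at.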
Before/After — Putting an Admission Gate in Front of Every Call
Before, each feature called the SDK directly, the obvious way:
```python
# Before: every feature calls whenever it likes

# interactive handler
resp = client.models.generate_content(model="gemini-flash-latest", contents=user_prompt)

# bulk worker (loops over thousands of items)
for item in items:
    resp = client.models.generate_content(
        model="gemini-flash-latest", contents=build_prompt(item)
    )
```
The flaw is that you only learn about congestion when the API-side rate limiter tells you — and by the time a 429 arrives, the distinction between an interactive request and a bulk request is gone. The fix is to do the traffic sorting on your side, in front of the API, with admission control:
```python
# After: every call passes through a priority-aware gate
gate = PriorityAdmissionGate(
    rpm_limit=1000, tpm_limit=1_000_000, reserved_interactive_ratio=0.3
)

# interactive handler
async with gate.acquire(feature="chat", priority="interactive", est_tokens=1200):
    resp = await async_generate(user_prompt)

# bulk worker
async with gate.acquire(feature="bulk_regen", priority="bulk", est_tokens=2800):
    resp = await async_generate(build_prompt(item))
```
The gate's contract has exactly two clauses. A fixed share of capacity (30% here) is always reserved for interactive traffic. Bulk may use everything else — and may borrow the reserved share while interactive is idle — but borrowing never flows the other way.
Internally it's a classic token bucket split into two lanes. The only LLM-specific part is that it meters both RPM and TPM.
```python
import asyncio
import time
from contextlib import asynccontextmanager


class PriorityAdmissionGate:
    def __init__(self, rpm_limit: int, tpm_limit: int,
                 reserved_interactive_ratio: float = 0.3,
                 safety_margin: float = 0.85):
        # stop short of the API's real ceiling (headroom for other processes)
        self._rpm = rpm_limit * safety_margin
        self._tpm = tpm_limit * safety_margin
        self._reserved = reserved_interactive_ratio
        self._req_tokens = self._rpm  # request-count bucket
        self._tok_tokens = self._tpm  # token-count bucket
        self._last = time.monotonic()
        self._cond = asyncio.Condition()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self._last
        self._last = now
        self._req_tokens = min(self._rpm, self._req_tokens + self._rpm * elapsed / 60)
        self._tok_tokens = min(self._tpm, self._tok_tokens + self._tpm * elapsed / 60)

    def _floor(self, priority: str) -> tuple[float, float]:
        """Bulk must stop above the reserved line; interactive can drain to zero."""
        if priority == "interactive":
            return (0.0, 0.0)
        return (self._rpm * self._reserved, self._tpm * self._reserved)

    @asynccontextmanager
    async def acquire(self, *, feature: str, priority: str, est_tokens: int):
        async with self._cond:
            req_floor, tok_floor = self._floor(priority)
            while True:
                self._refill()
                if (self._req_tokens - 1 >= req_floor
                        and self._tok_tokens - est_tokens >= tok_floor):
                    self._req_tokens -= 1
                    self._tok_tokens -= est_tokens
                    break
                # not enough headroom: wait for refill (bulk decelerates naturally)
                try:
                    await asyncio.wait_for(self._cond.wait(), timeout=1.0)
                except asyncio.TimeoutError:
                    # the timeout is the normal wake-up path: loop, refill, re-check
                    pass
        try:
            yield
        finally:
            # wake waiters so they re-check the buckets after this call finishes
            async with self._cond:
                self._cond.notify_all()
```
The heart of it is _floor(). The bulk lane is forced to wait before the bucket drops below the 30% reserve line, so an arriving interactive request always finds capacity immediately. At night, when interactive traffic is dead, bulk can run right down to the reserve line — so total throughput barely suffers. In my nighttime measurements, this borrowing behavior gave bulk 22% more effective throughput than a naive static split that permanently carves out 30%.
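To make the floor rule concrete, here is a deliberately simplified, synchronous sketch with no refill and no asyncio. It is not the gate itself, and the capacity of 100 requests is a hypothetical number. It only shows where each lane stops.

```python
def admit(bucket: float, cost: float, floor: float) -> bool:
    """Admit a request only if the bucket stays at or above the lane's floor."""
    return bucket - cost >= floor


capacity = 100.0            # hypothetical request capacity for one minute
reserve = capacity * 0.3    # the interactive reserve line (30%)
bucket = capacity

# Bulk keeps taking one request at a time until the next one would cross the line.
bulk_admitted = 0
while admit(bucket, 1, reserve):
    bucket -= 1
    bulk_admitted += 1

bulk_blocked = not admit(bucket, 1, reserve)   # bulk has to wait here
interactive_ok = admit(bucket, 1, 0.0)         # interactive still gets through

print(bulk_admitted, bucket, bulk_blocked, interactive_ok)
```

Bulk stops with the bucket sitting exactly on the reserve line, and an interactive request arriving at that moment is still admitted. In the real gate the bucket keeps refilling, so at night bulk simply consumes at the full refill rate while hovering at that line.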
est_tokens is a pre-call estimate. You could call count_tokens, but that costs a request of its own, so for this purpose I use a rough heuristic (characters ÷ 2 plus the output cap) and correct afterward with the real usage_metadata numbers. The estimate ran about +11% high on average over two weeks — plenty accurate for preventing quota exhaustion, which is all it needs to do.
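A minimal sketch of that heuristic and the after-the-fact correction might look like this. The function names are my own for illustration, and how the correction is fed back into the gate is left open; the gate shown above has no such hook.

```python
def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    """Rough pre-call estimate: characters divided by 2, plus the output cap."""
    return len(prompt) // 2 + max_output_tokens


def settlement(estimated: int, actual: int) -> int:
    """Tokens to give back to the bucket (positive) or charge extra (negative)."""
    return estimated - actual


# Hypothetical call: a 2,000-character prompt with a 1,024-token output cap.
est = estimate_tokens("x" * 2000, max_output_tokens=1024)

# Suppose usage_metadata later reports 1,800 tokens in total.
refund = settlement(est, actual=1800)
print(est, refund)
```

Overestimating is the safe direction here: a positive refund only means the gate was briefly more conservative than it needed to be.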
The Week RPM-Only Control Failed — TPM Is a Separate Cliff
My first implementation only metered RPM. It was fine for a week, then 429s came back the day I added long-document summarization to the bulk queue — with RPM nowhere near the limit. The culprit was TPM. A task with ~9,000 prompt tokens per item is trivial in request count and voracious in token count.
| Control scheme | Short-prompt week | After adding long-document tasks |
| --- | --- | --- |
| RPM only | Near-zero 429s | Morning 429s return (TPM exceeded) |
| RPM + TPM (two-dimensional) | Near-zero 429s | Near-zero 429s (bulk self-throttles) |
Per-request cost varying by an order of magnitude between features is characteristic of LLM workloads. Meter only one dimension and the "few but heavy" feature becomes a bypass. That failure is why the gate above carries two buckets from the start.
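The arithmetic behind that is short enough to sketch. The limits below are hypothetical round numbers rather than my project's actual quota, and `binding_limit` is a helper invented for this example.

```python
def binding_limit(rpm_limit: int, tpm_limit: int,
                  tokens_per_request: int) -> tuple[str, float]:
    """Return which limit is hit first and the sustainable requests per minute."""
    by_tokens = tpm_limit / tokens_per_request
    if by_tokens < rpm_limit:
        return ("TPM", by_tokens)
    return ("RPM", float(rpm_limit))


# Hypothetical limits: 1,000 RPM and 1,000,000 TPM.
short_prompt = binding_limit(1000, 1_000_000, tokens_per_request=500)
long_document = binding_limit(1000, 1_000_000, tokens_per_request=9000)

print(short_prompt)    # RPM binds: tokens would allow 2,000 requests per minute
print(long_document)   # TPM binds at roughly 111 requests per minute
```

Under these assumed limits, a 9,000-token task exhausts TPM at about a ninth of the request ceiling, which is why an RPM-only meter never sees it coming.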
Bulk Etiquette — An External Slowdown Signal
Admission control is in-process traffic sorting; it can't see congestion the API sees (other machines, other processes, shifting limits). So my bulk workers carry one more brake: a simple circuit that watches the interactive lane's 429 rate and p95 latency at one-minute granularity and pauses bulk when either degrades.
```python
import asyncio
import time
from collections import deque


async def bulk_worker(items, gate, health):
    # estimate() and process() are the application's own helpers
    for item in items:
        # if the interactive lane looks unhealthy, bulk quietly waits
        while not health.interactive_healthy():
            await asyncio.sleep(30)
        async with gate.acquire(feature="bulk_regen", priority="bulk",
                                est_tokens=estimate(item)):
            await process(item)


class InteractiveHealth:
    def __init__(self, max_429_rate=0.005, max_p95_sec=3.5):
        self._max_429 = max_429_rate
        self._max_p95 = max_p95_sec
        self.window = deque(maxlen=300)  # (ts, status, latency)

    def interactive_healthy(self) -> bool:
        # only the last 60 seconds of interactive calls count
        recent = [w for w in self.window if w[0] > time.time() - 60]
        if not recent:
            return True
        rate_429 = sum(1 for _, s, _ in recent if s == 429) / len(recent)
        lats = sorted(l for _, _, l in recent)
        # with fewer than 20 samples, p95 is too noisy to act on
        p95 = lats[int(len(lats) * 0.95) - 1] if len(lats) >= 20 else 0
        return rate_429 <= self._max_429 and p95 <= self._max_p95
```
A batch job's greatest asset is that pausing it hurts nobody. Automate the decision to pause the side that doesn't hurt, and on mornings when the batch tail collides with the interactive peak, bulk yields the road before damage is done. This is the defensive twin of the throughput-maximizing approach in my article on adaptive concurrency for bulk workloads — designing the offense and the defense as a pair works well.
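The health check only works if the interactive lane actually writes into `health.window`. One way to do that is a small recording helper like the sketch below. `record_outcome` is a hypothetical name, and it assumes that a failed call raises an exception carrying an integer `code` attribute with the HTTP status; adjust that to whatever your SDK's error type really exposes.

```python
import time
from collections import deque


def record_outcome(window, fn, *, clock=time.time, timer=time.monotonic):
    """Run fn() and append (timestamp, status, latency_sec) to window.

    Assumption: exceptions expose the HTTP status as an integer `code`
    attribute; anything without one is recorded as 500.
    """
    start = timer()
    status = 200
    try:
        return fn()
    except Exception as exc:
        status = getattr(exc, "code", 500)
        raise
    finally:
        # runs on both success and failure, so every call leaves one row
        window.append((clock(), status, timer() - start))


# Usage with a stand-in window; in the article's code this is health.window.
window = deque(maxlen=300)
result = record_outcome(window, lambda: "ok")
```

The exception is re-raised after recording, so the caller's existing retry and error handling stay untouched.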
Why "Just Use a Separate API Key" Doesn't Work
My very first idea was to issue a second API key for the bulk job. It does nothing here. Gemini API rate limits are enforced per project, not per key — split one project's traffic across ten keys and they still drain the same bucket. Key separation is meaningful for credential hygiene and permission scoping, but it is not quota isolation.
Real isolation means a separate project. Project-level separation is genuinely useful for containing blast radius, and I covered the cost-side version of that argument in my piece on spend caps and blast radius. But every extra project multiplies billing, monitoring, and key-management overhead, so my decision table looks like this:
| Situation | Recommendation |
| --- | --- |
| Features inside one app competing for quota | Keep one project; solve with priority admission control |
| Containing blast radius of a runaway or compromise | Separate projects (isolates both spend and quota) |
| Production vs. staging | Separate projects (experiments must not eat production quota) |
Two Weeks of Numbers
Two weeks before versus two weeks after, measured application-side on first attempts (before any retry), interactive feature only:
| Metric | Before | After |
| --- | --- | --- |
| Interactive 429 rate (morning peak) | 3.2% | 0.03% |
| Interactive p95 latency (same window) | 4.1 s | 2.3 s |
| Bulk job completion time | 42 min | 58 min (+16 min) |
| Nighttime bulk effective throughput | n/a | +22% vs. a static split (reserve borrowing) |
Yes, the bulk job takes 16 minutes longer — that is precisely the time it spent yielding to the interactive peak, which is the design working as intended. Sixteen minutes on a night batch and a 429 in front of a waiting user are not equivalent costs. Encoding that asymmetry into the call path is what admission control is really for.
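For anyone reproducing the 429 rate, the "first attempts only" rule matters, because retries that eventually succeed would otherwise hide the problem. A minimal sketch of that calculation, assuming each logged call is a hypothetical `(attempt_number, status)` pair:

```python
def first_attempt_429_rate(records) -> float:
    """Share of first attempts that returned 429; retries are ignored."""
    first = [status for attempt, status in records if attempt == 1]
    if not first:
        return 0.0
    return sum(1 for status in first if status == 429) / len(first)


# Hypothetical log: four first attempts, one of them a 429 that succeeded on retry.
sample = [(1, 200), (1, 429), (2, 200), (1, 200), (1, 200)]
rate = first_attempt_429_rate(sample)
print(f"{rate:.0%}")
```

Counting the successful retry as a success would report 0% for this sample, even though one in four users hit a 429 first.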
One honest caveat: if you effectively have one feature, or every call shares the same delay tolerance, this gate is pure complexity. A plain concurrency cap and a sane retry policy will serve you better. Add the gate only after measurement shows actual contention — that ordering keeps the system honest.
Where to Start — One Feature Tag
If you can't yet name the feature causing your 429s, don't start with the gate. Start with a week of feature-tagged measurement like the TaggedGeminiClient at the top. Once the culprit is visible, the reserve ratio and thresholds practically choose themselves from the table. In my case, attribution took three days of data, and the gate itself was one weekend of implementation and tuning. If your project shares one quota across features that don't share a deadline, I hope these notes save you the morning I lost.