Advanced · 2026-07-03

Your Night Batch Is Causing the Morning 429s — Priority Admission Control for a Shared Gemini Quota

When bulk jobs and interactive features share one project's RPM/TPM, the bulk lane wins by default. A priority token bucket design with measurements: 429 rate 3.2% down to 0.03%.

Around 8 a.m., our user-facing generation feature started throwing bursts of 429s — and only that feature. I noticed the pattern three days after I began running a nightly job that regenerates localized app descriptions in bulk. The job was scheduled for 2 a.m., but as the item count grew, its tail crept into the morning, right into the hours when real users show up. As an indie developer I tend to run several features inside a single Google Cloud project, and this shape of failure — features fighting each other for one quota — is something you will hit sooner or later if you do the same.

Gemini API rate limits (RPM and TPM) apply per model, per project. As long as everything calls from the same project, the chat feature a user is actively waiting on and the batch job nobody is waiting on draw from the same bucket. This article records how I turned that contract into a design that protects the interactive lane, with the measurements before and after.

First, Figure Out Who Actually Emptied the Bucket

429 discussions usually jump straight to retry strategy, but retries assume the quota will recover. When another feature is continuously consuming it, retries just lengthen the queue — and retry amplification makes the consumption worse. I covered how to classify retryable versus non-retryable 429s in my piece on 429 retry design by root cause; this time we start one step earlier: attribution.

The first thing I did was force every Gemini call through a thin wrapper that requires a feature tag.

import time
import threading
from collections import defaultdict, deque
from google import genai
 
client = genai.Client()
 
class TaggedGeminiClient:
    """Force a feature tag on every call and record per-minute usage."""
 
    def __init__(self, client: genai.Client):
        self._client = client
        self._lock = threading.Lock()
        # feature -> deque[(epoch_minute, requests, tokens)]
        self._usage = defaultdict(lambda: deque(maxlen=180))
 
    def generate(self, *, feature: str, model: str, contents, config=None):
        resp = self._client.models.generate_content(
            model=model, contents=contents, config=config
        )
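        used = resp.usage_metadata
        # usage_metadata is optional on the response; count zero tokens if it is absent
        prompt_tokens = (used.prompt_token_count or 0) if used else 0
        output_tokens = (used.candidates_token_count or 0) if used else 0
        total = prompt_tokens + output_tokens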
        minute = int(time.time() // 60)
        with self._lock:
            bucket = self._usage[feature]
            if bucket and bucket[-1][0] == minute:
                m, r, t = bucket[-1]
                bucket[-1] = (m, r + 1, t + total)
            else:
                bucket.append((minute, 1, total))
        return resp
 
    def snapshot(self, last_minutes: int = 60):
        """Per-feature RPM / TPM over the last N minutes."""
        cutoff = int(time.time() // 60) - last_minutes
        out = {}
        with self._lock:
            for feature, buckets in self._usage.items():
                rows = [b for b in buckets if b[0] >= cutoff]
                if rows:
                    out[feature] = {
                        # averaged over minutes that saw at least one call, not over all N minutes
                        "avg_rpm": sum(r for _, r, _ in rows) / len(rows),
                        "peak_rpm": max(r for _, r, _ in rows),
                        "peak_tpm": max(t for _, _, t in rows),
                    }
        return out

Because it aggregates usage_metadata, you get real token counts for free. After a week of data, the picture was unambiguous: during the 7–9 a.m. peak, the bulk regeneration job owned 82% of RPM. The interactive 429s weren't caused by interactive growth at all — the batch tail had simply reached the morning.

Feature	Call pattern	Peak RPM share	Tolerable delay
Interactive generation (user-facing)	Morning/evening spikes	14%	Seconds — users feel it
Bulk description regeneration	Starts at night, runs for hours	82%	Hours — nobody is waiting
Notification draft generation	Sporadic	4%	Minutes

The "tolerable delay" column is the entire design. The problem was never total capacity; it was that workloads with wildly different delay tolerance were drawing from one bucket at equal priority.
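
One way to derive a share column like the one above is to find the busiest minute in a window and split its requests by feature. The following is a minimal sketch, assuming the per-minute rows are exported as (epoch_minute, requests, tokens) tuples keyed by feature, the same shape TaggedGeminiClient keeps internally. peak_rpm_share is a hypothetical helper and the sample numbers are made up for illustration.

```python
from collections import defaultdict


def peak_rpm_share(usage, start_minute, end_minute):
    """Return each feature's share of requests in the busiest minute of a window.

    usage: dict mapping feature -> iterable of (epoch_minute, requests, tokens).
    The window is [start_minute, end_minute), in epoch minutes.
    """
    per_minute = defaultdict(lambda: defaultdict(int))  # minute -> feature -> requests
    for feature, rows in usage.items():
        for minute, requests, _tokens in rows:
            if start_minute <= minute < end_minute:
                per_minute[minute][feature] += requests
    if not per_minute:
        return {}
    # The busiest minute is the one with the most requests across all features.
    busiest = max(per_minute, key=lambda m: sum(per_minute[m].values()))
    total = sum(per_minute[busiest].values())
    return {f: n / total for f, n in per_minute[busiest].items()}


# Made-up numbers for illustration only.
sample = {
    "chat": [(100, 10, 12_000), (101, 14, 16_800)],
    "bulk_regen": [(100, 70, 196_000), (101, 82, 229_600)],
    "notify_draft": [(101, 4, 3_200)],
}
shares = peak_rpm_share(sample, start_minute=100, end_minute=102)
```

Swapping the request count for the token count in the same loop gives the TPM share, which is worth checking separately because the two rankings can differ.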

Before/After — Putting an Admission Gate in Front of Every Call

Before, each feature called the SDK directly, the obvious way:

# Before: every feature calls whenever it likes
# interactive handler
resp = client.models.generate_content(model="gemini-flash-latest", contents=user_prompt)
 
# bulk worker (loops over thousands of items)
for item in items:
    resp = client.models.generate_content(model="gemini-flash-latest", contents=build_prompt(item))

The flaw is that you only learn about congestion when the API-side rate limiter tells you — and by the time a 429 arrives, the distinction between an interactive request and a bulk request is gone. The fix is to do the traffic sorting on your side, in front of the API, with admission control:

# After: every call passes through a priority-aware gate
gate = PriorityAdmissionGate(rpm_limit=1000, tpm_limit=1_000_000,
                             reserved_interactive_ratio=0.3)
 
# interactive handler (async_generate stands in for your own async SDK wrapper)
async with gate.acquire(feature="chat", priority="interactive", est_tokens=1200):
    resp = await async_generate(user_prompt)
 
# bulk worker
async with gate.acquire(feature="bulk_regen", priority="bulk", est_tokens=2800):
    resp = await async_generate(build_prompt(item))

The gate's contract has exactly two clauses. A fixed share of capacity (30% here) is always reserved for interactive traffic. Bulk may use everything else — and may borrow the reserved share while interactive is idle — but borrowing never flows the other way.

The Priority Token Bucket

Internally it's a classic token bucket split into two lanes. The only LLM-specific part is that it meters both RPM and TPM.

import asyncio
import time
from contextlib import asynccontextmanager
 
class PriorityAdmissionGate:
    def __init__(self, rpm_limit: int, tpm_limit: int,
                 reserved_interactive_ratio: float = 0.3,
                 safety_margin: float = 0.85):
        # stop short of the API's real ceiling (headroom for other processes)
        self._rpm = rpm_limit * safety_margin
        self._tpm = tpm_limit * safety_margin
        self._reserved = reserved_interactive_ratio
        self._req_tokens = self._rpm   # request-count bucket
        self._tok_tokens = self._tpm   # token-count bucket
        self._last = time.monotonic()
        self._cond = asyncio.Condition()
 
    def _refill(self):
        now = time.monotonic()
        elapsed = now - self._last
        self._last = now
        self._req_tokens = min(self._rpm, self._req_tokens + self._rpm * elapsed / 60)
        self._tok_tokens = min(self._tpm, self._tok_tokens + self._tpm * elapsed / 60)
 
    def _floor(self, priority: str) -> tuple[float, float]:
        """Bulk must stop above the reserved line; interactive can drain to zero."""
        if priority == "interactive":
            return (0.0, 0.0)
        return (self._rpm * self._reserved, self._tpm * self._reserved)
 
    @asynccontextmanager
    async def acquire(self, *, feature: str, priority: str, est_tokens: int):
        async with self._cond:
            req_floor, tok_floor = self._floor(priority)
            while True:
                self._refill()
                if (self._req_tokens - 1 >= req_floor
                        and self._tok_tokens - est_tokens >= tok_floor):
                    self._req_tokens -= 1
                    self._tok_tokens -= est_tokens
                    break
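                # not enough headroom: wait up to 1 s, then loop to refill and re-check
                # (bulk decelerates naturally). wait_for raises on timeout, and a
                # timeout here is the normal path, so swallow it instead of failing.
                try:
                    await asyncio.wait_for(self._cond.wait(), timeout=1.0)
                except asyncio.TimeoutError:
                    pass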
        try:
            yield
        finally:
            async with self._cond:
                self._cond.notify_all()

The heart of it is _floor(). The bulk lane is forced to wait before the bucket drops below the 30% reserve line, so an arriving interactive request finds capacity immediately unless interactive traffic itself has already drained the reserve. The bucket refills at the full rate regardless of lane, so at night, when interactive traffic is dead, bulk sits at the reserve line and consumes every refilled token, and total throughput barely suffers. In my nighttime measurements, this borrowing behavior gave bulk 22% more effective throughput than a naive static split that permanently carves out 30%.
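
To see the floor rule in isolation, here is a simplified model of it: synchronous, one dimension only (requests), with refill driven by an explicit call rather than a clock. TwoLaneBucket is a hypothetical name for this sketch, not the gate itself, and the capacity of 100 is chosen only to keep the arithmetic readable.

```python
class TwoLaneBucket:
    """Synchronous, single-dimension model of the gate's floor rule (requests only)."""

    def __init__(self, capacity: float, reserved_ratio: float):
        self.capacity = capacity
        self.floor = capacity * reserved_ratio  # bulk may not go below this level
        self.level = capacity                   # bucket starts full

    def refill(self, seconds: float) -> None:
        # Refill at capacity-per-minute, capped at capacity.
        self.level = min(self.capacity, self.level + self.capacity * seconds / 60)

    def try_acquire(self, priority: str) -> bool:
        floor = 0.0 if priority == "interactive" else self.floor
        if self.level - 1 >= floor:
            self.level -= 1
            return True
        return False


bucket = TwoLaneBucket(capacity=100, reserved_ratio=0.3)

# Bulk bursts until it is refused: it stops at the reserve line.
bulk_admitted = 0
while bucket.try_acquire("bulk"):
    bulk_admitted += 1

# Interactive traffic can still drain the reserved 30 requests.
interactive_admitted = 0
while bucket.try_acquire("interactive"):
    interactive_admitted += 1

# After 6 seconds the bucket has refilled 10 requests, still below the floor,
# so bulk stays blocked while interactive is admitted again.
bucket.refill(6)
bulk_after_refill = bucket.try_acquire("bulk")
interactive_after_refill = bucket.try_acquire("interactive")
```

The last two lines show the priority effect under contention: while the level is below the reserve line, every refilled token goes to the interactive lane first.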

est_tokens is a pre-call estimate. You could call count_tokens, but that costs a request of its own, so for this purpose I use a rough heuristic (characters ÷ 2 plus the output cap) and correct afterward with the real usage_metadata numbers; the gate listing above does not show that correction step. The estimate ran about 11% high on average over two weeks, which is accurate enough for preventing quota exhaustion, and that is all it needs to do.
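
The estimate-then-correct step could look roughly like this. It is a sketch under assumptions: estimate_tokens and settle are hypothetical helpers, the 850,000 capacity is the example gate's 1,000,000 TPM times the 0.85 safety margin, and the "actual" figure stands in for the prompt plus candidate token counts read from usage_metadata after the call.

```python
def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    """Rough pre-call estimate: characters / 2 plus the output cap."""
    return len(prompt) // 2 + max_output_tokens


def settle(bucket_level: float, capacity: float, estimated: int, actual: int) -> float:
    """Return the token-bucket level after replacing the estimate with real usage.

    An over-estimate is refunded (capped at capacity). An under-estimate is
    charged and may push the level below zero, which delays the next admission.
    """
    return min(capacity, bucket_level + (estimated - actual))


prompt = "x" * 2_000                                   # stand-in for a 2,000-character prompt
est = estimate_tokens(prompt, max_output_tokens=800)   # 1000 + 800
level_after_admit = 850_000 - est                      # what the gate deducted up front
level_after_settle = settle(level_after_admit, 850_000, estimated=est, actual=1_620)
```

Erring on the high side is the safer direction: an over-estimate only costs a little throughput until the refund lands, while an under-estimate spends quota the gate believes it still has.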

The Week RPM-Only Control Failed — TPM Is a Separate Cliff

My first implementation only metered RPM. It was fine for a week, then 429s came back the day I added long-document summarization to the bulk queue — with RPM nowhere near the limit. The culprit was TPM. A task with ~9,000 prompt tokens per item is trivial in request count and voracious in token count.

Control scheme	Short-prompt week	After adding long-document tasks
RPM only	Near-zero 429s	Morning 429s return (TPM exceeded)
RPM + TPM (two-dimensional)	Near-zero 429s	Near-zero 429s (bulk self-throttles)

Per-request cost varying by an order of magnitude between features is characteristic of LLM workloads. Meter only one dimension and the "few but heavy" feature becomes a bypass. That failure is why the gate above carries two buckets from the start.
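
The arithmetic behind that cliff is short enough to write down. This sketch uses the example gate's limits (1000 RPM, 1,000,000 TPM) as illustrative values, not real quota figures; the 800-token request is a made-up stand-in for a short prompt, and the 9,000-token figure counts prompt tokens only. max_sustainable_rpm is a hypothetical helper.

```python
def max_sustainable_rpm(rpm_limit: float, tpm_limit: float, tokens_per_request: float):
    """Return (requests per minute the quota can sustain, which limit binds)."""
    rpm_from_tokens = tpm_limit / tokens_per_request
    if rpm_from_tokens < rpm_limit:
        return rpm_from_tokens, "TPM"
    return rpm_limit, "RPM"


# Illustrative limits and request sizes, not real quota values.
short_rate, short_binding = max_sustainable_rpm(1000, 1_000_000, 800)
long_rate, long_binding = max_sustainable_rpm(1000, 1_000_000, 9_000)
```

With these numbers the long-document task hits the token ceiling at roughly 111 requests per minute, about a ninth of the request ceiling, which is why an RPM-only meter never sees it coming.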

Bulk Etiquette — An External Slowdown Signal

Admission control is in-process traffic sorting; it can't see congestion the API sees (other machines, other processes, shifting limits). So my bulk workers carry one more brake: a simple circuit that watches the interactive lane's 429 rate and p95 latency at one-minute granularity and pauses bulk when either degrades.

import asyncio
import time
from collections import deque

async def bulk_worker(items, gate, health):
    for item in items:
        # if the interactive lane looks unhealthy, bulk quietly waits
        while not health.interactive_healthy():
            await asyncio.sleep(30)
        # estimate() and process() are your own helpers
        async with gate.acquire(feature="bulk_regen", priority="bulk",
                                est_tokens=estimate(item)):
            await process(item)

class InteractiveHealth:
    def __init__(self, max_429_rate=0.005, max_p95_sec=3.5):
        self._max_429 = max_429_rate
        self._max_p95 = max_p95_sec
        self.window = deque(maxlen=300)  # (ts, status, latency)

    def record(self, status: int, latency_sec: float) -> None:
        """Feed the window: call from the interactive handler after every attempt."""
        self.window.append((time.time(), status, latency_sec))

    def interactive_healthy(self) -> bool:
        cutoff = time.time() - 60
        recent = [w for w in self.window if w[0] > cutoff]
        if not recent:
            return True
        rate_429 = sum(1 for _, s, _ in recent if s == 429) / len(recent)
        lats = sorted(lat for _, _, lat in recent)
        # with fewer than 20 samples a p95 is noise, so skip the latency check
        p95 = lats[int(len(lats) * 0.95) - 1] if len(lats) >= 20 else 0
        return rate_429 <= self._max_429 and p95 <= self._max_p95

A batch job's greatest asset is that pausing it hurts nobody. Automate the decision to pause the side that doesn't hurt, and on mornings when the batch tail collides with the interactive peak, bulk yields the road before damage is done. This is the defensive twin of the throughput-maximizing approach in my article on adaptive concurrency for bulk workloads — designing the offense and the defense as a pair works well.

Why "Just Use a Separate API Key" Doesn't Work

My very first idea was to issue a second API key for the bulk job. It does nothing here. Gemini API rate limits are enforced per project, not per key — split one project's traffic across ten keys and they still drain the same bucket. Key separation is meaningful for credential hygiene and permission scoping, but it is not quota isolation.

Real isolation means a separate project. Project-level separation is genuinely useful for containing blast radius, and I covered the cost-side version of that argument in my piece on spend caps and blast radius. But every extra project multiplies billing, monitoring, and key-management overhead, so my decision table looks like this:

Situation	Recommendation
Features inside one app competing for quota	Keep one project; solve with priority admission control
Containing blast radius of a runaway or compromise	Separate projects (isolates both spend and quota)
Production vs. staging	Separate projects (experiments must not eat production quota)

Two Weeks of Numbers

Two weeks before versus two weeks after, measured application-side on first attempts (before any retry), interactive feature only:

Metric	Before	After
Interactive 429 rate (morning peak)	3.2%	0.03%
Interactive p95 latency (same window)	4.1 s	2.3 s
Bulk job completion time	42 min	58 min (+16 min)
Nighttime bulk effective throughput	—	+22% vs. a static split (reserve borrowing)

Yes, the bulk job takes 16 minutes longer — that is precisely the time it spent yielding to the interactive peak, which is the design working as intended. Sixteen minutes on a night batch and a 429 in front of a waiting user are not equivalent costs. Encoding that asymmetry into the call path is what admission control is really for.
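
Counting only first attempts matters because retries would otherwise dilute or inflate the rate depending on how the retry policy behaves. A minimal sketch of that calculation, assuming the application logs one record per attempt; CallRecord and first_attempt_429_rate are hypothetical names and the log is made up.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    feature: str
    attempt: int      # 1 for the first try, 2+ for retries
    status: int       # HTTP status code observed by the application


def first_attempt_429_rate(records, feature: str) -> float:
    """Share of first attempts for `feature` that were answered with 429."""
    firsts = [r for r in records if r.feature == feature and r.attempt == 1]
    if not firsts:
        return 0.0
    return sum(1 for r in firsts if r.status == 429) / len(firsts)


# Made-up log: 4 first attempts for "chat", one of them a 429 that was retried.
log = [
    CallRecord("chat", 1, 200),
    CallRecord("chat", 1, 429),
    CallRecord("chat", 2, 200),   # retry of the failed call, excluded from the rate
    CallRecord("chat", 1, 200),
    CallRecord("chat", 1, 200),
    CallRecord("bulk_regen", 1, 429),
]
rate = first_attempt_429_rate(log, "chat")
```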

One honest caveat: if you effectively have one feature, or every call shares the same delay tolerance, this gate is pure complexity. A plain concurrency cap and a sane retry policy will serve you better. Add the gate only after measurement shows actual contention — that ordering keeps the system honest.

Where to Start — One Feature Tag

If you can't yet name the feature causing your 429s, don't start with the gate. Start with a week of feature-tagged measurement like the TaggedGeminiClient at the top. Once the culprit is visible, the reserve ratio and thresholds practically choose themselves from the table. In my case, attribution took three days of data, and the gate itself was one weekend of implementation and tuning. If your project shares one quota across features that don't share a deadline, I hope these notes save you the morning I lost.
