GEMINI LAB
Advanced · 2026-07-03

Your Night Batch Is Causing the Morning 429s — Priority Admission Control for a Shared Gemini Quota

When bulk jobs and interactive features share one project's RPM/TPM, the bulk lane wins by default. A priority token bucket design with measurements: 429 rate 3.2% down to 0.03%.

Tags: Gemini API, rate limits, architecture, token bucket, operations, production


Around 8 a.m., our user-facing generation feature started throwing bursts of 429s — and only that feature. I noticed the pattern three days after I began running a nightly job that regenerates localized app descriptions in bulk. The job was scheduled for 2 a.m., but as the item count grew, its tail crept into the morning, right into the hours when real users show up. As an indie developer I tend to run several features inside a single Google Cloud project, and this shape of failure — features fighting each other for one quota — is something you will hit sooner or later if you do the same.

Gemini API rate limits (RPM and TPM) apply per model, per project. As long as everything calls from the same project, the chat feature a user is actively waiting on and the batch job nobody is waiting on draw from the same bucket. This article records how I turned that constraint into a design that protects the interactive lane, with measurements from before and after.

First, Figure Out Who Actually Emptied the Bucket

429 discussions usually jump straight to retry strategy, but retries assume the quota will recover. When another feature is continuously consuming it, retries just lengthen the queue — and retry amplification makes the consumption worse. I covered how to classify retryable versus non-retryable 429s in my piece on 429 retry design by root cause; this time we start one step earlier: attribution.
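To put a rough number on retry amplification, here is a minimal sketch. It assumes every attempt is rejected independently with the same probability, which is a simplification (real congestion is correlated), and `expected_attempts` is a hypothetical helper, not something from my codebase.

```python
def expected_attempts(p_reject: float, max_retries: int) -> float:
    """Expected API attempts per logical request.

    Assumes each attempt is rejected independently with probability p_reject
    and that the caller gives up after max_retries retries.
    Attempt i (0-based) only happens if the previous i attempts were rejected,
    so the expectation is the sum of p_reject**i for i in 0..max_retries.
    """
    return sum(p_reject ** i for i in range(max_retries + 1))


if __name__ == "__main__":
    # Illustrative rejection rates, not measurements.
    for p in (0.05, 0.3, 0.6):
        print(f"p_reject={p:.2f} -> {expected_attempts(p, 3):.3f} attempts per request")
```

The multiplier stays close to 1 while rejections are rare and grows as congestion gets worse, which is exactly when extra load hurts most.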

The first thing I did was force every Gemini call through a thin wrapper that requires a feature tag.

import time
import threading
from collections import defaultdict, deque
from google import genai
 
client = genai.Client()
 
class TaggedGeminiClient:
    """Force a feature tag on every call and record per-minute usage."""
 
    def __init__(self, client: genai.Client):
        self._client = client
        self._lock = threading.Lock()
        # feature -> deque[(epoch_minute, requests, tokens)]
        self._usage = defaultdict(lambda: deque(maxlen=180))
 
    def generate(self, *, feature: str, model: str, contents, config=None):
        resp = self._client.models.generate_content(
            model=model, contents=contents, config=config
        )
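        used = resp.usage_metadata
        # usage_metadata is optional on the response; count zero tokens if it is missing.
        total = 0
        if used is not None:
            total = (used.prompt_token_count or 0) + (used.candidates_token_count or 0)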
        minute = int(time.time() // 60)
        with self._lock:
            bucket = self._usage[feature]
            if bucket and bucket[-1][0] == minute:
                m, r, t = bucket[-1]
                bucket[-1] = (m, r + 1, t + total)
            else:
                bucket.append((minute, 1, total))
        return resp
 
    def snapshot(self, last_minutes: int = 60):
        """Per-feature RPM / TPM over the last N minutes."""
        cutoff = int(time.time() // 60) - last_minutes
        out = {}
        with self._lock:
            for feature, buckets in self._usage.items():
                rows = [b for b in buckets if b[0] >= cutoff]
                if rows:
                    out[feature] = {
                        "avg_rpm": sum(r for _, r, _ in rows) / len(rows),
                        "peak_rpm": max(r for _, r, _ in rows),
                        "peak_tpm": max(t for _, _, t in rows),
                    }
        return out

Because it aggregates usage_metadata, you get real token counts for free. After a week of data, the picture was unambiguous: during the 7–9 a.m. peak, the bulk regeneration job owned 82% of RPM. The interactive 429s weren't caused by interactive growth at all — the batch tail had simply reached the morning.

| Feature | Call pattern | Peak RPM share | Tolerable delay |
|---|---|---|---|
| Interactive generation (user-facing) | Morning/evening spikes | 14% | Seconds (users feel it) |
| Bulk description regeneration | Starts at night, runs for hours | 82% | Hours (nobody is waiting) |
| Notification draft generation | Sporadic | 4% | Minutes |

The "tolerable delay" column is the entire design. The problem was never total capacity; it was that workloads with wildly different delay tolerance were drawing from one bucket at equal priority.
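The wrapper's snapshot reports per-feature averages and peaks, not shares. One way to derive a share figure from the same per-minute rows is sketched below. `rpm_share` is a hypothetical helper, and it assumes you can get a copy of the `(epoch_minute, requests, tokens)` rows per feature, for example through an accessor that copies them under the lock. It computes each feature's share of requests inside a time window, which is one reasonable reading of "peak RPM share".

```python
from collections import defaultdict


def rpm_share(usage, start_minute, end_minute):
    """Share of requests per feature within [start_minute, end_minute).

    usage: mapping of feature -> iterable of (epoch_minute, requests, tokens),
    the same row shape the tagging wrapper records.
    """
    totals = defaultdict(int)
    for feature, rows in usage.items():
        for minute, requests, _tokens in rows:
            if start_minute <= minute < end_minute:
                totals[feature] += requests
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no traffic in the window
    return {feature: count / grand_total for feature, count in totals.items()}


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = {
        "chat": [(100, 14, 16_800)],
        "bulk_regen": [(100, 82, 229_600)],
        "notify": [(100, 4, 3_200)],
    }
    for feature, share in rpm_share(sample, 100, 101).items():
        print(f"{feature}: {share:.0%}")
```

Swapping the `requests` column for the `tokens` column gives the TPM view of the same question.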

Before/After — Putting an Admission Gate in Front of Every Call

Before, each feature called the SDK directly, the obvious way:

# Before: every feature calls whenever it likes
# interactive handler
resp = client.models.generate_content(model="gemini-flash-latest", contents=user_prompt)
 
# bulk worker (loops over thousands of items)
for item in items:
    resp = client.models.generate_content(model="gemini-flash-latest", contents=build_prompt(item))

The flaw is that you only learn about congestion when the API-side rate limiter tells you — and by the time a 429 arrives, the distinction between an interactive request and a bulk request is gone. The fix is to do the traffic sorting on your side, in front of the API, with admission control:

# After: every call passes through a priority-aware gate
gate = PriorityAdmissionGate(rpm_limit=1000, tpm_limit=1_000_000,
                             reserved_interactive_ratio=0.3)
 
# interactive handler
async with gate.acquire(feature="chat", priority="interactive", est_tokens=1200):
    resp = await async_generate(user_prompt)
 
# bulk worker
async with gate.acquire(feature="bulk_regen", priority="bulk", est_tokens=2800):
    resp = await async_generate(build_prompt(item))

The gate's contract has exactly two clauses. A fixed share of capacity (30% here) is always reserved for interactive traffic. Bulk may use everything else — and may borrow the reserved share while interactive is idle — but borrowing never flows the other way.
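To make the two clauses concrete, here is a minimal synchronous sketch of the admission decision. It is not the `PriorityAdmissionGate` used above: the class name `PriorityBucket`, the `idle_after_s` threshold, and the idea of detecting "idle" from the time of the last interactive call are all assumptions made for illustration. I also read the second clause as "bulk has no reservation of its own", so interactive may draw the bucket down to zero. It covers one limit only; an async `acquire` would wait and retry around `try_admit`.

```python
import time


class PriorityBucket:
    """One shared token bucket; only interactive calls may dip below the reserve."""

    def __init__(self, limit_per_min, reserved_ratio=0.3, idle_after_s=10.0):
        self.limit = float(limit_per_min)
        self.reserve = self.limit * reserved_ratio  # floor that bulk must respect
        self.idle_after_s = idle_after_s            # hypothetical idle threshold
        self.level = self.limit                     # start with a full bucket
        self._last_refill = None
        self._last_interactive = float("-inf")

    def _refill(self, now):
        # Add tokens at limit/60 per second, capped at the bucket size.
        if self._last_refill is not None:
            elapsed = max(0.0, now - self._last_refill)
            self.level = min(self.limit, self.level + elapsed * self.limit / 60.0)
        self._last_refill = now

    def try_admit(self, priority, cost=1.0, now=None):
        """Return True and spend `cost` tokens if the call may proceed now."""
        now = time.monotonic() if now is None else now
        self._refill(now)
        if priority == "interactive":
            self._last_interactive = now
            floor = 0.0                 # interactive may use the whole bucket
        elif now - self._last_interactive >= self.idle_after_s:
            floor = 0.0                 # interactive is idle: bulk may borrow
        else:
            floor = self.reserve        # interactive is active: keep the reserve
        if self.level - cost < floor:
            return False
        self.level -= cost
        return True


if __name__ == "__main__":
    bucket = PriorityBucket(limit_per_min=1000, reserved_ratio=0.3)
    print(bucket.try_admit("interactive"))  # an interactive call marks the lane active
    print(bucket.try_admit("bulk"))         # bulk is admitted while above the reserve
```

A TPM limit could be handled by a second instance with `cost=est_tokens`. Admitting against both buckets atomically needs extra care and is left out of this sketch.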


WHAT YOU'LL LEARN
  • Feature-tagged instrumentation that identifies which feature is eating your RPM/TPM, and how to read the result (bulk owned 82% of peak RPM in my case)
  • A working Python priority token bucket that reserves capacity for interactive traffic while letting bulk borrow idle headroom, controlling both RPM and TPM
  • Two weeks of before/after numbers (429 rate 3.2% → 0.03%) plus the easy-to-miss operational fact that separate API keys do not isolate quota