GEMINI LAB
API / SDK / 2026-06-27 / Advanced

Don't Retry Every Gemini 429 — Telling Rate Limits Apart From Spend Cap Exhaustion

A 429 RESOURCE_EXHAUSTED can mean 'wait a second and it clears' or 'you're out of budget for the month.' Now that Project Spend Caps is generally available, the second case is real in production. Here's how to classify the two and build a retry layer plus a circuit breaker around them.

Tags: gemini-api, rate-limit, retry, spend-cap, production

I run Gemini behind a wallpaper app that I maintain on my own, and there 429 RESOURCE_EXHAUSTED is not a rare error. The problem was that for a long time I didn't notice there are two kinds. One is a transient rate limit: you sent too much in the same second, and waiting a few hundred milliseconds clears it. The other is exhaustion: this project has spent its budget for the month, and no amount of waiting or retrying will get you through until the calendar flips.

Handling both with the same exponential backoff means the retry layer quietly thrashes on the second case. With up to seven retries per request, every failing request becomes as many as eight doomed calls against an exhausted project (the original attempt plus seven retries), and to the user it just looks like an app that's mysteriously slow to load. For an ad-supported free app, that latency turns straight into churn.
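To put a rough number on that thrash, here is a back-of-the-envelope sketch. The 0.5-second base delay and the doubling factor are assumptions chosen for illustration; only the retry count of seven comes from the text above.

```python
def worst_case_backoff(max_retries: int, base_s: float = 0.5, factor: float = 2.0):
    """Return (total API calls, total seconds slept) when every attempt fails.

    base_s and factor are illustrative assumptions, not values from any SDK.
    """
    calls = 1 + max_retries  # the first attempt plus each retry
    slept = sum(base_s * factor ** i for i in range(max_retries))
    return calls, slept


calls, slept = worst_case_backoff(7)
print(f"{calls} calls, {slept:.1f}s of waiting before the user sees a failure")
```

Under these assumed settings a single request against an exhausted project spends over a minute sleeping before it gives up, which is the "mysteriously slow" behaviour described above.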

On June 26, 2026, Project Spend Caps became generally available, letting you set a per-project monthly dollar ceiling. It's a welcome way to cap costs structurally — but it also reliably raises the odds of hitting a "you're over the cap" 429 in production. Which means a design that retries every 429 uniformly is exactly the thing worth revisiting right now. Separating projects at the structural level is covered in splitting Spend Cap blast radius by tier; this article focuses on degrading at request time.

Split 429 into "wait and it clears" and "waiting won't help"

The first move is to classify the 429 before it ever reaches the retry layer. There are three signals to lean on.

The first is google.rpc.RetryInfo in the error response. When the server explicitly says "you may retry after this delay," it includes a retryDelay field. A 429 carrying that is, by design, a rate limit you're allowed to retry.

The second is the QuotaFailure detail, which tells you which quota dimension you tripped (requests-per-minute, tokens-per-minute, and so on). A per-second or per-minute quota recovers if you wait; a daily or monthly ceiling operates on a completely different time scale.
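For orientation, the sketch below shows roughly what a body carrying both details looks like once parsed into a Python dict. The two `@type` URLs are the standard google.rpc type names; the message text, the quota ID string, and the delay value are made-up placeholders, so treat this as an illustrative shape and not as a captured response.

```python
# Illustrative 429 body. Field values are placeholders, not a captured response.
sample_429 = {
    "error": {
        "code": 429,
        "status": "RESOURCE_EXHAUSTED",
        "message": "Quota exceeded (placeholder text)",
        "details": [
            {
                "@type": "type.googleapis.com/google.rpc.QuotaFailure",
                "violations": [{"quotaId": "ExampleRequestsPerMinute"}],
            },
            {
                "@type": "type.googleapis.com/google.rpc.RetryInfo",
                "retryDelay": "5s",
            },
        ],
    }
}


def detail_kinds(body: dict) -> list:
    """List the short type names of the detail entries attached to an error."""
    entries = body.get("error", {}).get("details", [])
    return [e.get("@type", "").rsplit(".", 1)[-1] for e in entries]


print(detail_kinds(sample_429))
```

A 429 may carry one detail, both, or neither, which is why neither signal is sufficient on its own.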

The third — and the most important — is information only you hold: your own monthly spend gate. Trying to determine "did I hit the Spend Cap?" purely from the API's error body produces a brittle implementation that depends on the fine shape of the error. Instead, keep a rough running total of "how much have I spent this month?" on your side and make that the primary axis of classification. Treat the API details as a supporting signal only.
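A minimal sketch of such a gate follows. Everything in it is hypothetical: the `SpendGate` name, the per-million-token prices, the 90% safety margin, and the choice of UTC calendar months are placeholders to replace with your own pricing and budget rules. It keeps the total in memory, so a real deployment would need to persist it somewhere shared between processes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SpendGate:
    """Rough client-side estimate of this month's spend (hypothetical helper)."""
    monthly_budget_usd: float
    usd_per_m_input: float        # placeholder price per 1M input tokens
    usd_per_m_output: float       # placeholder price per 1M output tokens
    safety_margin: float = 0.9    # treat 90% of the budget as "exhausted"
    _month: str = field(default="", init=False)
    _spent_usd: float = field(default=0.0, init=False)

    def _roll(self, now: datetime) -> None:
        key = now.strftime("%Y-%m")
        if key != self._month:    # new calendar month: reset the running total
            self._month, self._spent_usd = key, 0.0

    def record(self, input_tokens: int, output_tokens: int,
               now: Optional[datetime] = None) -> None:
        """Add the estimated cost of one completed call to the running total."""
        self._roll(now or datetime.now(timezone.utc))
        self._spent_usd += (input_tokens * self.usd_per_m_input
                            + output_tokens * self.usd_per_m_output) / 1_000_000

    def exhausted(self, now: Optional[datetime] = None) -> bool:
        """True once the estimate reaches the budget times the safety margin."""
        self._roll(now or datetime.now(timezone.utc))
        return self._spent_usd >= self.monthly_budget_usd * self.safety_margin


gate = SpendGate(monthly_budget_usd=10.0, usd_per_m_input=1.0, usd_per_m_output=4.0)
gate.record(input_tokens=2_000_000, output_tokens=1_000_000)  # about $6 at these prices
print(gate.exhausted())
```

The safety margin is there because the estimate is rough: tripping slightly early costs less than discovering the cap through a wave of 429s.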

| Signal | Meaning | Retry decision |
| --- | --- | --- |
| RetryInfo.retryDelay present | Server expects recovery after a stated wait | Retryable (wait the stated seconds) |
| QuotaFailure is a per-minute quota | RPM/TPM exceeded; recovers soon | Retryable (backoff) |
| Your monthly spend gate is over the line | Likely out of budget for the month | Not retryable (degrade) |
| No RetryInfo, repeated unexplained exhaustion | Undeterminable but not recovering | Trip the breaker conservatively |

The design rule here is: when in doubt, don't call. Retrying costs you time and a sliver of latency budget, but pounding an exhausted project buys you nothing at all.
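One way to make "trip the breaker conservatively" concrete is a small consecutive-failure breaker. This is a sketch under assumptions: the class name, the threshold of three, and the 60-second cooldown are hypothetical choices, and the clock is injectable only so the behaviour can be exercised in tests.

```python
import time
from typing import Callable, Optional


class ExhaustionBreaker:
    """Stop calling after repeated unexplained 429s; probe again after a cooldown."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def allow_call(self) -> bool:
        if self._opened_at is None:
            return True
        # Once the cooldown has passed, let a call through again as a probe.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_unexplained_429(self) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None
```

Check `allow_call()` before each request and serve the degraded path when it returns False. A failed probe re-opens the breaker for another full cooldown, so an exhausted project sees at most one call per cooldown window.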

Implementing the classifier

Using Gemini's Python SDK (google-genai), here's a classifier that reads those signals off the exception. Exception attribute names drift between SDK versions, so the trick is to extract things defensively rather than depend on one specific attribute.

# pip install google-genai
from __future__ import annotations  # lets `float | None` annotations work before Python 3.10

from dataclasses import dataclass
from enum import Enum
import json
import re
 
 
class Verdict(Enum):
    RETRYABLE = "retryable"        # clears with a wait (backoff OK)
    TERMINAL = "terminal"          # pointless this month (degrade)
    UNKNOWN = "unknown"            # undeterminable (trip conservatively)
 
 
@dataclass
class Classification:
    verdict: Verdict
    retry_after_s: float | None    # server-stated wait, if any
    reason: str
 
 
def _extract_details(err) -> dict:
    """Pull structured details off the exception, absorbing SDK differences."""
    # google-genai's APIError often carries .code / .status / .details,
    # but versions vary, so probe with getattr and fall back to the string body.
    payload = {}
    for attr in ("details", "response_json", "args"):
        val = getattr(err, attr, None)
        if isinstance(val, dict):
            payload = val
            break
        if isinstance(val, (list, tuple)) and val and isinstance(val[0], dict):
            payload = val[0]
            break
    if not payload:
        # last resort: scrape a JSON fragment out of the stringified body
        text = str(getattr(err, "message", "") or err)
        m = re.search(r"\{.*\}", text, re.DOTALL)
        if m:
            try:
                payload = json.loads(m.group(0))
            except json.JSONDecodeError:
                payload = {}
    return payload
 
 
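def _retry_delay_seconds(details: dict) -> float | None:
    """Convert google.rpc.RetryInfo retryDelay (e.g. "5s") into seconds."""
    error = details.get("error", details)
    if not isinstance(error, dict):
        return None  # unexpected shape: never raise from inside error handling
    for d in error.get("details") or []:
        if not isinstance(d, dict):
            continue
        if "RetryInfo" in str(d.get("@type") or ""):
            raw = d.get("retryDelay", "")
            m = re.match(r"(\d+(?:\.\d+)?)s", str(raw))
            if m:
                return float(m.group(1))
    return None


def _quota_dimension(details: dict) -> str | None:
    """Read the quota ID from QuotaFailure (a hint for per-minute vs not)."""
    error = details.get("error", details)
    if not isinstance(error, dict):
        return None  # unexpected shape: treat as "no readable dimension"
    for d in error.get("details") or []:
        if not isinstance(d, dict):
            continue
        if "QuotaFailure" in str(d.get("@type") or ""):
            for v in d.get("violations") or []:
                if not isinstance(v, dict):
                    continue
                # Prefer the quota ID; fall back to the violation's subject.
                qid = v.get("quotaId") or v.get("subject") or ""
                if qid:
                    return str(qid)
    return None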
 
 
def classify_429(err, monthly_budget_exhausted: bool) -> Classification:
    """Classify a 429 into three buckets. monthly_budget_exhausted comes
    from your own spend gate."""
    details = _extract_details(err)
    delay = _retry_delay_seconds(details)
    qid = _quota_dimension(details) or ""
 
    # If your own gate says "done for the month," trust that first.
    if monthly_budget_exhausted:
        return Classification(Verdict.TERMINAL, None, "monthly spend gate exhausted")
 
    # Server stated a wait -> rate limit. Just wait.
    if delay is not None:
        return Classification(Verdict.RETRYABLE, delay, f"server RetryInfo={delay}s")
 
    # Hit a per-minute quota -> recovers with a wait.
    # re.IGNORECASE already covers spellings such as "PerMinute".
    if re.search(r"(per[-_ ]?minute|RPM|TPM)", qid, re.IGNORECASE):
        return Classification(Verdict.RETRYABLE, None, f"per-minute quota: {qid}")

    # Daily/monthly/project exhaustion -> waiting generally won't fix it.
    # "project" is a broad match, so this check must stay after the per-minute one.
    if re.search(r"(per[-_ ]?day|monthly|project)", qid, re.IGNORECASE):
        return Classification(Verdict.TERMINAL, None, f"long-window quota: {qid}")
 
    # Exhaustion with no RetryInfo and no readable dimension -> undeterminable
    return Classification(Verdict.UNKNOWN, None, "no RetryInfo, unknown quota")

The key is that monthly_budget_exhausted, your own boolean, is trusted above everything else. That value is an estimate, but it is grounded in your own records rather than inferred from the shape of an error body. The API's error shape may change in the future, but the verdict "my estimated spend this month hit the ceiling" is owned by your code. Robustness in the Spend Cap era starts with not delegating that judgment to the server.
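To show where the verdict plugs in, here is a hedged sketch of a caller. It is not the article's own retry layer: `call_with_verdict`, `fallback`, and the backoff constants are hypothetical names and values. It assumes the raised exception exposes an integer `code` attribute, and that `classify` returns an object shaped like the `Classification` above (a `verdict` whose `.value` is a string, plus an optional `retry_after_s`).

```python
import random
import time
from typing import Any, Callable


def call_with_verdict(
    call: Callable[[], Any],
    classify: Callable[[Exception], Any],
    fallback: Callable[[], Any],
    max_retries: int = 3,
    base_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry only while the classifier says 'retryable'; otherwise degrade at once."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as err:
            if getattr(err, "code", None) != 429:
                raise  # not a 429 (assumed .code attribute): leave it to other handlers
            c = classify(err)
            if c.verdict.value != "retryable" or attempt == max_retries:
                return fallback()  # terminal, unknown, or out of retries
            # Prefer the server-stated delay; otherwise exponential backoff with jitter.
            delay = c.retry_after_s
            if delay is None:
                delay = base_s * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)
    return fallback()  # not reached; kept so every path returns a value
```

In use, `classify` would be something like `lambda e: classify_429(e, gate_is_exhausted)`, and `fallback` would return a cached result or call a cheaper model. Terminal and unknown verdicts both skip the sleep entirely, which is the whole point of classifying first.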
