API / SDK · 2026-06-23 · Advanced

Your Gemini API Average Latency Looks Great — But Some Users Still Get Stuck. Defending p95/p99

Your average TTFT is fast, yet a fraction of users keep hitting frozen responses. That is a tail-latency problem (p95/p99). From measurement to model routing, streaming budgets, cache accounting, and retry design — here are the defenses that actually held up in production, with code.

Tags: Gemini API, Tail Latency, p95, SLO, Streaming


I spent the better part of six months staring at a dashboard that read 520ms average TTFT while support kept forwarding the same complaint: "it freezes sometimes." This was a while after I put a Gemini-powered chat feature into production in one of my own apps. Averages do not lie, but an average does not describe the experience anyone actually has. What users complain about is the slow response that shows up a few times out of a hundred.

This is not a "make Gemini faster in general" article. It is a field note for the stage that comes after the average is already fast enough — the stage where the tail is what is breaking the experience. Built around p95/p99 as the central metric, it walks through how I rebuilt measurement, routing, timeouts, retries, and cache accounting, in the order that actually moved the numbers.

Why the average TTFT hides the problem

Latency does not follow a normal distribution; it has a long right tail. Most requests return quickly, and a small fraction are dramatically slow. With that shape, the average is dominated by the fast cluster, and the few slow requests barely shift it. So even with a 520ms average, a p99 of 4,000ms (the threshold that the slowest 1% of requests exceed) is entirely ordinary.

What governs perceived speed is the thickness of that tail, not the average. In a chat UI, a single user fires 10–20 requests per session. By definition, each request has a 1% chance of being slower than p99, so, assuming requests are independent, the chance that at least one of 20 requests "freezes" is 1 - 0.99^20, or about 18%. Nearly one in five users will hit a sluggish moment at least once per session. Watch only the average and you will miss this forever.
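
The per-session figure is easy to check by hand. The sketch below is only an illustration: the function name is made up for this example, and it assumes every request independently has the same chance of landing in the slow tail, which real traffic only approximates.

```python
def p_at_least_one_slow(per_request_p: float, n_requests: int) -> float:
    """Probability that at least one of n independent requests lands in the slow tail."""
    return 1.0 - (1.0 - per_request_p) ** n_requests

# A 1% tail per request, over sessions of 10 and 20 requests
for n in (10, 20):
    print(n, round(p_at_least_one_slow(0.01, n), 3))
# 10 0.096
# 20 0.182
```

The longer the session, the more a "rare" slow request becomes something most users eventually see.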

The first thing you need is telemetry that records percentiles, not averages. Emit one number per request and fold it into a histogram later.

# pip install google-genai
import time, json, math
from google import genai
 
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
def timed_stream(model: str, prompt: str, request_id: str):
    """Emit one structured log line of latency breakdown per request.
    Keep raw numbers only, on the assumption you'll fold them into p50/p95/p99 later."""
    t0 = time.perf_counter()
    t_first = None
    out_chars = 0
    status = "ok"
    try:
        stream = client.models.generate_content_stream(model=model, contents=prompt)
        for chunk in stream:
            if t_first is None:
                t_first = time.perf_counter()
            if chunk.text:
                out_chars += len(chunk.text)
    except Exception as e:
        status = type(e).__name__
    t_end = time.perf_counter()
 
    rec = {
        "request_id": request_id,
        "model": model,
        "ttft_ms": round((t_first - t0) * 1000) if t_first else None,
        "e2e_ms": round((t_end - t0) * 1000),
        "out_chars": out_chars,
        "status": status,
    }
    print(json.dumps(rec, ensure_ascii=False))  # ship to your log backend
    return rec

The key is not to compute the average inside the app. Once you collapse to an average, you cannot recover the distribution when you later decide you only care about p95. Keep the raw ttft_ms and let your log backend (BigQuery, or just Python on your laptop) compute percentiles.

def percentiles(values, ps=(50, 95, 99)):
    """Linear-interpolated percentiles from a sorted array.
    No extra dependency; useful for a quick check right after ingesting logs."""
    xs = sorted(v for v in values if v is not None)
    if not xs:
        return {}
    out = {}
    for p in ps:
        k = (len(xs) - 1) * (p / 100)
        lo, hi = math.floor(k), math.ceil(k)
        out[f"p{p}"] = round(xs[lo] + (xs[hi] - xs[lo]) * (k - lo))
    return out
 
# e.g. ttfts = [list of ttft_ms gathered from log lines]
# print(percentiles(ttfts))  ->  {'p50': 480, 'p95': 1700, 'p99': 4200}

The ratio of p99 to p50, which I call the tail ratio, is my single most important metric. Once p99/p50 climbs above 3, no amount of lowering the average will improve the experience. You have to hit the tail directly.
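
As a rough sketch, the tail ratio can be computed straight from the percentile output. The helper name and the wording of the printed verdict are illustrative; the threshold of 3 is the rule of thumb from the paragraph above, and the inputs are the example values from the percentile comment earlier.

```python
def tail_ratio(p50_ms: float, p99_ms: float) -> float:
    """p99 divided by p50; higher means a heavier tail relative to the typical request."""
    if p50_ms <= 0:
        raise ValueError("p50 must be positive")
    return p99_ms / p50_ms

ratio = tail_ratio(480, 4200)
verdict = "attack the tail" if ratio > 3 else "tail is under control"
print(f"tail ratio = {ratio:.2f} -> {verdict}")
# tail ratio = 8.75 -> attack the tail
```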

Design backward from a tail time budget

The thing that helped most with the tail was not adding techniques — it was deciding a single number first: the tail time budget.

Concretely, decide how many milliseconds a request may take to return its first token before you stop making the user wait and cut it off. For my chat UI, I set the TTFT budget at 1,200ms. With a p50 of 480ms there is usually plenty of headroom, but that 1,200ms becomes the origin point for every design decision.

Once the budget is fixed, the ceiling for each layer falls out automatically.

Layer                   | Budget | Action on overrun
Client → edge           | ~150ms | Co-locate region, reuse connection
Input processing (TTFT) | ~900ms | Cache, thinking_budget 0, downgrade model
Cutoff margin           | ~150ms | Fire timeout → fall back
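
One way to keep the layer budgets honest is to write them down as data and check that they add up to the overall TTFT budget. This is a minimal sketch: the dictionary keys and the helper are hypothetical names, and only the millisecond values come from the table above.

```python
TTFT_BUDGET_MS = 1200  # the single tail time budget chosen for the chat UI

# Hypothetical layer names; the values mirror the table above.
LAYER_BUDGETS_MS = {
    "client_to_edge": 150,
    "input_processing": 900,
    "cutoff_margin": 150,
}

def overrun_layers(measured_ms: dict) -> list:
    """Return the layers whose measured time exceeds their share of the budget."""
    return [
        name
        for name, limit in LAYER_BUDGETS_MS.items()
        if measured_ms.get(name, 0) > limit
    ]

# The layers must account for the whole budget, no more and no less.
assert sum(LAYER_BUDGETS_MS.values()) == TTFT_BUDGET_MS

print(overrun_layers({"client_to_edge": 90, "input_processing": 1400}))
# ['input_processing']
```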

What matters is that on overrun you cut off explicitly rather than "just wait." Wrap the Gemini call in asyncio.wait_for, and when the TTFT budget is exceeded, switch to a faster configuration.

import asyncio
from google import genai
 
aclient = genai.Client(api_key="YOUR_GEMINI_API_KEY").aio
 
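async def first_token_within(model, prompt, budget_s):
    """Return (first_chunk, stream) if the first chunk arrives within budget_s.
    Otherwise raise TimeoutError so the caller can fall back."""
    async def _open_and_first():
        # Opening the stream can block too, so it sits inside the same timeout.
        stream = await aclient.models.generate_content_stream(model=model, contents=prompt)
        agen = stream.__aiter__()
        return await agen.__anext__(), agen
    # On timeout, wait_for cancels the inner coroutine and raises TimeoutError.
    return await asyncio.wait_for(_open_and_first(), timeout=budget_s)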
 
async def answer(prompt):
    try:
        first, rest = await first_token_within("gemini-2.5-flash", prompt, 1.2)
    except (asyncio.TimeoutError, StopAsyncIteration):
        # Over budget: escape to a faster config (no thinking + lighter model)
        first, rest = await first_token_within("gemini-2.5-flash-lite", prompt, 2.0)
    if first.text:  # the first chunk may carry no text
        yield first.text
    async for chunk in rest:
        if chunk.text:
            yield chunk.text

This "downgrade-and-retry on overrun" pattern shrinks p99 dramatically while costing very little on the average. In my environment, p99 TTFT dropped from 4,200ms to 1,900ms once the fallback was in place. The fallback fires on only about 2% of traffic, so the average barely moves, which confirms that the change touches only the tail.
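
To confirm that the fallback stays confined to the tail, it helps to track how often it fires. The sketch below assumes each log record carries a boolean `fallback` field, which is not part of the telemetry shown earlier and would need to be added. The records here are toy values for illustration, not measurements.

```python
def fallback_rate(records: list) -> float:
    """Fraction of requests that were served by the fallback path."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("fallback")) / len(records)

# Toy records: 98 requests on the primary path, 2 on the fallback
toy = [{"fallback": False}] * 98 + [{"fallback": True}] * 2
print(f"fallback rate = {fallback_rate(toy):.1%}")
# fallback rate = 2.0%
```

If this rate creeps well above a few percent, the fallback is no longer a tail defense and the primary configuration itself probably needs attention.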
