I spent the better part of six months staring at a dashboard that read 520ms average TTFT while support kept forwarding the same complaint: "it freezes sometimes." This was a while after I put a Gemini-powered chat feature into production in one of my own apps. Averages do not lie, but an average does not describe the experience anyone actually has. What users complain about is the slow response that shows up a few times out of a hundred.
This is not a "make Gemini faster in general" article. It is a field note for the stage that comes after the average is already fast enough — the stage where the tail is what is breaking the experience. Built around p95/p99 as the central metric, it walks through how I rebuilt measurement, routing, timeouts, retries, and cache accounting, in the order that actually moved the numbers.
Why the average TTFT hides the problem
Latency does not follow a normal distribution; it has a long right tail. Most requests return quickly, and a small fraction are dramatically slow. With that shape, the average stays close to the fast cluster, because the slow requests are too rare to move it far. So even with a 520ms average, a p99 (the latency that the slowest 1% of requests exceed) of 4,000ms is entirely ordinary.
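To make that shape concrete, here is a small sketch on synthetic, hand-picked numbers (not my production data): 97 fast requests and 3 slow ones. The values are chosen only so that the mean lands on 520ms, and the nearest-rank percentile used here is a simplification.

```python
from statistics import mean

# Synthetic latencies in ms (illustrative only): 97 fast requests, 3 slow ones.
latencies_ms = [400] * 97 + [4400] * 3

avg_ms = mean(latencies_ms)

xs = sorted(latencies_ms)
p50_ms = xs[round(0.50 * (len(xs) - 1))]  # nearest-rank median
p99_ms = xs[round(0.99 * (len(xs) - 1))]  # nearest-rank p99

print(f"mean={avg_ms:.0f}ms p50={p50_ms}ms p99={p99_ms}ms")
# -> mean=520ms p50=400ms p99=4400ms
```

The mean looks healthy while the p99 is more than ten times the median, which is exactly the situation a dashboard of averages cannot show.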
What governs perceived speed is the thickness of that tail, not the average. In a chat UI, a single user fires 10–20 requests per session. By definition, each request has a 1% chance of landing beyond p99, so if requests are independent, the chance that at least one of 20 requests "freezes" is 1 − 0.99^20, or about 18%. Roughly one in five users will hit a sluggish moment at least once per session. Watch only the average and you will miss this forever.
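The per-session arithmetic is easy to check. This sketch assumes each request falls into the slow tail independently with the same probability, which is a simplification; the function name is mine.

```python
def p_at_least_one_slow(requests_per_session: int, p_slow: float = 0.01) -> float:
    """Chance that at least one request in a session lands in the slow tail,
    assuming each request is slow independently with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** requests_per_session


for n in (10, 20):
    print(n, f"{p_at_least_one_slow(n):.1%}")
# -> 10 9.6%
# -> 20 18.2%
```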
The first thing you need is telemetry that records percentiles, not averages. Emit one number per request and fold it into a histogram later.
```python
# pip install google-genai
import json
import math  # used by the percentile helper further down
import time

from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")


def timed_stream(model: str, prompt: str, request_id: str):
    """Emit one structured log line of latency breakdown per request.
    Keep raw numbers only, on the assumption you'll fold them into p50/p95/p99 later."""
    t0 = time.perf_counter()
    t_first = None
    out_chars = 0
    status = "ok"
    try:
        stream = client.models.generate_content_stream(model=model, contents=prompt)
        for chunk in stream:
            if t_first is None:
                t_first = time.perf_counter()  # arrival of the first chunk
            if chunk.text:
                out_chars += len(chunk.text)
    except Exception as e:
        status = type(e).__name__  # record the failure type instead of raising
    t_end = time.perf_counter()
    rec = {
        "request_id": request_id,
        "model": model,
        "ttft_ms": round((t_first - t0) * 1000) if t_first is not None else None,
        "e2e_ms": round((t_end - t0) * 1000),
        "out_chars": out_chars,
        "status": status,
    }
    print(json.dumps(rec, ensure_ascii=False))  # ship to your log backend
    return rec
```

The key is not to compute the average inside the app. Once you collapse to an average, you cannot recover the distribution when you later decide you only care about p95. Keep the raw ttft_ms and let your log backend (BigQuery, or just Python on your laptop) compute percentiles.
```python
import math


def percentiles(values, ps=(50, 95, 99)):
    """Linear-interpolated percentiles (None values are dropped, the rest sorted).
    No extra dependency; useful for a quick check right after ingesting logs."""
    xs = sorted(v for v in values if v is not None)
    if not xs:
        return {}
    out = {}
    for p in ps:
        k = (len(xs) - 1) * (p / 100)  # fractional rank of the p-th percentile
        lo, hi = math.floor(k), math.ceil(k)
        out[f"p{p}"] = round(xs[lo] + (xs[hi] - xs[lo]) * (k - lo))
    return out


# e.g. ttfts = [list of ttft_ms gathered from log lines]
# print(percentiles(ttfts)) -> {'p50': 480, 'p95': 1700, 'p99': 4200}
```

The ratio of p99 to p50, the tail ratio, is my single most important metric. Once p99/p50 climbs above 3, no amount of lowering the average will improve the experience. You have to hit the tail directly.
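A minimal sketch of turning logged lines into that ratio. It assumes one JSON object per line with a ttft_ms field, as in the logger above, and swaps in the standard library's statistics.quantiles (inclusive method) for the hand-rolled helper; the tail_ratio name is mine.

```python
import json
from statistics import quantiles


def tail_ratio(log_lines):
    """Return p99/p50 of ttft_ms across JSON log lines.
    Returns None when there are too few samples or p50 is zero."""
    ttfts = []
    for line in log_lines:
        rec = json.loads(line)
        if rec.get("ttft_ms") is not None:  # skip requests that never got a first chunk
            ttfts.append(rec["ttft_ms"])
    if len(ttfts) < 2:
        return None
    cuts = quantiles(ttfts, n=100, method="inclusive")  # 99 cut points: p1..p99
    p50, p99 = cuts[49], cuts[98]
    return p99 / p50 if p50 else None
```

Feeding it a day of log lines and alerting when the result exceeds 3 is one way to keep the threshold visible; where you put the threshold is a judgment call for your own product.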
Design backward from a tail time budget
The thing that helped most with the tail was not adding techniques — it was deciding a single number first: the tail time budget.
Concretely, decide how many milliseconds a request may take to return its first token before you cut it off and act rather than keep the user waiting. For my chat UI, I set the TTFT budget at 1,200ms. With a p50 of 480ms there is usually plenty of headroom, but that 1,200ms becomes the origin point for every design decision.
Once the budget is fixed, the ceiling for each layer falls out automatically.
| Layer | Budget | Action on overrun |
|---|---|---|
| Client → edge | ~150ms | Co-locate region, reuse connection |
| Input processing (TTFT) | ~900ms | Cache, thinking_budget 0, downgrade model |
| Cutoff margin | ~150ms | Fire timeout → fall back |
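The table can also live in code as plain data, so the layer ceilings are checked against the total and an overrun can be attributed to a layer. This is a sketch: the key names are hypothetical, and the approximate figures from the table are treated as exact.

```python
TTFT_BUDGET_MS = 1200

# Per-layer ceilings from the table (ms); key names are illustrative.
LAYER_BUDGET_MS = {
    "client_to_edge": 150,
    "input_processing": 900,
    "cutoff_margin": 150,
}

# The layers must add up to the overall budget.
assert sum(LAYER_BUDGET_MS.values()) == TTFT_BUDGET_MS


def overrun_layers(measured_ms: dict) -> list:
    """Return the layers whose measured time exceeds their ceiling."""
    return [
        name
        for name, ceiling in LAYER_BUDGET_MS.items()
        if measured_ms.get(name, 0) > ceiling
    ]


print(overrun_layers({"client_to_edge": 90, "input_processing": 1400}))
# -> ['input_processing']
```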
What matters is that on overrun you cut off explicitly rather than "just wait." Wrap the Gemini call in asyncio.wait_for, and when the TTFT budget is exceeded, switch to a faster configuration.
```python
import asyncio

from google import genai
from google.genai import types

aclient = genai.Client(api_key="YOUR_GEMINI_API_KEY").aio

# Thinking disabled, used on the fallback path
NO_THINKING = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_budget=0)
)


async def first_token_within(model, prompt, budget_s, config=None):
    """Return the first chunk and the rest of the stream if the first chunk
    arrives within budget_s. Otherwise raise TimeoutError so the caller can fall back."""

    async def open_and_read_first():
        stream = await aclient.models.generate_content_stream(
            model=model, contents=prompt, config=config
        )
        agen = stream.__aiter__()
        return await agen.__anext__(), agen

    # The budget covers opening the stream as well as waiting for the first chunk.
    return await asyncio.wait_for(open_and_read_first(), timeout=budget_s)


async def answer(prompt):
    try:
        first, rest = await first_token_within("gemini-2.5-flash", prompt, 1.2)
    except (asyncio.TimeoutError, StopAsyncIteration):
        # Over budget: escape to a faster config (no thinking + lighter model)
        first, rest = await first_token_within(
            "gemini-2.5-flash-lite", prompt, 2.0, config=NO_THINKING
        )
    if first.text:  # the first chunk may carry no text
        yield first.text
    async for chunk in rest:
        if chunk.text:
            yield chunk.text
```

This "downgrade-and-retry on overrun" pattern shrinks p99 dramatically at the cost of a slightly higher average. In my environment, p99 TTFT dropped from 4,200ms to 1,900ms once the fallback was in place. The fallback fires on only about 2% of traffic, so the average barely moves. The numbers confirm that you are hitting only the tail.
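To exercise the cutoff logic without calling the API, the same pattern can be run against fake streams. Everything in this sketch is a stand-in: fake_stream simulates a slow first token with asyncio.sleep, and the returned used_fallback flag is one possible way to track how often the fallback fires.

```python
import asyncio


async def fake_stream(first_delay_s, chunks):
    """Stand-in for a model stream: waits first_delay_s before the first chunk."""
    await asyncio.sleep(first_delay_s)
    for c in chunks:
        yield c


async def first_within(agen, budget_s):
    """Return (first_item, agen), or raise TimeoutError if the first item is late."""
    first = await asyncio.wait_for(agen.__anext__(), timeout=budget_s)
    return first, agen


async def answer_with_fallback(primary, fallback, budget_s):
    """Collect the full answer, switching streams if the primary misses the budget."""
    used_fallback = False
    try:
        first, rest = await first_within(primary, budget_s)
    except asyncio.TimeoutError:
        used_fallback = True
        first, rest = await first_within(fallback, budget_s * 2)
    out = [first]
    async for c in rest:
        out.append(c)
    return "".join(out), used_fallback


async def demo():
    slow = fake_stream(0.5, ["slow ", "answer"])   # misses a 0.1s budget
    fast = fake_stream(0.01, ["fast ", "answer"])  # comfortably inside it
    return await answer_with_fallback(slow, fast, budget_s=0.1)


result = asyncio.run(demo())
print(result)
# -> ('fast answer', True)
```

Logging the flag next to ttft_ms lets you read the fallback rate and the p99 off the same records.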