Making Gemini API Output Reproducible with the seed Parameter — Practical Patterns for Tests and Debugging

"I'm sending the exact same prompt and getting a different answer every time" — that's the wall most teams hit the moment they try to write tests against a Gemini-powered feature. I ran into it myself when wiring up regression tests for one of my apps, and chasing the variance ate a surprising amount of time before I finally pinned it down.

The good news is that, in most cases, the seed parameter does what you want. The less obvious news is that "just pass a seed and you'll get the same answer" is not quite accurate — there are situations where seed simply cannot stabilize the output. This article walks through how seed actually works, the patterns I rely on for tests and debugging, and the gotchas that surprise people most often.

What seed actually controls

The seed parameter on the Gemini API fixes the starting point of the pseudo-random number generator used during sampling. Send the same seed with the same prompt, the same model, and the same generation config, and the sampling sequence becomes the same — which usually means the output text becomes the same as well.

It's worth being precise about how this differs from temperature.

temperature=0.0 alone pushes the model toward greedy decoding, which is mostly deterministic but can still drift due to batching order and tiny numerical differences server-side
Adding seed constrains the sampling step itself, so the result is far more consistent run-to-run

In my own experience, seed + low temperature is significantly more stable for regression tests than just lowering temperature.
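To make the distinction concrete, here is a toy sampler built on Python's own random module. It is only an illustration of the idea, not a model of Gemini's internals: the logits, the token names, and both function names are made up for this sketch.

```python
import math
import random

# Hypothetical next-token scores, invented for this illustration.
TOY_LOGITS = {"Tokyo": 2.0, "Kyoto": 1.0, "Osaka": 0.5}


def sample_token(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick one token from a toy distribution."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token, no randomness.
        return max(logits, key=logits.get)
    # Softmax with temperature, then a weighted random draw.
    weights = [math.exp(score / temperature) for score in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


def sample_sequence(logits: dict, temperature: float, seed: int, length: int = 8) -> list:
    """Draw `length` tokens from a generator whose starting point is set by `seed`."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(length)]


if __name__ == "__main__":
    print(sample_sequence(TOY_LOGITS, temperature=0.9, seed=42))
    print(sample_sequence(TOY_LOGITS, temperature=0.9, seed=42))  # same draws as above
    print(sample_sequence(TOY_LOGITS, temperature=0.0, seed=7))   # greedy path
```

In this local toy, the same seed always replays the same draws. A hosted model adds the server-side effects mentioned above, which is why the two settings work best together.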

Minimal Python example

Here's a small Python sample using the official google-genai SDK. It's deliberately written to drop straight into a snapshot test under pytest.

# pip install google-genai
import os
from google import genai
from google.genai import types
 
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
 
def generate_with_seed(prompt: str, seed: int = 42) -> str:
    """Run a prompt with fixed seed and zero temperature for reproducible output."""
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            temperature=0.0,
            top_p=1.0,
            seed=seed,
            max_output_tokens=512,
        ),
    )
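    # response.text can be None (for example, when no text part is returned),
    # so fall back to an empty string to honor the declared return type.
    return response.text or ""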
 
if __name__ == "__main__":
    out_a = generate_with_seed("What is the capital of Japan? One word.")
    out_b = generate_with_seed("What is the capital of Japan? One word.")
    print(out_a)
    print(out_b)
    print("match:", out_a == out_b)

With this setup you should see Tokyo on both calls and match: True at the end. One caveat: even a single trailing space in the prompt changes the input tokens and can flip the result, so always pin your test inputs to exact strings.

REST and Node.js variants

Calling the REST endpoint, the seed lives inside generationConfig:

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=YOUR_GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts":[{"text":"What is 2+2? Answer with just the number."}]}],
    "generationConfig": {
      "temperature": 0.0,
      "topP": 1.0,
      "seed": 42,
      "maxOutputTokens": 32
    }
  }'

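The Node.js SDK (@google/genai) follows the same shape. The snippet below is TypeScript; drop the ! after GEMINI_API_KEY if you are running plain JavaScript: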

import { GoogleGenAI } from "@google/genai";
 
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });
 
const result = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Translate 'Good morning' to French.",
  config: { temperature: 0, topP: 1, seed: 42, maxOutputTokens: 64 },
});
console.log(result.text);

REST and the SDKs route through the same backend, so an identical config should behave the same way regardless of which surface you use. The run-to-run caveats described later in this article apply equally to all of them.

Three patterns I use for tests and debugging

1. Snapshot regression tests

For prompt-driven features, I save the output of a fixed seed and temperature=0 run into a snapshot file. CI fails when the response changes, which catches "the answer silently shifted" bugs that are otherwise invisible. I walk through the full pytest setup in Building Prompt Regression Tests for Gemini API with pytest.
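As a rough sketch of the idea, a snapshot check can be a single helper. The name check_snapshot, the JSON file layout, and the injected generate callable are all my own choices for this illustration, not part of any SDK. Passing the generator in as a parameter keeps the helper testable without a network call.

```python
import json
from pathlib import Path
from typing import Callable


def check_snapshot(
    name: str,
    prompt: str,
    generate: Callable[[str], str],
    snapshot_dir: Path,
) -> bool:
    """Return True if the output matches the stored snapshot.

    On the first run there is nothing to compare against, so the snapshot
    is written to disk and the check passes.
    """
    path = snapshot_dir / f"{name}.json"
    output = generate(prompt)
    if not path.exists():
        snapshot_dir.mkdir(parents=True, exist_ok=True)
        payload = {"prompt": prompt, "output": output}
        path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        return True
    saved = json.loads(path.read_text(encoding="utf-8"))
    # A changed prompt invalidates the snapshot just like a changed output.
    return saved["prompt"] == prompt and saved["output"] == output
```

In a pytest test, this would be called as assert check_snapshot("capital", "What is the capital of Japan? One word.", generate_with_seed, Path("tests/snapshots")), with the snapshot directory committed to the repository so CI compares against the same files.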

2. A/B prompt comparison with reduced variance

When comparing prompt A vs. prompt B, running each once with the same seed is weaker than running each three to five times across a fixed list of seeds. The latter gives you paired samples, which makes the comparison much more meaningful even if you keep temperature at a normal value to capture diversity.
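A minimal sketch of that paired setup follows. The seed values, the function name paired_comparison, and the generate and score callables are placeholders for this illustration: generate is assumed to take a prompt and a seed and return text, and score is whatever metric you already use.

```python
from statistics import mean
from typing import Callable, Sequence

# Arbitrary fixed list; what matters is that it never changes between runs.
SEEDS = (11, 22, 33, 44, 55)


def paired_comparison(
    prompt_a: str,
    prompt_b: str,
    generate: Callable[[str, int], str],
    score: Callable[[str], float],
    seeds: Sequence[int] = SEEDS,
) -> dict:
    """Score both prompts under each seed and report the per-seed differences."""
    diffs = []
    for seed in seeds:
        score_a = score(generate(prompt_a, seed))
        score_b = score(generate(prompt_b, seed))
        # Positive means prompt B scored higher under this seed.
        diffs.append(score_b - score_a)
    return {
        "diffs": diffs,
        "mean_diff": mean(diffs),
        "b_wins": sum(1 for d in diffs if d > 0),
    }
```

Looking at the per-seed differences rather than two overall averages is the point of the pairing: each seed contributes one A-versus-B comparison made under the same sampling conditions.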

3. Reproducing user-reported bugs

When a user reports a strange answer, you'll only be able to reproduce it locally if you logged the prompt together with the seed (and the model name and the temperature). Make these four fields part of your standard request log. Future-you will thank present-you.
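One possible shape for that log entry is a small frozen dataclass serialized as one JSON line per request. The class and function names here are hypothetical; adapt them to whatever logging pipeline you already have.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationRecord:
    """The four fields needed to replay a request later."""
    prompt: str
    seed: int
    model: str
    temperature: float


def to_log_line(record: GenerationRecord) -> str:
    """Serialize a record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def from_log_line(line: str) -> GenerationRecord:
    """Rebuild the record from a logged line so the call can be replayed."""
    return GenerationRecord(**json.loads(line))


if __name__ == "__main__":
    record = GenerationRecord(
        prompt="What is the capital of Japan? One word.",
        seed=42,
        model="gemini-2.5-flash",
        temperature=0.0,
    )
    print(to_log_line(record))
```

Logging the prompt verbatim matters here, since whitespace differences alone can change the output.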

When seed does not help

This is where most of the hours get burned. If you're passing a seed and still seeing variance, check the list below before assuming the API is broken.

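Temperature is high. With temperature around 0.7–1.0, the sampling distribution is wide enough that the small server-side numerical differences mentioned earlier are more likely to tip a token choice, so visible noise can remain even with a fixed seed. For maximum reproducibility, keep temperature in the 0–0.2 range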
Model name is an alias. Names like gemini-2.5-flash-latest may quietly point at a newer build. For stable tests, pin to a specific version such as gemini-2.5-flash
Multimodal inputs (images, PDFs). Image preprocessing has its own non-determinism, so multimodal calls are noticeably less stable than text-only ones. Keep snapshot tests on text input where possible
Streaming responses. Chunk boundaries can shift between calls. Always compare the assembled final text, not the chunks
Tool calls or grounding. External calls return different data depending on real-world state, which seed cannot freeze. Mock the tools in your tests
Long contexts near the limit. When the context fills toward the upper end, internal scheduling changes can introduce extra variance even at low temperature
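The streaming item in the list above is the easiest one to demonstrate offline. In this sketch the chunk lists are hand-written stand-ins for two streamed responses, not real API output, and assemble is a hypothetical helper name.

```python
from typing import Iterable


def assemble(chunks: Iterable[str]) -> str:
    """Join streamed text chunks into the final response text."""
    return "".join(chunks)


# Two made-up streams that carry the same text with different chunk boundaries.
run_1 = ["The capital ", "of Japan is ", "Tokyo."]
run_2 = ["The capital of ", "Japan is Tokyo."]

if __name__ == "__main__":
    print("chunks match:", run_1 == run_2)
    print("assembled text matches:", assemble(run_1) == assemble(run_2))
```

A test that compares the chunk lists directly would fail here even though the model said exactly the same thing both times.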

The mental model that helps me: seed flattens sampling noise, but it cannot flatten external noise. Decide which kind of noise you're seeing before reaching for new parameters.

A short story: why seed matters more in evaluation than production

Early on, I treated seed as a "test-only" knob. The shift in my thinking came from running prompt evaluations side-by-side. Without seed, the same prompt scored slightly differently each run, which made tiny prompt edits indistinguishable from sampling noise. Once I locked the seed and the model version, the score started reflecting actual prompt quality rather than dice rolls.

That's the practical reason I encourage anyone serious about prompt iteration to lock seed during evaluation runs first, even before adding it to production. It changes what your numbers mean. If your evaluation harness is downstream of an LLM-as-judge, the same logic applies — fix the judge model and judge seed too, otherwise you're measuring two layers of variance at once.

Designing seed alongside temperature and top_p

Three combinations cover almost every case I deal with day-to-day.

Tests and snapshots: temperature=0, top_p=1.0, seed=fixed. Designed to match exactly
Production responses: temperature=0.2, top_p=0.95, seed=unset. A little variation reads more natural to end users
Creative generation (copywriting, brainstorming): temperature=0.9, top_p=0.95, seed=rotated through a small set. Lets you compare candidates without drifting endlessly

Treat temperature as the lever for response quality, and seed as the lever for reproducibility. Mixing the two roles tends to produce configs that nobody can reason about three months later. For picking temperature itself, see Gemini API Temperature Best Practices by Task.
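The three combinations can be kept in one place so that nobody has to rediscover them later. This is only a sketch: the profile names, the seed values, and the build_config helper are my own inventions. The keys match the snake_case names used in the Python example earlier.

```python
# Sampling settings per use case, taken from the three combinations above.
PRESETS = {
    "test": {"temperature": 0.0, "top_p": 1.0},
    "production": {"temperature": 0.2, "top_p": 0.95},
    "creative": {"temperature": 0.9, "top_p": 0.95},
}

TEST_SEED = 42                    # arbitrary, but never changed
CREATIVE_SEEDS = (101, 202, 303)  # small rotation set, also arbitrary


def build_config(profile: str, call_index: int = 0) -> dict:
    """Return generation settings for a profile, adding a seed where one applies."""
    config = dict(PRESETS[profile])  # copy so callers cannot mutate the presets
    if profile == "test":
        config["seed"] = TEST_SEED
    elif profile == "creative":
        # Rotate through the fixed set so candidates stay comparable.
        config["seed"] = CREATIVE_SEEDS[call_index % len(CREATIVE_SEEDS)]
    # "production" deliberately carries no seed key.
    return config
```

With the google-genai SDK, the resulting dictionary should unpack directly, as in types.GenerateContentConfig(**build_config("test")).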

A neighbor worth knowing: logprobs

Seed is the right tool for stabilizing output. When you actually want to understand why a response varies, logprobs are the complement: they expose how confident the model was at each step, which often points directly at the prompt areas worth rewriting. The implementation walkthrough is in Measuring Classification Confidence with Gemini API logprobs.
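As a small taste of what that looks like, the sketch below assumes you have already pulled per-token log probabilities out of a response as (token, logprob) pairs. The values in EXAMPLE are invented, and least_confident and its 0.6 threshold are hypothetical choices for this illustration.

```python
import math


def least_confident(token_logprobs: list, threshold: float = 0.6) -> list:
    """Return (token, probability) pairs whose probability falls below the threshold."""
    flagged = []
    for token, logprob in token_logprobs:
        # A log probability converts back to a probability with exp().
        probability = math.exp(logprob)
        if probability < threshold:
            flagged.append((token, round(probability, 3)))
    return flagged


# Made-up values: the model is sure about the framing, unsure about the label.
EXAMPLE = [("The", -0.01), ("answer", -0.05), ("is", -0.02), ("positive", -0.92)]

if __name__ == "__main__":
    print(least_confident(EXAMPLE))
```

Tokens flagged this way are the places where a different sampling draw is most likely to change the answer, which makes them good candidates for prompt rewrites.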

Quick sanity check: confirming seed actually works in your stack

Before relying on seed across your test suite, run a quick confirmation in the same environment your tests use. Network proxies, alternate Vertex AI endpoints, or a mismatched SDK version can all produce subtle differences from the public API. The fastest check is to send the same minimal prompt three times back-to-back with seed=42 and temperature=0, then assert that all three responses are byte-identical. If they are, your stack is honoring seed correctly. If they're not, something in the path between your code and the model is probably dropping the seed. The usual culprit is a wrapper library that forgets to forward the parameter, or an enterprise gateway that strips fields it doesn't recognize.
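The check itself fits in a few lines. In this sketch, seed_is_honored is a hypothetical helper, and generate stands for whatever function in your stack sends the prompt with seed=42 and temperature=0 already applied.

```python
from typing import Callable


def seed_is_honored(generate: Callable[[str], str], prompt: str, runs: int = 3) -> bool:
    """Call `generate` several times and report whether every response is byte-identical."""
    outputs = [generate(prompt).encode("utf-8") for _ in range(runs)]
    return all(output == outputs[0] for output in outputs)
```

Run it through the same wrapper, proxy, and credentials your tests use. Calling the public API directly from a notebook tells you nothing about the layers in between.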

This five-second test has saved me from spending an afternoon debugging "non-deterministic" tests that were actually deterministic the entire time, just routed through a layer that silently discarded the seed.

What to try next

Open one of your current Gemini API call sites, add seed=42 and temperature=0, and run the same prompt twice. In most cases the two outputs will match exactly. If they don't, walk down the "when seed does not help" list; it's almost always one of those six issues.

Once you have that working, add a single snapshot test to CI for one important prompt. The moment your pipeline can detect quiet output drift, your ability to iterate on prompts steps up noticeably.