API / SDK · 2026-07-04 · Advanced

When Gemini API Leaks Japanese Into Your English Output Once in a While — Field Notes on Measuring the Contamination Rate and Tightening It in Stages

You told Gemini to answer in English, and 3 out of 100 runs slip a Japanese sentence into the tail. Here is why you cannot stop that 'once in a while', and a production pattern that measures the contamination rate as an SLO and tightens it with graded recovery, with working code.



I wrote what looked like a trivial handler that returns an English summary. Run it a hundred times and about three of them slip a single Japanese sentence into the tail. Twenty local runs never reproduced it; I found out days into production, from a reader. Of everything I ran into while letting Gemini API generate multilingual content as an indie developer, this "leaks once in a while" was the one that cost me the most.

What makes it nasty is that it does not look like a bug. No exception is thrown, the JSON is valid, and the overwhelming majority of outputs are flawless English. That is exactly why tests miss it and monitoring never fires while it slips through production. This piece treats "English was requested but Japanese leaks in" in a specific order: measure it before you try to stop it. Once leakage is observable as a continuous quantity, we layer System Instructions, few-shot, schema validation, and graded recovery — measuring the effect of each as we go.

Why "Once in a While" — It Happens Probabilistically, Not as a Binary

The direct cause is that Gemini is trained on a huge multilingual corpus and its attention is pulled strongly toward long Japanese input. A short instruction such as "Answer in English" appended to the prompt loses relative weight as the body grows. On top of that, spans the model judges "more accurate to keep verbatim" (proper nouns, quotations) stay in the source language unless you push harder.

The easy thing to miss is that this behavior is not deterministic. For the same input, sampling noise means some runs leak and some do not. So the real quantity is not "does it leak or not" but "how many leaks per thousand outputs." In my translation pipeline, leakage became noticeable on Gemini 2.0 Flash past roughly 2,000 input tokens, and it never dropped to exactly zero on 2.5 Pro or 3.1 Pro either. That is why "just use a stronger model" is the wrong instinct. If the adversary is a probability, the countermeasure has to be "measure it and drive it below a threshold."
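
To see why twenty local runs can miss a leak that production finds within days, it helps to put numbers on it. The sketch below assumes each call leaks independently with a fixed probability; the 3% figure is the 3-in-100 rate mentioned earlier, independence is a simplifying assumption, and the function names are illustrative.

```python
import math

def prob_no_leak(p_leak: float, runs: int) -> float:
    """Probability that `runs` independent calls all come back clean."""
    return (1.0 - p_leak) ** runs

def runs_to_observe(p_leak: float, confidence: float = 0.95) -> int:
    """Smallest run count that shows at least one leak with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_leak))

if __name__ == "__main__":
    # Chance that 20 local runs at a 3% leak rate look perfectly clean.
    print(f"{prob_no_leak(0.03, 20):.3f}")
    # Runs needed before a 3% leak is seen at least once, 95% of the time.
    print(runs_to_observe(0.03))
```

Under these assumptions, twenty clean local runs happen a little more than half the time (about 0.54), and it takes roughly 99 runs to see at least one leak with 95% confidence. A small local test set is simply too short to catch a rate this low.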

Measure the Leakage First — A Binary Detector and a Contamination Ratio

Before stacking fixes, make outputs countable. A lightweight function that detects the hiragana, katakana, and kanji Unicode ranges — and returns the fraction of contaminated characters — lets you draw thresholds later.

import re
from dataclasses import dataclass

# Hiragana and katakana (U+3040-U+30FF) plus CJK unified ideographs (U+4E00-U+9FFF).
JP_RANGE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

@dataclass
class LangCheck:
    contaminated: bool      # did any Japanese char appear
    jp_chars: int           # number of matched Japanese characters
    total_chars: int        # length of the checked text (1 if the text is empty)
    ratio: float            # contamination ratio (0.0-1.0)
    sample: str             # context around the first offending span

def check_english_output(text: str) -> LangCheck:
    matches = list(JP_RANGE.finditer(text))
    total = len(text) or 1  # avoid dividing by zero on empty output
    sample = ""
    if matches:
        i = matches[0].start()
        sample = text[max(0, i - 20): i + 20]  # 20 chars either side of the first hit
    return LangCheck(bool(matches), len(matches), total,
                     len(matches) / total, sample)

The ratio field is the point. If you lump a mild case — one trailing Japanese sentence — together with a severe one — a fully Japanese response — under the same "contaminated" flag, you will make the wrong downstream call. Keeping sample (the context around the leak) lets you tell "a proper noun survived" from "the instruction was ignored" straight from the logs.
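
As a rough illustration of how the ratio can drive a downstream decision, the sketch below maps it to a coarse severity label. The function name and the 0.2 cut-off are hypothetical, not values from my pipeline; pick the threshold from your own logs.

```python
def classify_leak(jp_ratio: float, mild_max: float = 0.2) -> str:
    """Map a contamination ratio to a coarse severity label.

    mild_max is a hypothetical cut-off; tune it against real samples.
    """
    if not 0.0 <= jp_ratio <= 1.0:
        raise ValueError("jp_ratio must be within 0.0-1.0")
    if jp_ratio == 0.0:
        return "clean"
    if jp_ratio <= mild_max:
        return "mild"    # e.g. one trailing sentence or a surviving proper noun
    return "severe"      # most of the response is Japanese

if __name__ == "__main__":
    # A 400-character summary with a 30-character Japanese tail.
    print(classify_leak(30 / 400))
```

With this cut-off, a 400-character summary carrying a 30-character Japanese tail (ratio 0.075) is labeled mild, while a mostly Japanese response is labeled severe, so the two cases no longer share one flag.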

Then run this check on every production call and record the rate over time. I emit one line per request and aggregate the daily "share of contaminated requests" — the contamination rate. Treating it as an SLO (say, keep it under 0.5%) lets you talk about a fix's effect in numbers rather than vibes.

import json
import time

def log_lang_metric(request_id, model, in_tokens, chk: LangCheck):
    # One JSON line per request; ensure_ascii=False keeps the Japanese sample readable.
    print(json.dumps({
        "ts": time.time(), "request_id": request_id, "model": model,
        "in_tokens": in_tokens, "contaminated": chk.contaminated,
        "jp_ratio": round(chk.ratio, 4), "sample": chk.sample,
    }, ensure_ascii=False))
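
One way to turn those log lines into the daily contamination rate is sketched below. It assumes the lines have been collected as JSON strings with the "ts" and "contaminated" fields shown above; the function names, the UTC day boundary, and the 0.5% constant (taken from the example SLO) are illustrative choices.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

SLO_MAX_RATE = 0.005  # 0.5%, the example target; adjust to your own SLO

def daily_contamination_rate(lines):
    """Return {YYYY-MM-DD: share of contaminated requests} from JSON log lines."""
    totals = defaultdict(int)
    leaks = defaultdict(int)
    for line in lines:
        rec = json.loads(line)
        # Bucket by UTC calendar day (an assumption; use your reporting timezone).
        day = datetime.fromtimestamp(rec["ts"], tz=timezone.utc).date().isoformat()
        totals[day] += 1
        leaks[day] += bool(rec["contaminated"])
    return {day: leaks[day] / totals[day] for day in sorted(totals)}

def slo_breaches(rates, max_rate=SLO_MAX_RATE):
    """List the days whose contamination rate exceeds the SLO threshold."""
    return [day for day, rate in rates.items() if rate > max_rate]
```

Feeding a day's worth of lines through `daily_contamination_rate` and then `slo_breaches` gives a list of dates to investigate, which is enough to start comparing one fix against another.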

Always log in_tokens alongside the contamination flag. Aggregate later and you will see that leakage correlates strongly with input length, and once you see the correlation, you can spend effort where it pays: harden only the long inputs.
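
A minimal sketch of that aggregation follows, assuming the log records have already been parsed into dicts with "in_tokens" and "contaminated" keys. The bucket edges and the function name are hypothetical; choose edges that match your own traffic.

```python
from collections import defaultdict

def rate_by_input_length(records, edges=(500, 1000, 2000, 4000)):
    """Contamination rate per input-length bucket.

    records: iterable of dicts with "in_tokens" and "contaminated".
    edges: hypothetical bucket upper bounds, in tokens, in ascending order.
    """
    totals = defaultdict(int)
    leaks = defaultdict(int)
    for rec in records:
        # First edge the input fits under, or an overflow bucket past the last edge.
        label = next((f"<={e}" for e in edges if rec["in_tokens"] <= e),
                     f">{edges[-1]}")
        totals[label] += 1
        leaks[label] += bool(rec["contaminated"])
    return {label: leaks[label] / totals[label] for label in totals}
```

If the rate climbs from one bucket to the next, the bucket where it crosses your SLO is a reasonable place to start applying the heavier countermeasures.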
