GEMINI LAB
API / SDK · 2026-07-05 · Advanced

Building Conversational Translation Into an App: Speech-to-Speech With the Live API

A design walkthrough for adding speech-to-speech conversational translation to an app with Gemini 3.5 Live Translate and the Live API, covering session lifetime, automatic language switching, latency budgets, and streaming cost, with working code.

Tags: Gemini Live API, Live Translate, voice translation, real-time, architecture

I still regret the years I thought about localization only in terms of text.

I run a few small, calming apps on my own, and one day an overseas user told me they wished the audio guidance could speak in their own language. At the time, the only approach I could imagine was translating text and reading it aloud. What came back was a stranger's voice, stripped of the pauses and inflection that make speech feel human.

Gemini 3.5 Live Translate, released in July 2026, quietly rewrites that premise. It detects the spoken language automatically across more than 70 languages and translates speech into speech while preserving the speaker's intonation. The single fact that it never routes through text changes the quality of the experience entirely.

This article lays out how to build conversational translation into an app using the Live API that underpins Live Translate. Rather than just how to call it, I focus on the four things you will always hit in production: latency, language switching, disconnection, and cost.

The design fork: never routing through text

The traditional translation flow chained three independent stages in series: speech recognition, translation, and speech synthesis. Each stage adds waiting time, and the pauses and emotion get sanded off as everything passes through text.

What Live Translate and the Live API change is that they fold those three stages into a single persistent session. You stream audio in, and the model returns translated audio. Intermediate text is available if you want it, but it is no longer the star of the experience.

For a designer, the implication is concrete. We no longer live in a world where we tune recognition accuracy and synthesis naturalness separately. Instead, the center of gravity shifts to a networking and session problem: how to keep one stream flowing without interruption, and how to keep it flowing cheaply.
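To make the cost of chaining stages in series concrete, here is a minimal sketch of how per-stage waits add up. The stage names follow the text above; the millisecond figures are placeholder assumptions chosen only for illustration, not measurements of any real service.

```python
# Illustrative only: these millisecond values are made-up placeholders,
# not benchmarks of any real recognition, translation, or synthesis service.
CASCADED_STAGES_MS = {
    "speech_recognition": 300,
    "translation": 200,
    "speech_synthesis": 250,
}


def time_to_first_audio_ms(stages: dict[str, int]) -> int:
    """Stages chained in series: each one must finish before the next
    starts, so their waiting times simply add."""
    return sum(stages.values())


if __name__ == "__main__":
    total = time_to_first_audio_ms(CASCADED_STAGES_MS)
    print(f"cascaded pipeline: {total} ms before the listener hears anything")
```

With a single persistent session there is no per-stage sum to tune. The number worth tracking is the end-to-end delay you measure yourself, from the end of the speaker's phrase to the first translated audio chunk.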

The minimal setup: open a session, stream audio in, get audio back

Here is the skeleton. The Live API communicates bidirectionally over WebSockets, but the google-genai SDK abstracts it as an asynchronous session.

import asyncio
from google import genai
from google.genai import types
 
client = genai.Client(api_key="YOUR_API_KEY")
 
MODEL = "gemini-3.1-flash-live-preview"
 
config = {
    "response_modalities": ["AUDIO"],
    "system_instruction": (
        "You are a simultaneous interpreter. Detect the language the "
        "speaker uses. Interpret English into Japanese and Japanese into "
        "English, preserving intonation and tone, in speech. Do not "
        "summarize or add anything; translate only what was said."
    ),
    "input_audio_transcription": {},
    "output_audio_transcription": {},
}
 
 
async def translate_stream(mic_frames):
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        async def pump_audio():
            async for chunk in mic_frames:  # 16-bit PCM / 16kHz / little-endian
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )
 
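        sender = asyncio.create_task(pump_audio())
        try:
            # In the google-genai SDK, receive() stops after each completed
            # model turn, so loop to keep interpreting across turns.
            while not sender.done():
                async for response in session.receive():
                    content = response.server_content
                    if not content:
                        continue
                    if content.input_transcription:
                        print(f"[source] {content.input_transcription.text}")
                    if content.output_transcription:
                        print(f"[target] {content.output_transcription.text}")
                    if content.model_turn:
                        for part in content.model_turn.parts:
                            if part.inline_data:
                                yield part.inline_data.data  # 24kHz PCM audio
            await sender  # re-raise any error from the sending task
        finally:
            sender.cancel()  # stop the mic pump if the consumer exits early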

Three things matter here: input audio is sent as raw 16-bit, 16 kHz PCM; output comes back as chunks of native 24 kHz PCM audio; and transcriptions arrive in their own fields, separate from the audio.
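The audio formats above can be handled with a couple of small standard-library helpers. This is a sketch under stated assumptions: `split_pcm` and `pcm_to_wav` are hypothetical names, the 100 ms frame size is an arbitrary starting point, and the output is assumed to be 16-bit mono like the input (the code above only states 24 kHz PCM).

```python
import io
import wave

INPUT_RATE_HZ = 16_000   # from the code above: 16-bit PCM, 16 kHz, little-endian
OUTPUT_RATE_HZ = 24_000  # from the code above: 24 kHz PCM output
BYTES_PER_SAMPLE = 2     # 16-bit mono (assumed for the output side as well)


def split_pcm(pcm: bytes, frame_ms: int = 100, rate_hz: int = INPUT_RATE_HZ):
    """Yield fixed-duration chunks of mono 16-bit PCM.

    The final chunk may be shorter than frame_ms.
    """
    frame_bytes = rate_hz * BYTES_PER_SAMPLE * frame_ms // 1000
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


def pcm_to_wav(chunks, rate_hz: int = OUTPUT_RATE_HZ) -> bytes:
    """Wrap raw mono 16-bit PCM chunks in a WAV container.

    Handy for saving translated audio to disk while debugging.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(rate_hz)
        wav.writeframes(b"".join(chunks))
    return buffer.getvalue()
```

At 16 kHz and 16 bits per sample, one second of input is 32,000 bytes, so a 100 ms frame is 3,200 bytes. Smaller frames reach the model sooner but mean more messages per second, so the frame size is one of the knobs to revisit when tuning latency.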

Enabling input_audio_transcription and output_audio_transcription gives you a hook for captions, logs, and the language detection discussed below. Even if you only plan to handle audio, I recommend leaving these on. Your operational observability improves dramatically.
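As one possible way to use the transcription hook for observability, the sketch below guesses whether a transcript fragment is Japanese or English from its characters and records when the speaker switches. It is a hypothetical illustration for this English/Japanese pair only, not how the model itself detects language: `guess_language` and `LanguageSwitchTracker` are invented names, and the character-range check would misread Chinese text as Japanese.

```python
def guess_language(text: str) -> str:
    """Return "ja" if the text contains kana or kanji, "en" if it contains
    ASCII letters, and "unknown" otherwise (digits, punctuation, empty)."""
    for ch in text:
        code = ord(ch)
        # Hiragana and Katakana (U+3040-U+30FF), CJK ideographs (U+4E00-U+9FFF)
        if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            return "ja"
    if any(ch.isascii() and ch.isalpha() for ch in text):
        return "en"
    return "unknown"


class LanguageSwitchTracker:
    """Remember the last detected language and report changes."""

    def __init__(self) -> None:
        self.current = None

    def update(self, transcript: str) -> bool:
        """Return True when the detected language changes.

        The first recognizable fragment also counts as a change (from None).
        Fragments with no letters are ignored.
        """
        lang = guess_language(transcript)
        if lang == "unknown" or lang == self.current:
            return False
        self.current = lang
        return True


if __name__ == "__main__":
    tracker = LanguageSwitchTracker()
    for fragment in ["Hello", "how are you", "ありがとう"]:
        if tracker.update(fragment):
            print(f"speaker language is now: {tracker.current}")
```

Inside the receive loop, `tracker.update(content.input_transcription.text)` would be the natural call site. Transcriptions can arrive as partial fragments, so a log line per switch is more useful than a log line per fragment.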
