GEMINI LAB
Dev Tools · 2026-07-03 · Advanced

Stop Making Listeners Wait for the Whole File — Wiring Gemini TTS Streaming into Your Delivery Path

gemini-3.1-flash-tts-preview now streams audio via streamGenerateContent. A delivery path with 1.8s to first sound, covering PCM boundary handling, sentence-level resume, and a fallback for preview shutdown.

Tags: Gemini API, TTS, streaming, audio generation, FastAPI


As an indie developer, I have been quietly experimenting with making the articles on my sites listenable as audio.

The bottleneck was never generation quality — it was waiting. Feeding a 3,800-character draft to batch TTS took an average of 41 seconds before a finished file existed. That is fine for podcast-style pre-rendering. It is not fine for a reader who just pressed a "listen" button on the page.

The July 2026 update changed the premise: gemini-3.1-flash-tts-preview now supports streaming audio generation through streamGenerateContent, so you can deliver audio while it is being made instead of after (Gemini API changelog).

I rebuilt my delivery path around it. The SDK call itself is easy; the design decisions live downstream, in how you actually deliver the bytes. This article documents the configuration I settled on, with code and measured numbers.

What Actually Changes Between "Render Then Deliver" and "Deliver While Rendering"

Batch and streaming TTS look similar but have different centers of gravity. Sorting this out first keeps later decisions honest.

| Aspect | Batch (render, then deliver) | Streaming (deliver while rendering) |
| --- | --- | --- |
| Time to first sound | Full render must finish (measured 41s / 3,800 chars) | First chunk arrival (measured 1.8s) |
| Artifact | A finished file (WAV/MP3) | A sequence of PCM chunks; the file is assembled later, if at all |
| Failure semantics | Regenerate from scratch (idempotent) | The listener already heard part of it; you must decide where to resume |
| Best fit | Podcasts, video narration, pre-rendered archives | Listen buttons, conversational UI, on-the-spot playback |

My conclusion up front: I kept batch for archived audio and switched only the on-the-spot listening path to streaming. There is no need to force everything onto one mode.
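
The "time to first sound" figures in the comparison can be reproduced with a small pass-through wrapper. This is a minimal sketch, not the harness used for the numbers above: `timed_chunks` and its `on_first` callback are hypothetical names, and it measures on the server from the first pull on the iterator to the first non-empty chunk, so it excludes network and player latency.

```python
import time
from typing import Callable, Iterable, Iterator


def timed_chunks(
    chunks: Iterable[bytes],
    on_first: Callable[[float], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[bytes]:
    """Pass chunks through unchanged and report seconds until the first non-empty one.

    The clock starts when the consumer first pulls from this generator,
    which is also when a lazy upstream generator would start its request.
    """
    start = clock()
    reported = False
    for chunk in chunks:
        if chunk and not reported:
            on_first(clock() - start)  # called exactly once
            reported = True
        yield chunk


if __name__ == "__main__":
    # Stand-in for a real chunk source; any iterable of bytes works here.
    fake = (b"\x00\x00" * 1200 for _ in range(3))
    for _ in timed_chunks(fake, lambda s: print(f"first chunk after {s:.3f}s")):
        pass
```

Wrapping the generator rather than timing inside it keeps the measurement independent of which model or mode produced the chunks, so the same helper can time a batch fallback.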

The Intake — Pulling PCM Chunks Out of streamGenerateContent

The server-side intake is short. The key fact: what you get from each chunk is raw PCM — 24kHz, 16-bit, mono.

# tts_stream.py — streaming TTS intake
from google import genai
from google.genai import types
 
client = genai.Client()  # reads GEMINI_API_KEY from the environment
 
TTS_MODEL = "gemini-3.1-flash-tts-preview"  # keep this in config — see the last section
 
def stream_tts(text: str):
    """Turn text into a generator of audio chunks (24kHz 16-bit mono PCM)."""
    stream = client.models.generate_content_stream(
        model=TTS_MODEL,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name="Kore"
                    )
                )
            ),
        ),
    )
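    for chunk in stream:
        if not chunk.candidates:
            continue
        content = chunk.candidates[0].content
        if content is None or not content.parts:
            continue  # some chunks carry only metadata and no audio
        for part in content.parts:
            if part.inline_data and part.inline_data.data:
                yield part.inline_data.data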
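
For the archive path, where "the file is assembled later", the same chunks can be collected into a WAV after the fact. The sketch below uses the standard-library `wave` module and the 24kHz, 16-bit, mono format stated above; it assumes the samples are little-endian, as WAV expects. `pcm_chunks_to_wav` is a hypothetical helper name, not part of the article's code.

```python
import io
import wave
from typing import Iterable

SAMPLE_RATE_HZ = 24_000   # 24kHz, as stated above
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1              # mono
BYTES_PER_SECOND = SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS  # 48,000


def pcm_chunks_to_wav(chunks: Iterable[bytes]) -> bytes:
    """Collect raw PCM chunks into an in-memory WAV file.

    Assumption: the PCM is little-endian, which is what WAV stores.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        for chunk in chunks:
            wav.writeframes(chunk)
    # wave fills in the header's size fields when the writer closes,
    # so the complete file only exists after the last chunk has arrived.
    return buf.getvalue()


if __name__ == "__main__":
    silence = [b"\x00\x00" * 12_000, b"\x00\x00" * 12_000]  # 1 second total
    data = pcm_chunks_to_wav(silence)
    print(len(data), "bytes,", (len(data) - 44) / BYTES_PER_SECOND, "seconds")
```

In real use the argument would be `stream_tts(text)`. The size fields in the header are the reason this only suits archiving: they cannot be final until the stream has ended.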

Putting it on HTTP is plain chunked transfer with FastAPI:

# server.py — deliver over chunked transfer
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from tts_stream import stream_tts
 
app = FastAPI()
 
@app.get("/tts")
def tts(text: str):
    return StreamingResponse(
        stream_tts(text),
        media_type="audio/L16;rate=24000;channels=1",
        headers={"Cache-Control": "no-store"},
    )

The audio/L16 content type is deliberate — which brings us to the question of why we are not simply returning a WAV.
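
One consequence of sending raw 16-bit PCM over chunked transfer is that the receiver cannot assume each network chunk ends on a sample boundary. A chunk with an odd byte count splits a sample in two. The following is a minimal sketch of a carry-over buffer that realigns the stream. It is written in Python to match the server code, `align_to_int16` is a hypothetical name, and a browser client would apply the same logic before building an `Int16Array`.

```python
from typing import Iterable, Iterator


def align_to_int16(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Re-chunk a byte stream so every yielded block holds whole 16-bit samples."""
    carry = b""  # at most one leftover byte from the previous chunk
    for chunk in chunks:
        data = carry + chunk
        usable = len(data) - (len(data) % 2)  # largest even prefix
        carry = data[usable:]
        if usable:
            yield data[:usable]
    # A trailing odd byte is an incomplete sample, so it is dropped.


if __name__ == "__main__":
    ragged = [b"\x01", b"\x02\x03", b"", b"\x04\x05\x06"]
    for block in align_to_int16(ragged):
        print(len(block), block.hex())
```

Carrying a single byte forward keeps added latency at zero samples, since every complete sample is released as soon as its second byte arrives.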

WHAT YOU'LL LEARN

  • A minimal FastAPI setup that relays streamGenerateContent audio chunks over chunked transfer, with a measured 1.8s to first sound, roughly 23x faster than batch
  • Three options for the "WAV header needs a length you don't have yet" problem, why raw PCM plus client-side playback won, and the Int16 boundary carry-over buffer you'll need
  • A resume design that restarts from sentence boundaries instead of byte offsets, plus an automatic batch fallback for the day the preview model goes away