Dev Tools / 2026-07-03 / Advanced

Stop Making Listeners Wait for the Whole File — Wiring Gemini TTS Streaming into Your Delivery Path

gemini-3.1-flash-tts-preview now streams audio via streamGenerateContent. A delivery path with 1.8s to first sound, covering PCM boundary handling, sentence-level resume, and a fallback for preview shutdown.

Gemini API · TTS · streaming · audio generation · FastAPI

As an indie developer, I have been quietly experimenting with turning the articles on my sites into audio that readers can listen to on the page.

The bottleneck was never generation quality — it was waiting. Feeding a 3,800-character draft to batch TTS took an average of 41 seconds before a finished file existed. That is fine for podcast-style pre-rendering. It is not fine for a reader who just pressed a "listen" button on the page.

The July 2026 update changed the premise: gemini-3.1-flash-tts-preview now supports streaming audio generation through streamGenerateContent, so you can deliver audio while it is being made instead of after (Gemini API changelog).

I rebuilt my delivery path around it. The SDK call itself is easy; the design decisions live downstream, in how you actually deliver the bytes. This article documents the configuration I settled on, with code and measured numbers.

What Actually Changes Between "Render Then Deliver" and "Deliver While Rendering"

Batch and streaming TTS look similar but have different centers of gravity. Sorting this out first keeps later decisions honest.

Aspect	Batch (render, then deliver)	Streaming (deliver while rendering)
Time to first sound	Full render must finish (measured 41s / 3,800 chars)	First chunk arrival (measured 1.8s)
Artifact	A finished file (WAV/MP3)	A sequence of PCM chunks; the file is assembled later, if at all
Failure semantics	Regenerate from scratch — idempotent	The listener already heard part of it; you must decide where to resume
Best fit	Podcasts, video narration, pre-rendered archives	Listen buttons, conversational UI, on-the-spot playback

My conclusion up front: I kept batch for archived audio and switched only the on-the-spot listening path to streaming. There is no need to force everything onto one mode.

The Intake — Pulling PCM Chunks Out of streamGenerateContent

The server-side intake is short. The key fact: what you get from each chunk is raw PCM — 24kHz, 16-bit, mono.

# tts_stream.py — streaming TTS intake
from google import genai
from google.genai import types
 
client = genai.Client()  # reads GEMINI_API_KEY from the environment
 
TTS_MODEL = "gemini-3.1-flash-tts-preview"  # keep this in config — see the last section
 
def stream_tts(text: str):
    """Turn text into a generator of audio chunks (24kHz 16-bit mono PCM)."""
    stream = client.models.generate_content_stream(
        model=TTS_MODEL,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name="Kore"
                    )
                )
            ),
        ),
    )
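    for chunk in stream:
        if not chunk.candidates:
            continue
        content = chunk.candidates[0].content
        if content is None or not content.parts:
            continue  # nothing to play in this chunk
        # yield every audio part, not only the first one
        for part in content.parts:
            if part.inline_data and part.inline_data.data:
                yield part.inline_data.data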
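
Because the format is fixed (24kHz, 16-bit, mono), the byte rate is simple arithmetic, and it is handy to have it as a helper when sizing buffers or sanity-checking how much audio a given number of bytes represents. A minimal sketch; the function names are my own and not part of any SDK:

```python
SAMPLE_RATE_HZ = 24_000   # from the article: 24kHz
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # mono


def pcm_bytes_per_second() -> int:
    """Bytes of raw PCM needed for one second of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def pcm_duration_seconds(num_bytes: int) -> float:
    """Playback length, in seconds, of num_bytes of raw PCM."""
    return num_bytes / pcm_bytes_per_second()
```

At this format one second of audio is 48,000 bytes, so a 9,600-byte chunk carries 200ms of sound.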

Putting it on HTTP is plain chunked transfer with FastAPI:

# server.py — deliver over chunked transfer
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from tts_stream import stream_tts
 
app = FastAPI()
 
@app.get("/tts")
def tts(text: str):
    return StreamingResponse(
        stream_tts(text),
        media_type="audio/L16;rate=24000;channels=1",
        headers={"Cache-Control": "no-store"},
    )

The audio/L16 content type is deliberate — which brings us to the question of why we are not simply returning a WAV.

The WAV Header Problem — Choosing a Format When the Length Isn't Known Yet

A WAV header contains a data-length field. In streaming generation the total duration is unknown until the very end, so you cannot write a correct header up front. I evaluated three options.

1. Write a fake maximum length into the header and feed it to an <audio> tag. Works in some browsers, but duration display breaks and I hit playback stalls in Safari-family browsers. Rejected.
2. Transcode to MP3 on the fly server-side. Requires a resident ffmpeg, a dependency I did not want to bring into my Cloudflare Workers-centric setup. Rejected.
3. Deliver raw PCM and play it with the Web Audio API on the client. The header problem simply disappears. Adopted.
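
Choosing raw PCM for delivery does not rule out a WAV file later. Once the stream has ended, the total length is known, so a correct header can be written at that point. The sketch below uses Python's standard wave module; the function name pcm_to_wav is hypothetical, and it assumes you have collected the chunks into a single bytes object on the server:

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24_000) -> bytes:
    """Wrap finished raw PCM (16-bit mono) in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)         # mono
        wav.setsampwidth(2)         # 16-bit = 2 bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)        # the length fields are filled in from this data
    return buf.getvalue()
```

Usage would be something like pcm_to_wav(b"".join(chunks)) after the generator is exhausted, which is one way to feed the archive path from the same render.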

Here is the client. There is one pitfall I only discovered by implementing it: the chunks the browser hands you do not respect Int16 boundaries. A chunk can end on an odd byte. Read it straight into an Int16Array and you get an exception; drop the stray byte instead and every later sample is misaligned by one byte, which plays as noise. A one-byte carry-over buffer absorbs it.

// player.js — play raw PCM as it arrives (with Int16 boundary carry-over)
// Call this from a user gesture (e.g. a click) so the browser allows audio to start.
async function playStream(text) {
  const res = await fetch(`/tts?text=${encodeURIComponent(text)}`);
  if (!res.ok || !res.body) throw new Error(`TTS request failed: ${res.status}`);
  const reader = res.body.getReader();
  const ctx = new AudioContext({ sampleRate: 24000 });
  let playhead = ctx.currentTime + 0.3; // 300ms initial buffer
  let carry = new Uint8Array(0);        // odd-byte carry-over
 
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
 
    // prepend the byte left over from the previous chunk, keep an even byte count
    const merged = new Uint8Array(carry.length + value.length);
    merged.set(carry); merged.set(value, carry.length);
    const usable = merged.length - (merged.length % 2);
    carry = merged.slice(usable);
    if (usable === 0) continue;         // not even one full sample yet
 
    const pcm = new Int16Array(merged.buffer, 0, usable / 2);
    const buf = ctx.createBuffer(1, pcm.length, 24000);
    const ch = buf.getChannelData(0);
    for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768;
 
    const src = ctx.createBufferSource();
    src.buffer = buf;
    src.connect(ctx.destination);
    // after a late chunk, never schedule in the past (buffers would overlap)
    playhead = Math.max(playhead, ctx.currentTime);
    src.start(playhead);
    playhead += buf.duration;
  }
}
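
The same carry-over idea can also be applied on the server, so that every chunk leaving the API process already holds a whole number of samples. This is a sketch of that variant in Python, not what the article's server does; align_to_int16 is a hypothetical helper, and because proxies and the browser can still re-split the body, the client-side buffer remains necessary:

```python
from typing import Iterable, Iterator


def align_to_int16(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Re-chunk a byte stream so each yielded chunk has an even length."""
    carry = b""
    for chunk in chunks:
        data = carry + chunk
        usable = len(data) - (len(data) % 2)  # largest even prefix
        carry = data[usable:]                 # zero or one leftover byte
        if usable:
            yield data[:usable]
    # A trailing odd byte is half a sample and cannot be played, so it is dropped.
```

Wrapping the generator from the intake section would look like align_to_int16(stream_tts(text)).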

The 300ms initial buffer exists to absorb jitter in chunk arrival. At 100ms I got an audible click every few dozen seconds in my environment; at 300ms it disappeared. This number depends on your network and region, so measure on real devices before tightening it.
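
One way to turn "measure before tightening" into a number is to log when each chunk arrived and how many bytes it held, then compute the smallest head start that would have avoided every gap in that session. The sketch below assumes arrival times are recorded in seconds relative to the moment the playhead is initialised; min_initial_buffer and its parameters are hypothetical names:

```python
def min_initial_buffer(
    arrivals: list[float],
    chunk_bytes: list[int],
    bytes_per_second: int = 48_000,  # 24kHz * 2 bytes * 1 channel
) -> float:
    """Smallest initial buffer (seconds) that avoids underruns for one recorded session."""
    needed = 0.0
    audio_before = 0.0  # seconds of audio scheduled ahead of the current chunk
    for arrived_at, nbytes in zip(arrivals, chunk_bytes):
        # Chunk i is due at (buffer + audio_before), so it must arrive no later than that.
        needed = max(needed, arrived_at - audio_before)
        audio_before += nbytes / bytes_per_second
    return needed
```

Running this over several real sessions and taking a high percentile of the results gives a starting value that reflects your own network rather than mine.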

Failure Semantics — Resume at Sentence Boundaries, Not Byte Offsets

This is the design decision unique to streaming. With batch, failure means "regenerate everything" and nobody notices. With streaming, the listener has already heard part of the audio; restarting from the top is the worst possible experience.

My first instinct was byte-offset resume. I abandoned it quickly: generation is not deterministic, so a second render of the same text does not match the first one byte-for-byte. Offset-based resume cannot work even in principle.

So I raised the resume unit to the sentence. Split the draft into sentences; the client only remembers how many sentences it has finished playing. On reconnect, send the remainder of the draft.

# resume.py — sentence-boundary resume
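import re
 
from fastapi import APIRouter, Query, Response
from fastapi.responses import StreamingResponse
from tts_stream import stream_tts
 
router = APIRouter()  # in server.py: app.include_router(router)
 
def split_sentences(text: str) -> list[str]:
    # split after ., ! or ? followed by whitespace; drop empty pieces
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
 
@router.get("/tts/resume")
def tts_resume(text: str, done_sentences: int = Query(0, ge=0)):
    sentences = split_sentences(text)
    remaining = " ".join(sentences[done_sentences:])
    if not remaining:
        return Response(status_code=204)  # nothing left to read
    return StreamingResponse(
        stream_tts(remaining),
        media_type="audio/L16;rate=24000;channels=1",
        headers={"Cache-Control": "no-store"},
    )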

The client advances done_sentences by accumulating estimated per-sentence duration (derived from character counts; see the coefficient below). Because the count only moves forward when a sentence has fully played, resume replays a little (1.4 sentences on average), which is an acceptable cost compared with starting over.
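
The estimation step can be sketched as follows. This is written in Python for consistency with the server code, although in practice the logic runs on the client; sentences_finished is a hypothetical name, and chars_per_second stands for the flat coefficient, whose value you would have to fit to your own voice and text:

```python
def sentences_finished(
    sentences: list[str],
    played_seconds: float,
    chars_per_second: float,
) -> int:
    """Count the sentences whose estimated end time is at or before played_seconds."""
    elapsed = 0.0
    done = 0
    for sentence in sentences:
        elapsed += len(sentence) / chars_per_second  # estimated duration of this sentence
        if elapsed > played_seconds:
            break  # this sentence was still playing when the connection dropped
        done += 1
    return done
```

Only fully played sentences are counted, so the estimate errs toward replaying a partially heard sentence rather than skipping one.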

Measured Numbers, and Insurance for Running a Preview Model in Production

Measured across ten 3,800-character article drafts:

Time to first sound: batch averaged 41.2s → streaming averaged 1.8s (roughly 23x)
Chunk arrival interval: median ~180ms, p95 ~420ms (the basis for the 300ms initial buffer)
Duration estimation: a flat characters-per-second coefficient predicts audio length within ±5% of the measured value for my drafts, which is also useful for detecting missing audio
Disconnect rate: about 2% on connections longer than five minutes; with sentence-boundary resume, only 1.4 sentences get regenerated on average
Cost: the amount of audio generated is identical, so I observed no cost difference between batch and streaming
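
If you want to reproduce the first-sound figure against your own setup, a timer around any chunk iterator is enough. A minimal sketch, assuming the iterator is lazy (as a generator function is) so that the request only starts once iteration begins; time_to_first_chunk is a hypothetical helper:

```python
import time
from typing import Iterable, Optional


def time_to_first_chunk(chunks: Iterable[bytes]) -> Optional[float]:
    """Seconds until the first non-empty chunk arrives, or None if the stream is empty."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:
            return time.perf_counter() - start
    return None
```

This measures server-side arrival only; the listener additionally waits for the network hop and the client's initial buffer.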

One last thing I would not gloss over: this is a preview model. In June, one of my automations stopped when the image preview models were shut down on June 25, and since then I have written the escape route before putting any preview model on a production path.

# fallback.py — degrade to batch when the preview goes away
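from tts_stream import stream_tts
 
def tts_with_fallback(text: str):
    sent_any = False
    try:
        for chunk in stream_tts(text):
            sent_any = True
            yield chunk
    except Exception as e:
        gone = "NOT_FOUND" in str(e) or "deprecated" in str(e).lower()
        if not gone or sent_any:
            # a mid-stream failure is the resume endpoint's job; replaying the
            # whole file here would repeat audio the listener already heard
            raise
        # streaming unavailable: fall back to batch (slower, but the path survives)
        # batch_tts is your existing batch implementation; it must return the same
        # headerless 24kHz 16-bit mono PCM that the client expects
        yield batch_tts(text)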

Keep the model name in config, and health-check streaming availability once at startup. Those two habits turn a sudden preview shutdown into a degradation instead of an outage.
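
The startup health check can be as small as trying to pull one chunk and recording whether it worked. The sketch below takes the streaming function as a parameter so it can be exercised without the network; streaming_available and probe_text are hypothetical names, and note that a real probe costs one short generation request:

```python
from typing import Callable, Iterable


def streaming_available(
    stream_fn: Callable[[str], Iterable[bytes]],
    probe_text: str = "ok",
) -> bool:
    """Return True if stream_fn yields at least one non-empty chunk for a short probe."""
    try:
        for chunk in stream_fn(probe_text):
            if chunk:
                return True  # first audio arrived; no need to drain the rest
    except Exception:
        return False         # any failure counts as "not available" at startup
    return False             # stream ended without producing audio
```

At startup you might call streaming_available(stream_tts) once and route the listen button to batch when it returns False.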

Start by leaving your batch TTS untouched and swapping only the listen-button path over to this setup. First sound in under two seconds changes the experience more than the number suggests. I hope this saves you a rebuild or two.

Thank You for Reading

Gemini Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.