●API: The Gemini API now processes over 16 billion tokens per minute, roughly on par with OpenAI●ENTERPRISE: Gemini Enterprise passes 8 million paid seats across more than 2,800 companies●AGENT: Claude Opus 4.8 arrives on Gemini Enterprise Agent Platform, expanding multi-vendor choices●SPEECH: gemini-3.1-flash-tts-preview adds streaming speech generation via streamGenerateContent●DATA: Crossbeam data stores can now connect to Gemini Enterprise in public preview●MODEL: Gemini 3.5 Flash GA and Gemma 4 round out options for agentic and lightweight workloads
Stop Making Listeners Wait for the Whole File — Wiring Gemini TTS Streaming into Your Delivery Path
gemini-3.1-flash-tts-preview now streams audio via streamGenerateContent. A delivery path with 1.8s to first sound, covering PCM boundary handling, sentence-level resume, and a fallback for preview shutdown.
As an indie developer, I have been quietly experimenting with making the articles on my sites listenable as audio.
The bottleneck was never generation quality — it was waiting. Feeding a 3,800-character draft to batch TTS took an average of 41 seconds before a finished file existed. That is fine for podcast-style pre-rendering. It is not fine for a reader who just pressed a "listen" button on the page.
The July 2026 update changed the premise: gemini-3.1-flash-tts-preview now supports streaming audio generation through streamGenerateContent, so you can deliver audio while it is being made instead of after (Gemini API changelog).
I rebuilt my delivery path around it. The SDK call itself is easy; the design decisions live downstream, in how you actually deliver the bytes. This article documents the configuration I settled on, with code and measured numbers.
What Actually Changes Between "Render Then Deliver" and "Deliver While Rendering"
Batch and streaming TTS look similar but have different centers of gravity. Sorting this out first keeps later decisions honest.
| Aspect | Batch (render, then deliver) | Streaming (deliver while rendering) |
| --- | --- | --- |
| Time to first sound | Full render must finish (measured 41s / 3,800 chars) | First chunk arrival (measured 1.8s) |
| Artifact | A finished file (WAV/MP3) | A sequence of PCM chunks; the file is assembled later, if at all |
| Failure semantics | Regenerate from scratch (idempotent) | The listener already heard part of it; you must decide where to resume |
My conclusion up front: I kept batch for archived audio and switched only the on-the-spot listening path to streaming. There is no need to force everything onto one mode.
The Intake — Pulling PCM Chunks Out of streamGenerateContent
The server-side intake is short. The key fact: what you get from each chunk is raw PCM — 24kHz, 16-bit, mono.
```python
# tts_stream.py: streaming TTS intake
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment
TTS_MODEL = "gemini-3.1-flash-tts-preview"  # keep this in config; see the last section


def stream_tts(text: str):
    """Turn text into a generator of audio chunks (24kHz 16-bit mono PCM)."""
    stream = client.models.generate_content_stream(
        model=TTS_MODEL,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name="Kore"
                    )
                )
            ),
        ),
    )
    for chunk in stream:
        # Some chunks carry no candidates or no content parts; skip them.
        if not chunk.candidates:
            continue
        content = chunk.candidates[0].content
        if not content or not content.parts:
            continue
        part = content.parts[0]
        if part.inline_data and part.inline_data.data:
            yield part.inline_data.data
```
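Because the format is fixed, byte counts convert directly into playback time. The sketch below spells out that arithmetic; `pcm_seconds` is a hypothetical helper name, not part of the SDK.

```python
# Byte-count arithmetic for 24kHz, 16-bit, mono PCM.
SAMPLE_RATE = 24_000    # samples per second
BYTES_PER_SAMPLE = 2    # 16-bit samples
CHANNELS = 1            # mono
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS


def pcm_seconds(num_bytes: int) -> float:
    """Playback duration, in seconds, of num_bytes of raw PCM in this format."""
    return num_bytes / BYTES_PER_SECOND
```

One second of audio is 48,000 bytes, so summing chunk lengths as they pass through gives a running duration without decoding anything.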
Putting it on HTTP is plain chunked transfer with FastAPI:
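A minimal sketch of such an endpoint, assuming the `/tts` path and `text` query parameter that the client requests, and the same media type the resume endpoint uses. `create_app` and `MEDIA_TYPE` are hypothetical names chosen for illustration.

```python
# app.py: relay PCM chunks over chunked transfer (illustrative sketch)
from typing import Callable, Iterable

# Raw 16-bit PCM, 24 kHz, mono, matching what stream_tts yields.
MEDIA_TYPE = "audio/L16;rate=24000;channels=1"


def create_app(stream_tts: Callable[[str], Iterable[bytes]]):
    """Build a FastAPI app whose /tts route streams audio chunks as they arrive.

    create_app is a hypothetical factory; passing stream_tts in keeps the
    route easy to exercise with a fake generator.
    """
    # Imported here so this module can be loaded without FastAPI installed.
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    @app.get("/tts")
    def tts(text: str):
        # The total length is unknown up front, so the body is sent chunk by chunk.
        return StreamingResponse(stream_tts(text), media_type=MEDIA_TYPE)

    return app
```

With `app = create_app(stream_tts)` at module level, any ASGI server such as uvicorn can serve it.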
The audio/L16 content type is deliberate — which brings us to the question of why we are not simply returning a WAV.
The WAV Header Problem — Choosing a Format When the Length Isn't Known Yet
A WAV header contains a data-length field. In streaming generation the total duration is unknown until the very end, so you cannot write a correct header up front. I evaluated three options.
Write a fake maximum length into the header and feed it to an <audio> tag. Works in some browsers, but duration display breaks and I hit playback stalls in Safari-family browsers. Rejected.
Transcode to MP3 on the fly server-side. Requires a resident ffmpeg, a dependency I did not want to bring into my Cloudflare Workers-centric setup. Rejected.
Deliver raw PCM and play it with the Web Audio API on the client. The header problem simply disappears. Adopted.
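For the archive path, the header problem goes away once the stream has finished, because the total length is then known. Here is a minimal sketch using Python's standard `wave` module. `pcm_chunks_to_wav` and `data_length_field` are hypothetical helpers, and the offset of 40 assumes the plain 44-byte PCM header that `wave` writes.

```python
import io
import struct
import wave


def pcm_chunks_to_wav(chunks) -> bytes:
    """Assemble collected 24kHz 16-bit mono PCM chunks into a complete WAV file."""
    pcm = b"".join(chunks)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(24_000)  # 24 kHz
        wav.writeframes(pcm)      # header lengths are filled in from the data
    return buf.getvalue()


def data_length_field(wav_bytes: bytes) -> int:
    """Read the 32-bit little-endian data-chunk size stored at byte offset 40."""
    return struct.unpack_from("<I", wav_bytes, 40)[0]
```

The data-length field equals the number of PCM bytes written, which is exactly the value a live stream cannot supply at the moment the header would have to go out.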
Here is the client. There is one pitfall I only discovered by implementing it: HTTP chunk boundaries do not respect Int16 boundaries. A chunk can end on an odd byte; read it straight into an Int16Array and you get an exception or, worse, noise from samples shifted by one byte. A one-byte carry-over buffer absorbs it.
```javascript
// player.js: play raw PCM as it arrives (with Int16 boundary carry-over)
async function playStream(text) {
  const res = await fetch(`/tts?text=${encodeURIComponent(text)}`);
  const reader = res.body.getReader();
  const ctx = new AudioContext({ sampleRate: 24000 });
  let playhead = ctx.currentTime + 0.3; // 300ms initial buffer
  let carry = new Uint8Array(0); // odd-byte carry-over

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Prepend the byte left over from the previous chunk, if any.
    const merged = new Uint8Array(carry.length + value.length);
    merged.set(carry);
    merged.set(value, carry.length);

    // Keep only whole 16-bit samples; hold back a trailing odd byte.
    const usable = merged.length - (merged.length % 2);
    carry = merged.slice(usable);
    if (usable === 0) continue; // createBuffer rejects zero-length buffers

    const pcm = new Int16Array(merged.buffer, 0, usable / 2);
    const buf = ctx.createBuffer(1, pcm.length, 24000);
    const ch = buf.getChannelData(0);
    for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768;

    const src = ctx.createBufferSource();
    src.buffer = buf;
    src.connect(ctx.destination);
    src.start(playhead);
    playhead += buf.duration;
  }
}
```
The 300ms initial buffer exists to absorb jitter in chunk arrival. At 100ms I got an audible click every few dozen seconds in my environment; at 300ms it disappeared. This number depends on your network and region, so measure on real devices before tightening it.
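The odd-byte carry-over in player.js is easy to get subtly wrong, so it helps to exercise the same logic outside the browser. Below is a minimal Python sketch mirroring it. The function names are hypothetical, and this is a test aid rather than part of the delivery path.

```python
def split_on_sample_boundary(carry: bytes, chunk: bytes) -> tuple[bytes, bytes]:
    """Return (aligned, new_carry): aligned holds only whole 16-bit samples."""
    merged = carry + chunk
    usable = len(merged) - (len(merged) % 2)
    return merged[:usable], merged[usable:]


def realign(chunks):
    """Re-chunk a byte stream so every yielded piece has an even length."""
    carry = b""
    for chunk in chunks:
        aligned, carry = split_on_sample_boundary(carry, chunk)
        if aligned:  # skip pieces that contain no complete sample yet
            yield aligned
```

Feeding it chunks of 3, 2, and 1 bytes yields three 2-byte pieces with the byte order preserved, which is the property the player depends on.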
Failure Semantics — Resume at Sentence Boundaries, Not Byte Offsets
This is the design decision unique to streaming. With batch, failure means "regenerate everything" and nobody notices. With streaming, the listener has already heard part of the audio; restarting from the top is the worst possible experience.
My first instinct was byte-offset resume. I abandoned it quickly: generation is not deterministic, so a second render of the same text does not match the first one byte-for-byte. Offset-based resume cannot work even in principle.
So I raised the resume unit to the sentence. Split the draft into sentences; the client only remembers how many sentences it has finished playing. On reconnect, send the remainder of the draft.
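```python
# resume.py: sentence-boundary resume
import re

from fastapi import Response
from fastapi.responses import StreamingResponse

from tts_stream import stream_tts

# `app` is assumed to be the FastAPI instance that already serves /tts.


def split_sentences(text: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


@app.get("/tts/resume")
def tts_resume(text: str, done_sentences: int = 0):
    sentences = split_sentences(text)
    # Clamp so a negative count cannot slice from the end of the list.
    remaining = " ".join(sentences[max(done_sentences, 0):])
    if not remaining:
        return Response(status_code=204)
    return StreamingResponse(
        stream_tts(remaining),
        media_type="audio/L16;rate=24000;channels=1",
    )
```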
The client advances done_sentences by accumulating estimated per-sentence duration (derived from character counts; see the coefficient below). Resume does replay a little (1.4 sentences on average), which is far better than starting over.
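The counting rule can be sketched as follows. It is written in Python to match the server code, although in practice it runs on the client. `estimate_done_sentences` is a hypothetical name, and `chars_per_second` is a coefficient you would calibrate on your own drafts.

```python
def estimate_done_sentences(
    sentences: list[str], elapsed_seconds: float, chars_per_second: float
) -> int:
    """Count sentences whose estimated end time has already been played."""
    done = 0
    played = 0.0
    for sentence in sentences:
        # Estimated duration of this sentence from its character count.
        played += len(sentence) / chars_per_second
        if played > elapsed_seconds:
            break  # this sentence was still playing when the stream dropped
        done += 1
    return done
```

Only fully finished sentences are counted, so the sentence that was interrupted is sent again. That is where the small replay comes from.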
Measured Numbers, and Insurance for Running a Preview Model in Production
Measured across ten 3,800-character article drafts:
Time to first sound: batch averaged 41.2s → streaming averaged 1.8s (roughly 23x)
Chunk arrival interval: median ~180ms, p95 ~420ms (the basis for the 300ms initial buffer)
Duration estimation: a flat characters-per-second coefficient puts estimates for my drafts within ±5% of measured audio length, which is also useful for missing-audio detection
Disconnect rate: about 2% on connections longer than five minutes; with sentence-boundary resume, only 1.4 sentences get regenerated on average
Cost: the amount of audio generated is identical, so I observed no cost difference between batch and streaming
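The duration estimate doubles as a truncation check. The sketch below assumes the coefficient is fitted as total characters over total seconds. Both function names are hypothetical, and the default tolerance simply reuses the ±5% band above, so loosen it if it produces false alarms.

```python
def fit_chars_per_second(samples: list[tuple[int, float]]) -> float:
    """Fit a flat coefficient from (character_count, measured_seconds) pairs."""
    total_chars = sum(chars for chars, _ in samples)
    total_seconds = sum(seconds for _, seconds in samples)
    return total_chars / total_seconds


def looks_truncated(
    char_count: int,
    received_seconds: float,
    chars_per_second: float,
    tolerance: float = 0.05,  # assumption: mirrors the ±5% estimation error
) -> bool:
    """True when the received audio is clearly shorter than the text predicts."""
    expected_seconds = char_count / chars_per_second
    return received_seconds < expected_seconds * (1 - tolerance)
```

Run the check when a stream ends without an error: audio well short of the estimate suggests a silent cutoff rather than a clean finish.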
One last thing I would not gloss over: this is a preview model. In June an automation of mine stopped when the image preview models were shut down on June 25, and since then I write the escape route before putting any preview model on a production path.
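```python
# fallback.py: degrade to batch when the preview goes away
from tts_stream import stream_tts


def tts_with_fallback(text: str):
    started = False  # True once any streamed audio has reached the listener
    try:
        for chunk in stream_tts(text):
            started = True
            yield chunk
    except Exception as e:
        unavailable = "NOT_FOUND" in str(e) or "deprecated" in str(e).lower()
        if unavailable and not started:
            # streaming unavailable: fall back to batch (slower, but the path survives)
            audio = batch_tts(text)  # your existing batch implementation
            yield audio
        else:
            # Mid-stream failures are left to the sentence-boundary resume path.
            raise
```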
Keep the model name in config, and health-check streaming availability once at startup. Those two habits turn a sudden preview shutdown into a degradation instead of an outage.
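Those two habits might look like the sketch below. The `TTS_MODEL` environment variable and `streaming_available` are hypothetical names, and the probe assumes a streaming function shaped like `stream_tts`, taking text and yielding byte chunks.

```python
import os

# Model name comes from config, so swapping models needs no code change.
TTS_MODEL = os.environ.get("TTS_MODEL", "gemini-3.1-flash-tts-preview")


def streaming_available(stream_fn, probe_text: str = "ok") -> bool:
    """Return True if stream_fn yields at least one non-empty audio chunk."""
    try:
        for chunk in stream_fn(probe_text):
            if chunk:
                return True
        return False  # the stream ended without producing any audio
    except Exception:
        # Any failure at startup is treated as "streaming unavailable".
        return False
```

Calling `streaming_available(stream_tts)` once at startup costs one short generation. The result can decide whether the listen button uses the streaming route or goes straight to batch.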
Start by leaving your batch TTS untouched and swapping only the listen-button path over to this setup. First sound in under two seconds changes the experience more than the number suggests. I hope this saves you a rebuild or two.
Thank You for Reading
Gemini Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.