GEMINI LAB
API / SDK · 2026-06-25 · Intermediate

Turning Articles into Audio with the Gemini 3.1 Flash TTS Preview: Splitting Long Text, Stitching It Back, and What It Actually Costs

The Gemini 3.1 Flash TTS preview launched today. Here is a single-narrator pipeline that converts a written article into clean audio, including how to split long text, stitch PCM without ugly seams, keep one voice steady, and estimate the real per-article cost.

Tags: gemini-3.1-flash-tts, text-to-speech, audio, podcast, python, cost-optimization

The reason I kept putting off audio versions of my writing was always cost. I publish steadily on Dolice Labs, and running every post through a text-to-speech model never struck a workable balance between voice quality and spend.

The Gemini 3.1 Flash TTS preview, which became available today, looks like it moves that break-even line. It is expressive, cheap, and easy to steer. For a workload like article narration, where the text is long and the runs are frequent, those three traits compound. This walkthrough converts a written article into a single-narrator voiceover and gets it ready for a channel like a podcast or stand.fm, covering both the implementation and the cost side.

Why a "preview model" shifts the economics of article narration

Narrating articles is not a one-off flashy demo. You pour 3,000 to 5,000 characters of body text through it almost daily, for months. That makes the unit cost per article the real weight on the project.

What used to stop me was a binary choice: models with expressive voices cost a lot, and cheap models read flatly. The arrival of a Flash-tier TTS preview puts a realistic option in the middle of that binary.

Boiled down, there are three things to watch:

  • Characters per article (the billable volume)
  • How far prompt-level control holds (how rarely you re-record)
  • The cost of regeneration on failure (the leak in production)

Later in this piece I price the first item in real currency. The other two are mostly absorbed by how you build the pipeline, so let me clear those first.
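
To make the three factors concrete, here is a minimal sketch of how they combine into a monthly figure. It assumes billing scales with character count, as the list above frames it; if the model bills in another unit such as tokens, convert before plugging in. The function name, the regen_rate parameter, and every number in the example are hypothetical placeholders, not published pricing.

```python
def estimate_monthly_cost(
    chars_per_article: int,
    articles_per_month: int,
    price_per_million_chars: float,
    regen_rate: float = 0.0,
) -> float:
    """Rough monthly spend for article narration.

    price_per_million_chars is a HYPOTHETICAL input, not a published rate.
    regen_rate is the fraction of text synthesized a second time after a
    failed or rejected take (0.1 means 10% of characters are redone).
    """
    billable_chars = chars_per_article * articles_per_month * (1 + regen_rate)
    return billable_chars / 1_000_000 * price_per_million_chars


# Placeholder numbers only: 4,000 chars per article, 30 articles a month,
# 10% regeneration, and a made-up rate of 10.0 currency units per million chars.
monthly = estimate_monthly_cost(4000, 30, price_per_million_chars=10.0, regen_rate=0.1)
print(f"{monthly:.2f}")
```

The regeneration term is the part that is easy to forget: it multiplies the whole bill, which is why the pipeline design matters as much as the list price.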

You can't hand over long text in one shot

The first wall is that you cannot pass a whole article in a single request. A TTS model caps how much a single synthesis call can handle, and sending several thousand characters at once either cuts the audio off midway or wrecks the prosody of the second half.

Split on sentence boundaries, sub-split only the long ones

My rule is never to split inside a sentence. Cutting mid-sentence leaves an unnatural breath at the seam. I use the period as the primary boundary and sub-split only overly long sentences at commas.

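import re

def split_for_tts(text: str, max_chars: int = 280) -> list[str]:
    """Split article body into TTS-sized units.
    Never split inside a sentence; sub-split long sentences at commas."""
    # Sentence boundary: whitespace after a period, or any run of newlines.
    raw = re.split(r"(?<=\.)\s+|\n+", text.strip())
    sentences = [s.strip() for s in raw if s.strip()]

    chunks: list[str] = []
    buf = ""
    for s in sentences:
        # Only sentences longer than the limit are sub-split, at commas.
        parts = re.split(r"(?<=,)\s", s) if len(s) > max_chars else [s]
        for p in parts:
            # Count the joining space so a chunk never exceeds max_chars by one.
            if len(buf) + len(p) + (1 if buf else 0) <= max_chars:
                buf += (" " + p if buf else p)
            else:
                if buf:
                    chunks.append(buf)
                # A comma-free part longer than max_chars still becomes its own chunk.
                buf = p
    if buf:
        chunks.append(buf)
    return chunks

# A 4,000-character article lands as roughly 15 to 20 chunks of up to 280 chars each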

I keep max_chars near 280 because longer chunks flatten the prosody and shorter ones multiply the seams. That value is a compromise I arrived at by running the pipeline; the sweet spot moves with the material, so tune it.
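
When tuning, it helps to see what a given max_chars does to the chunk list before spending any synthesis calls. The helper below is a hypothetical addition (chunk_report is not part of the original pipeline): it takes any list of chunk strings and reports the count, the number of seams, and the length spread, plus the total silence that a 0.25-second gap per seam would add.

```python
from statistics import mean


def chunk_report(chunks: list[str], gap_sec: float = 0.25) -> dict:
    """Summarize a chunk list so different max_chars values can be compared."""
    lengths = [len(c) for c in chunks]
    seams = max(len(chunks) - 1, 0)  # one seam between each pair of neighbors
    return {
        "chunks": len(chunks),
        "seams": seams,
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
        "mean_len": round(mean(lengths), 1) if lengths else 0.0,
        "added_silence_sec": seams * gap_sec,
    }


sample = ["First chunk of text.", "Second, slightly longer chunk of text.", "Third."]
print(chunk_report(sample))
```

A very short "shortest" value usually means a stray heading or one-word line became its own chunk, which is worth merging by hand before synthesis.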

Join PCM without a seam

Gemini TTS returns 24 kHz, 16-bit, mono PCM. Synthesize each chunk, concatenate the raw PCM, and you get one track. The trick is to insert a short silence between chunks: with no silence the sentences collide, and with too much the narration drags. I use 0.25 seconds between chunks.

SAMPLE_RATE = 24000  # Gemini TTS output sample rate (Hz)
SILENCE_SEC = 0.25   # gap inserted between chunks

def silence_pcm(seconds: float) -> bytes:
    """Return `seconds` of 16-bit mono silence."""
    n = int(SAMPLE_RATE * seconds)
    return b"\x00\x00" * n  # one zero-valued 16-bit sample is two zero bytes

def join_pcm(chunks_pcm: list[bytes]) -> bytes:
    """Concatenate per-chunk PCM with a short silence at every seam."""
    gap = silence_pcm(SILENCE_SEC)
    return gap.join(chunks_pcm)
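
Raw PCM has no header, so most players and podcast hosts will not open it directly. A minimal sketch of wrapping the joined track in a WAV container with Python's standard wave module follows, assuming the 24 kHz, 16-bit, mono format stated above. The function names pcm_to_wav_bytes and pcm_duration_sec are illustrative, not part of the original code.

```python
import io
import wave

SAMPLE_RATE = 24000  # Hz, as stated above for Gemini TTS output
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
CHANNELS = 1         # mono


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV container and return the file contents."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_sec(pcm: bytes) -> float:
    """Playback length of a raw PCM buffer in seconds."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


# One second of silence stands in for real narration here.
demo = b"\x00\x00" * SAMPLE_RATE
wav_bytes = pcm_to_wav_bytes(demo)
print(len(wav_bytes), pcm_duration_sec(demo))
```

Writing wav_bytes to disk with an ordinary binary file write gives a playable file. Converting to MP3 or AAC for distribution needs an external encoder and is outside this sketch.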

With "split and join" as the skeleton, the pipeline survives any change in article length.
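
Splitting also limits what a failure costs: if each chunk is synthesized independently, a failed call means regenerating one chunk rather than the whole article. The sketch below shows one way to structure that loop. The synthesize argument is a hypothetical caller-supplied function that sends one chunk to the TTS model and returns raw PCM bytes; the actual API call is deliberately left out, and the retry counts and backoff are assumptions to tune.

```python
import time
from typing import Callable


def synthesize_chunks(
    chunks: list[str],
    synthesize: Callable[[str], bytes],
    max_attempts: int = 3,
    backoff_sec: float = 1.0,
) -> list[bytes]:
    """Synthesize each chunk on its own, retrying only the chunk that failed."""
    results: list[bytes] = []
    for index, chunk in enumerate(chunks):
        for attempt in range(1, max_attempts + 1):
            try:
                pcm = synthesize(chunk)
                if not pcm:
                    raise ValueError("empty audio returned")
                results.append(pcm)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    raise RuntimeError(
                        f"chunk {index} failed after {attempt} attempts"
                    ) from exc
                time.sleep(backoff_sec * attempt)  # linear backoff before retrying
    return results


# Stand-in synthesizer for a dry run: the second call fails once, the rest succeed.
calls = {"n": 0}


def fake_synthesize(text: str) -> bytes:
    calls["n"] += 1
    if calls["n"] == 2:
        raise TimeoutError("simulated transient failure")
    return b"\x01\x00" * len(text)  # fake PCM, two bytes per character


pcm_chunks = synthesize_chunks(["One.", "Two.", "Three."], fake_synthesize, backoff_sec=0.0)
print(len(pcm_chunks), calls["n"])
```

In the real pipeline the chunk list would come from split_for_tts and the returned list would go straight into join_pcm. Keeping the model call behind a plain function also makes it easy to dry-run the whole flow without spending anything.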
