API / SDK · 2026-06-25 · Intermediate

Turning Articles into Audio with the Gemini 3.1 Flash TTS Preview: Splitting Long Text, Stitching It Back, and What It Actually Costs

The Gemini 3.1 Flash TTS preview launched today. Here is a single-narrator pipeline that converts a written article into clean audio, including how to split long text, stitch PCM without ugly seams, keep one voice steady, and estimate the real per-article cost.

Tags: gemini-3.1-flash-tts, text-to-speech, audio, podcast, python, cost-optimization

The reason I kept putting off audio versions of my writing was always cost. I publish steadily on Dolice Labs, and running every single post through a text-to-speech model never balanced out between quality and money.

The Gemini 3.1 Flash TTS preview, which became available today, looks like it moves that break-even line. It keeps expression while staying cheap, and it is easy to steer. For a workload like article narration — long text, high frequency — those three traits compound. This is a walkthrough of converting a written article into a single-narrator voiceover and getting it ready for a channel like a podcast or stand.fm, looked at from both the implementation and the cost side.

Why a "preview model" shifts the economics of article narration

Narrating articles is not a one-off flashy demo. You pour 3,000 to 5,000 characters of body text through it almost daily, for months. That makes the unit cost per article the real weight on the project.

What used to stop me was a binary choice: models with expressive voices cost a lot, and cheap models read flatly. The arrival of a Flash-tier TTS preview puts a realistic option in the middle of that binary.

Boiled down, there are three things to watch:

Characters per article (the billable volume)
How far prompt-level control holds (how rarely you re-record)
The cost of regeneration on failure (the leak in production)

Later in this piece I price the first item in real currency. The other two are mostly absorbed by how you build the pipeline, so let me clear those first.

You can't hand over long text in one shot

The first wall is that you cannot pass a whole article in a single request. A single synthesis call has a limit on how much text it handles well, and dumping several thousand characters in tends to cut the audio off partway or wreck the prosody of the second half.

Split on sentence boundaries, sub-split only the long ones

My rule is never to split inside a sentence. Cutting mid-sentence makes the breath unnatural at the seam. For text, I use the period as the primary boundary and only sub-split overly long blocks at commas.

import re
 
def split_for_tts(text: str, max_chars: int = 280) -> list[str]:
    """Split article body into TTS-sized units.
    Never split inside a sentence; sub-split long sentences at commas."""
    raw = re.split(r"(?<=\.)\s+|\n+", text.strip())
    sentences = [s for s in raw if s]
 
    chunks: list[str] = []
    buf = ""
    for s in sentences:
        parts = re.split(r"(?<=,)\s", s) if len(s) > max_chars else [s]
        for p in parts:
            if len(buf) + len(p) <= max_chars:
                buf += (" " + p if buf else p)
            else:
                if buf:
                    chunks.append(buf)
                buf = p
    if buf:
        chunks.append(buf)
    return chunks
 
# A 4,000-character article lands as roughly 15 to 20 chunks of up to ~280 chars each

I keep max_chars near 280: longer chunks flatten the prosody, and shorter ones multiply the seams. That value is a compromise I arrived at by running the pipeline, and the sweet spot moves with the material, so tune it.
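
One case the splitter above leaves open is a comma-free run that is itself longer than max_chars: it passes through as a single oversize chunk. Below is a minimal sketch of a fallback, assuming you are willing to break such a run at word boundaries as a last resort. `enforce_limit` is a hypothetical helper, not part of the original pipeline.

```python
import textwrap

def enforce_limit(chunks: list[str], max_chars: int = 280) -> list[str]:
    """Hypothetical post-pass: re-wrap any chunk that is still over max_chars.

    Chunks within the limit pass through untouched; oversize ones are
    broken at whitespace (a single over-long word is cut as a last resort).
    """
    out: list[str] = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            out.append(chunk)
        else:
            out.extend(
                textwrap.wrap(chunk, width=max_chars, break_on_hyphens=False)
            )
    return out
```

It would sit right after the splitter, for example `enforce_limit(split_for_tts(article_text))`. Breaking at a word boundary still costs some naturalness at the seam, so treat it as a safety net rather than the normal path.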

Join PCM without a seam

Gemini TTS returns 24kHz, 16-bit, mono PCM. Receive the audio for each chunk, concatenate the raw PCM, and you get one track. The trick is to insert a short silence between chunks: with none, sentences collide; with too much, the read drags. I use 0.25 seconds between chunks.

import struct
 
SAMPLE_RATE = 24000  # Gemini TTS output sample rate
SILENCE_SEC = 0.25
 
def silence_pcm(seconds: float) -> bytes:
    n = int(SAMPLE_RATE * seconds)
    return struct.pack("<" + "h" * n, *([0] * n))  # 16-bit silence
 
def join_pcm(chunks_pcm: list[bytes]) -> bytes:
    gap = silence_pcm(SILENCE_SEC)
    out = bytearray()
    for i, pcm in enumerate(chunks_pcm):
        if i > 0:
            out += gap
        out += pcm
    return bytes(out)

With "split and join" as the skeleton, the pipeline survives any change in article length.
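
Because the format is fixed, byte counts translate directly into durations, which is handy for sanity-checking a joined track. Here is a small sketch. `pcm_seconds` and `expected_gap_seconds` are illustrative helper names, and the constants simply restate the 24kHz, 16-bit, mono format described above.

```python
SAMPLE_RATE = 24000      # samples per second
BYTES_PER_SAMPLE = 2     # 16-bit
CHANNELS = 1             # mono

def pcm_seconds(pcm: bytes) -> float:
    """Duration of raw PCM in seconds, assuming 24kHz / 16-bit / mono."""
    return len(pcm) / (SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)

def expected_gap_seconds(n_chunks: int, silence_sec: float = 0.25) -> float:
    """Total silence a join adds: one gap between each adjacent pair."""
    return max(n_chunks - 1, 0) * silence_sec
```

At these settings a 0.25 second gap is 12,000 bytes, and a 20-chunk article gains 4.75 seconds of silence in total.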

Keep a single narrator steady to the end

Unlike a multi-speaker podcast, article narration needs one person reading in one consistent tone all the way through. The moment you split into chunks, each chunk is an independent synthesis, so without care the timbre and pace drift bit by bit.

Pin the style instruction across every chunk

I state the voice and reading style in the prompt and pass the exact same wording to every chunk. Not varying the instruction per chunk is the shortest path to a steady read.

from google import genai
from google.genai import types
 
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
# Shared style instruction — do not change it mid-article
STYLE = (
    "Read as a calm, neutral narrator for an explanatory article. "
    "Slightly slow pace, restrained emotion, polite endings, "
    "and clear pronunciation of technical terms."
)
 
def synthesize(chunk_text: str) -> bytes:
    res = client.models.generate_content(
        model="gemini-3.1-flash-preview-tts",
        contents=f"{STYLE}\n\nBody: {chunk_text}",
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name="Charon",  # pick once, keep it for every article
                    )
                )
            ),
        ),
    )
    return res.candidates[0].content.parts[0].inline_data.data  # 24kHz PCM

My recommendation is to fix voice_name for the whole channel rather than per article. Listeners learn "this voice equals this outlet," so a changing voice reads as a different thing. On one of the Dolice Labs healing apps, a review once flagged the narration as off after I swapped voices mid-stream — voice consistency mattered more than I expected.

Normalize numbers and acronyms before synthesis

TTS struggles with context-dependent readings. Whether "3.1" should be "three point one" or something else drifts when left to the model. I pass text through a small substitution dictionary before synthesis to pin the readings I want.

READ_DICT = {
    "TTS": "T T S",
    "PCM": "P C M",
    "API": "A P I",
}
 
def normalize_reading(text: str) -> str:
    for k, v in READ_DICT.items():
        text = text.replace(k, v)
    return text

The dictionary is something you grow per channel; there is no need to be perfect from day one. Add one word at a time whenever your ear catches a stumble, and within a few weeks almost nothing trips it.
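
One thing to watch as the dictionary grows: plain `str.replace` also rewrites keys that appear inside longer words, so "RAPID" would come out as "RA P ID". Below is a minimal sketch of a whole-word variant using a regular expression. `normalize_reading_strict` is a hypothetical name, and the dictionary repeats the entries shown above.

```python
import re

READ_DICT = {
    "TTS": "T T S",
    "PCM": "P C M",
    "API": "A P I",
}

# Longest keys first so a longer entry wins over a shorter one it contains
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(READ_DICT, key=len, reverse=True))
    + r")\b"
)

def normalize_reading_strict(text: str) -> str:
    """Replace only whole-word matches of the dictionary keys."""
    return _PATTERN.sub(lambda m: READ_DICT[m.group(1)], text)
```

With this version "APIs" is left alone as well, so a plural needs its own entry if you want it spelled out.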

Estimate the per-article cost in real money

This is the heart of the economics. TTS billing scales with characters (or audio seconds), so once you know an article's character count, the unit cost is quite predictable.

Pricing can change, so always replace the assumptions with the current rate card. To show the method, I use $1.00 per one million input characters and treat the body plus the style instruction as the billable text. Because the style line travels with every chunk, a 4,000-character article split into about 20 chunks with a 60-character style line comes to roughly 5,200 billable characters (4,000 + 20 × 60).

Item	Value	Note
Body + instruction chars	~5,200	4,000 body + 20 chunks × ~60-char style line
Assumed rate	$1.00 / 1M chars	replace with live pricing
Per article	~$0.0052	5,200 ÷ 1,000,000 × $1.00
30 articles / month	~$0.16	0.0052 × 30

The number is so small it makes you double-check the decimal point, and that is exactly why I felt the preview moved the break-even line. Even if an expressive model cost 10x this, that is about $1.60 a month. With a 20% allowance for re-records you can still hold the cadence, which means audio fits into a daily publishing flow.

One caveat: a longer style instruction is billed on every chunk too. I keep the style line under about 60 characters and shared, so the overhead stays small. Write a 200-character instruction and a 4,000-character article with 20 chunks carries 4,000 extra billed characters of instructions — a quiet but real tax.
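
The overhead is easy to compute yourself. Here is a minimal sketch, assuming the style instruction is billed once per chunk and using the illustrative $1.00 per million characters from above. `estimate_cost` and its parameter names are hypothetical.

```python
def estimate_cost(
    body_chars: int,
    n_chunks: int,
    style_chars: int,
    rate_per_million: float = 1.00,  # assumed rate, not a real price
) -> tuple[int, float]:
    """Return (billable_chars, cost_in_dollars) for one article."""
    billable = body_chars + n_chunks * style_chars
    return billable, billable / 1_000_000 * rate_per_million

# The 200-character instruction case: 4,000 body chars over 20 chunks
billable, cost = estimate_cost(body_chars=4000, n_chunks=20, style_chars=200)
```

In that case the billable text doubles to 8,000 characters, so the instruction costs as much as the article itself.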

Three places it tends to break: finish_reason, sample rate, seams

Once it runs in production, quality drops in the same predictable spots. Pre-empt them.

Always check finish_reason

A long chunk or awkward symbols can stop synthesis early and return audio with the tail missing. Confirm finish_reason is a clean stop, and if not, re-split shorter and retry — that prevents shipping a clipped file.

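def synthesize_safe(chunk_text: str, depth: int = 0) -> bytes:
    res = client.models.generate_content(
        model="gemini-3.1-flash-preview-tts",
        contents=f"{STYLE}\n\nBody: {chunk_text}",
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            # Same voice as synthesize(), so a retried chunk keeps the narrator
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Charon")
                )
            ),
        ),
    )
    cand = res.candidates[0]
    if str(cand.finish_reason) not in ("FinishReason.STOP", "STOP"):
        if depth < 1 and len(chunk_text) > 80:  # split in half once and retry
            # Cut at the space nearest the midpoint so no word is split
            mid = chunk_text.rfind(" ", 0, len(chunk_text) // 2 + 1)
            mid = mid if mid > 0 else len(chunk_text) // 2
            # Pass depth + 1 so each half is retried at most once
            return (synthesize_safe(chunk_text[:mid], depth + 1)
                    + synthesize_safe(chunk_text[mid:], depth + 1))
        raise RuntimeError(f"TTS abnormal end: {cand.finish_reason}")
    return cand.content.parts[0].inline_data.data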

Do not mistake the sample rate

If you assume 24kHz PCM is 44.1kHz when you write the WAV header, the voice plays high and fast. This is a classic trap that shows up with Live API audio too, and the first time it bit me I lost real time finding the cause. Lock the output at 24kHz and use the same value through joining and conversion.
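
A cheap guard is to read the header back after writing and compare it with the expected format. Here is a sketch using only the standard library. `wav_format` and `make_wav` are illustrative helpers, and `EXPECTED` restates the 24kHz, 16-bit, mono output described earlier.

```python
import io
import wave

EXPECTED = (24000, 2, 1)  # (sample rate, bytes per sample, channels)

def wav_format(wav_bytes: bytes) -> tuple[int, int, int]:
    """Read (sample rate, sample width, channels) back from a WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getframerate(), w.getsampwidth(), w.getnchannels()

def make_wav(pcm: bytes, rate: int) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container at the given rate."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(pcm)
    return buf.getvalue()
```

Asserting `wav_format(data) == EXPECTED` before conversion catches the mistake early. A header written at 44.1kHz would make 24kHz audio play roughly 1.8 times too fast.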

Keep the seam silence even

If chunks end with different amounts of trailing silence, a flat 0.25s gap still won't sound evenly spaced. If it bothers you, trim the tiny trailing silence from each chunk before adding the fixed gap. By ear, as long as I split on the period, trimming was not even necessary.
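
If you do want to trim, one simple approach is to walk back from the end of each chunk until a sample rises above a small amplitude threshold. This is a minimal sketch, assuming 16-bit little-endian mono PCM. `trim_trailing_silence` and its `threshold` default are hypothetical and should be tuned by ear.

```python
import array
import sys

def trim_trailing_silence(pcm: bytes, threshold: int = 300) -> bytes:
    """Drop trailing samples whose absolute amplitude is at or below threshold.

    Assumes 16-bit little-endian mono PCM; threshold is an illustrative
    value to tune by ear.
    """
    samples = array.array("h")
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])  # ignore a stray odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # PCM is little-endian regardless of host
    end = len(samples)
    while end > 0 and abs(samples[end - 1]) <= threshold:
        end -= 1
    return pcm[: end * 2]
```

Applied per chunk before joining, this makes the fixed gap the only silence at each seam. A threshold set too high will also clip the natural decay of the last word, so start low.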

Getting it onto the channel

Finally, convert the joined PCM into a distributable format. Most platforms accept MP3, so wrap it as WAV and convert.

import wave, subprocess
 
def pcm_to_wav(pcm: bytes, path: str):
    with wave.open(path, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit
        w.setframerate(24000)  # fixed 24kHz
        w.writeframes(pcm)
 
def wav_to_mp3(wav_path: str, mp3_path: str):
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_path, "-b:a", "128k", mp3_path],
        check=True,
    )

With that in place, an article's Markdown goes in and one distribution-ready MP3 comes out. As an indie developer I wire this into the publish hook, so once the body is final the audio is generated automatically. The Dolice Labs membership runs on Stripe, but I keep audio on the free path — another doorway for the same article to reach readers.
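
The whole flow can be sketched as one small function. To keep the example runnable without an API key, the three stages are passed in as callables. `narrate` is a hypothetical wrapper, not something the SDK provides.

```python
from typing import Callable

def narrate(
    text: str,
    split: Callable[[str], list[str]],
    synthesize: Callable[[str], bytes],
    join: Callable[[list[bytes]], bytes],
) -> bytes:
    """Run split -> synthesize -> join and return one PCM track.

    The three stages are passed in, so the flow can be exercised with
    stand-ins before any API call is made.
    """
    chunks = split(text)
    if not chunks:
        raise ValueError("nothing to narrate: the article body is empty")
    return join([synthesize(c) for c in chunks])
```

In the pipeline from this article the call would look like `narrate(normalize_reading(body), split_for_tts, synthesize_safe, join_pcm)`, with the result handed to pcm_to_wav and then wav_to_mp3.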

It is an unglamorous mechanism, but adding one thin way to spread your writing into another channel quietly widens its reach. Start by running a single article through split_for_tts and producing that fraction-of-a-cent narration. That one file is what turns audio from "something I'll do someday" into part of today's publishing flow.
