Gemini TTS API: Generate Expressive Voice with Style Controls

What is Gemini TTS?

Gemini TTS (Text-to-Speech) API is Google's official service that converts text into natural-sounding speech. Available on Gemini 2.5 Flash and Pro, it stands out by allowing you to specify styles and emotions, generating expressive audio rather than robotic speech.

How Gemini TTS Differs from Traditional TTS

Traditional text-to-speech services (such as the standard voices in Google Cloud TTS) offer a fixed set of voices whose delivery is adjusted through markup and numeric settings, which gives consistent quality but limited expressiveness. Gemini TTS, powered by generative AI, lets you control style and emotion through natural-language prompts, resulting in human-like, flexible voice output.

Common Use Cases

Podcast Production — Create multiple character voices
Video Narration — Auto-narrate YouTube and TikTok content
Game Development — Dynamically generate character dialogue
Accessibility — Audio content for visually impaired users
Language Learning — Pronunciation guides and practice

Core Specifications of Gemini TTS

Supported Models

Gemini 2.5 Flash — Fast, cost-effective (recommended)
Gemini 2.5 Pro — Higher-quality output

Available Voice Types

Gemini TTS offers 30 prebuilt voices, each listed in the documentation with a short style label. A few examples:

Kore: Firm
Puck: Upbeat
Zephyr: Bright
Charon: Informative
Fenrir: Excitable
Sulafat: Warm

Pricing

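Gemini TTS is billed per token rather than per character: text input tokens and audio output tokens are priced separately, and audio output accounts for most of the cost.

Gemini 2.5 Flash: the lower-cost option
Gemini 2.5 Pro: priced higher than Flash
Free Tier: rate-limited access (limits apply per minute and per day, not as a monthly request count)

Preview pricing changes, so check the Gemini API pricing page for current rates.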

API Setup and Basic Usage

Step 1: Obtain a Gemini API Key

Visit Google AI Studio and create an API key.

# Set environment variable
export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"

Step 2: Install Python Library

# The google-genai SDK is imported as `from google import genai`
pip install google-genai

Step 3: Minimal Code Example

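import os
import wave

from google import genai
from google.genai import types

# Read the key from the environment variable set in Step 1
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Convert text to speech; the TTS model variants return audio only
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Say cheerfully: Hello, world! This is a test of Gemini TTS.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# The response carries raw 16-bit PCM at 24 kHz, mono; wrap it in a WAV container
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(pcm)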

Controlling Voice Style and Tone

Using Prompts to Set Emotion

The true power of Gemini TTS lies in prompt-based control. You can specify detailed emotional and tonal characteristics for your generated speech.
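Style prompts in this section follow one layout: a scene, a tone, a speaking style, optional pacing, and the quoted text to speak. A small helper can assemble that layout so prompts stay consistent across a project. This is a minimal sketch; `build_style_prompt` and its parameter names are hypothetical and not part of any SDK.

```python
def build_style_prompt(scene, tone, speaking_style, text, pacing=None):
    """Assemble a style prompt in the scene / tone / style / text layout."""
    lines = [
        f"Generate speech for {scene}.",
        f"Tone: {tone}.",
        f"Speaking style: {speaking_style}.",
    ]
    if pacing:
        lines.append(f"Pacing: {pacing}.")
    # The text to be spoken goes last, quoted, so it is clearly
    # separated from the directions.
    lines.append(f'Text: "{text}"')
    return "\n".join(lines)


prompt = build_style_prompt(
    scene="a professional presentation",
    tone="Confident, authoritative, measured",
    speaking_style="Formal yet engaging, with natural pauses",
    text="Today, we'll explore the fundamental principles of machine learning.",
)
print(prompt)
```

The returned string can be passed as the prompt in any of the examples that follow.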

Example 1: Casual Friendly Conversation

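from google import genai
from google.genai import types

style_prompt = """
Generate speech for a friendly conversation between friends.
Tone: Warm, approachable, slightly enthusiastic.
Speaking style: Casual, conversational, like chatting over coffee.
Text: "Hey! I just finished this amazing article about AI.
You've got to check it out when you have time!"
"""

client = genai.Client()  # reads GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=style_prompt,
    config=types.GenerateContentConfig(response_modalities=["AUDIO"]),
)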

Example 2: Professional Presentation

style_prompt = """
Generate speech for a professional presentation.
Tone: Confident, authoritative, measured.
Speaking style: Formal yet engaging, with natural pauses.
Text: "Today, we'll explore the fundamental principles of machine learning,
and how they're transforming industries worldwide."
"""

Example 3: Dramatic Storytelling

style_prompt = """
Generate speech for dramatic storytelling.
Tone: Emotional, dramatic, engaging.
Speaking style: Expressive, with dynamic energy shifts.
Pacing: Variable - slower for emotional moments, faster for action.
Text: "The sun set over the horizon as she realized everything had changed.
In that moment, she understood what she had to do."
"""

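Fine-tuning Pace, Pitch, and Volume

The Gemini API does not expose numeric speed, pitch, or volume fields for its TTS models (settings such as speaking_rate, pitch, and volume_gain_db belong to the separate Cloud Text-to-Speech API). Describe these qualities in the prompt instead:

pace: "speak slowly", "at a brisk pace"
pitch: "in a lower register", "with a slightly higher pitch"
volume: "softly, almost whispering", "projecting to a large room"

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    # Pace and pitch are requested in natural language, not as numeric fields
    contents="Speak slowly, in a low, calm voice: Your narration text here",
    config=types.GenerateContentConfig(response_modalities=["AUDIO"]),
)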
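When an exact loudness change is needed, gain can also be applied to the audio samples after generation. The sketch below assumes the audio is 16-bit little-endian PCM (the sample layout a WAV file stores); `apply_gain_db` is a hypothetical helper, not an SDK function.

```python
import array
import sys


def apply_gain_db(pcm_bytes, gain_db):
    """Scale 16-bit little-endian PCM samples by gain_db decibels, with clipping."""
    samples = array.array("h")
    samples.frombytes(pcm_bytes)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian; convert to native order
    factor = 10 ** (gain_db / 20)  # decibels to linear amplitude ratio
    scaled = array.array(
        "h", (max(-32768, min(32767, round(s * factor))) for s in samples)
    )
    if sys.byteorder == "big":
        scaled.byteswap()  # back to little-endian for output
    return scaled.tobytes()
```

Positive values make the audio louder and negative values quieter; samples that would overflow the 16-bit range are clipped rather than wrapped.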

Multi-Speaker TTS

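For conversational content (podcasts, radio dramas), switching between speakers is essential. Gemini TTS can render up to two speakers in a single request when each speaker label in the script is mapped to a voice.

Setting Speaker Labels

from google import genai
from google.genai import types

multi_speaker_text = """
TTS the following conversation between Host and Guest:
Host: Welcome to AI Talk Podcast! Today's guest is an AI researcher.
Guest: Thank you for having me! I'm excited to share what we've learned.
Host: Let's dive in. What's the biggest misconception about AI?
Guest: People often think AI can reason like humans, but...
"""

def speaker(name, voice_name):
    # Map a speaker label used in the script to a prebuilt voice
    return types.SpeakerVoiceConfig(
        speaker=name,
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice_name)
        ),
    )

# One request renders both speakers; no separate mixing step is needed
client = genai.Client()  # reads GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=multi_speaker_text,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[speaker("Host", "Kore"), speaker("Guest", "Puck")]
            )
        ),
    ),
)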
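Before sending a script, it can help to check which speaker labels it actually contains, since the Gemini TTS documentation describes multi-speaker output for up to two speakers. The following is a rough sketch that parses "Name: line" scripts like the one above; `parse_turns` and `speakers_in_order` are hypothetical helpers, and the label pattern is an assumption about how the script is written.

```python
import re

# A turn starts with a label made of letters, digits, or spaces, then a colon
TURN_RE = re.compile(r"^(?P<speaker>[A-Za-z][\w ]*):\s*(?P<text>.+)$")


def parse_turns(script):
    """Return a list of (speaker, text) tuples from 'Name: line' formatted text."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        match = TURN_RE.match(line)
        if match:
            turns.append((match["speaker"], match["text"]))
        elif turns:
            # A line without a label continues the previous speaker's turn
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}")
    return turns


def speakers_in_order(turns):
    """List distinct speaker labels in order of first appearance."""
    seen = []
    for speaker, _ in turns:
        if speaker not in seen:
            seen.append(speaker)
    return seen
```

A script with more than two labels could then be split into separate two-speaker requests. Note that a continuation line that itself begins with a short phrase and a colon would be misread as a new speaker, so real scripts may need a stricter pattern.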

Practical Use Cases

Use Case 1: Podcast Production

Automatically narrate scripts with multiple speakers, then add background music and sound effects for complete podcast automation.

Use Case 2: Video Narration

Auto-narrate YouTube and TikTok content with professional-quality voices, removing the need to record and re-record voice-over for each script revision.

Use Case 3: Game Development

Generate character dialogue in real time, supporting dynamic storyline branching with natural voice acting.

Common Errors and Solutions

Error 1: "API quota exceeded"

Cause: Request quota or rate limit for your tier reached
Solution: Wait and retry with backoff, spread requests over time, or upgrade from the Free Tier to a paid plan
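Quota errors are often transient, so retrying after a growing delay is a common way to handle them. The sketch below is generic: `call_with_backoff` is a hypothetical helper, and the caller supplies `is_retryable` to decide which exceptions count as quota errors, because the exact exception type depends on the SDK version in use.

```python
import random
import time


def call_with_backoff(fn, is_retryable, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep):
    """Call fn(); on a retryable exception wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Exponential delay plus a little jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

The request itself would be wrapped in a zero-argument function or lambda and passed as `fn`. The `sleep` parameter exists only so the delay can be replaced in tests.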

Error 2: "Invalid audio format"

Cause: Raw PCM audio written straight to a .mp3 or .wav file, so players cannot read it
Solution: Wrap the PCM data (24 kHz, 16-bit, mono) in a WAV container, then convert to MP3 with ffmpeg if needed
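Unplayable output is usually raw audio that lacks a container header. A minimal sketch of adding one with the standard-library `wave` module follows; the defaults (24 kHz, mono, 16-bit) are assumptions about the returned audio and should be checked against the MIME type in the response. `pcm_to_wav_bytes` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buffer.getvalue()
```

The returned bytes can be written to a .wav file as-is, and a command such as `ffmpeg -i output.wav output.mp3` converts the result to MP3.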

Error 3: Long text generation failure

Cause: Input exceeds the model's context window (32k tokens for the TTS models), or a very long passage is cut off
Solution: Split text into chunks at sentence boundaries and make multiple API calls
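Splitting works best at sentence boundaries so that each chunk is spoken with natural intonation. The sketch below shows one way to do it; `split_into_chunks` is a hypothetical helper, and `max_chars` is a budget the caller chooses to stay under whatever limit applies.

```python
import re


def split_into_chunks(text, max_chars):
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split as a fallback
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then sent as its own request, and the resulting audio segments are concatenated in order.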

Wrapping up

Gemini TTS API is a powerful tool for converting text into human-sounding speech. Its key strengths are flexible prompt-based expression control, multi-speaker support, and affordable pricing. Whether you're creating podcasts, narrating videos, or developing games, Gemini TTS enables new possibilities for audio content creation.

Get your API key today and start with "Hello, world!"
