What is Gemini TTS?
The Gemini TTS (text-to-speech) API is Google's service for converting text into natural-sounding speech. Available on Gemini 2.5 Flash and Pro, it lets you specify style and emotion, so the output sounds expressive rather than robotic.
How Gemini TTS Differs from Traditional TTS
Traditional text-to-speech services (like Google Cloud TTS) render text with a fixed set of voices that are tuned mainly through markup and numeric settings, offering high quality but limited expressiveness. Gemini TTS, powered by generative AI, lets you control style and emotion through prompts, resulting in human-like, flexible voice output.
Common Use Cases
- Podcast Production: Create multiple character voices
- Video Narration: Auto-narrate YouTube and TikTok content
- Game Development: Dynamically generate character dialogue
- Accessibility: Audio content for visually impaired users
- Language Learning: Pronunciation guides and practice
Core Specifications of Gemini TTS
Supported Models
- Gemini 2.5 Flash: Fast, cost-effective (recommended)
- Gemini 2.5 Pro: Higher-quality output
Available Voice Types
Gemini TTS offers a set of prebuilt voices, each with its own character. A few examples:
- Kore: Firm
- Puck: Upbeat
- Charon: Informative
- Zephyr: Bright
- Fenrir: Excitable
- Sulafat: Warm
Pricing
- Gemini 2.5 Flash: $1 per 1 million characters (audio output)
- Gemini 2.5 Pro: $2 per 1 million characters
- Free Tier: Up to 1,000 requests/month
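As a rough illustration of how per-character rates like the ones above turn into a budget, the sketch below multiplies a character count by a rate per million characters. The function name `estimate_cost` is made up for this example, and the rates passed in are simply the figures listed above, so check current pricing before relying on them.

```python
def estimate_cost(num_chars: int, usd_per_million_chars: float) -> float:
    """Return the estimated cost in USD for synthesizing num_chars characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1_000_000 * usd_per_million_chars


# A 250,000-character script at the $1 and $2 rates listed above
script_chars = 250_000
print(f"Flash: ${estimate_cost(script_chars, 1.0):.2f}")  # Flash: $0.25
print(f"Pro:   ${estimate_cost(script_chars, 2.0):.2f}")  # Pro:   $0.50
```

Keeping the rate as a parameter means the same helper still works if the pricing changes.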
API Setup and Basic Usage
Step 1: Obtain a Gemini API Key
Visit Google AI Studio and create an API key.
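# Set environment variable
export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"
Step 2: Install Python Library
pip install google-genai
Step 3: Minimal Code Example
import os
import wave

from google import genai
from google.genai import types

# Use the key exported in Step 1 instead of hard-coding it
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Convert text to speech
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Say cheerfully: Hello, world! This is a test of Gemini TTS.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                # "Kore" is one of the prebuilt voice names
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# The response carries raw 16-bit PCM at 24 kHz; wrap it in a WAV container
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(pcm)
Controlling Voice Style and Tone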
Using Prompts to Set Emotion
The true power of Gemini TTS lies in prompt-based control. You can specify detailed emotional and tonal characteristics for your generated speech.
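One way to keep such prompts consistent is to assemble them from a few labeled fields. The sketch below is an illustration only: `build_style_prompt` and its parameters are names invented here, not part of any Gemini library, and the field labels (Tone, Speaking style, Pacing, Text) follow the layout used in this article's examples.

```python
from typing import Optional


def build_style_prompt(scene: str, tone: str, speaking_style: str, text: str,
                       pacing: Optional[str] = None) -> str:
    """Assemble a style prompt from labeled fields, one per line."""
    lines = [
        f"Generate speech for {scene}.",
        f"Tone: {tone}",
        f"Speaking style: {speaking_style}",
    ]
    if pacing:  # pacing is optional and omitted when not given
        lines.append(f"Pacing: {pacing}")
    lines.append(f'Text: "{text}"')
    return "\n".join(lines)


prompt = build_style_prompt(
    scene="a product demo",
    tone="Upbeat, clear.",
    speaking_style="Conversational, with short pauses between steps.",
    text="Welcome! Let's take a quick tour of the dashboard.",
)
print(prompt)
```

The resulting string can be passed as the prompt in place of a hand-written one, which makes it easier to vary one field at a time and compare the audio.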
Example 1: Casual Friendly Conversation
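from google import genai
from google.genai import types
style_prompt = """
Generate speech for a friendly conversation between friends.
Tone: Warm, approachable, slightly enthusiastic.
Speaking style: Casual, conversational, like chatting over coffee.
Text: "Hey! I just finished this amazing article about AI.
You've got to check it out when you have time!"
"""
# The client reads GEMINI_API_KEY from the environment
client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=style_prompt,
    config=types.GenerateContentConfig(response_modalities=["AUDIO"]),
)
Example 2: Professional Presentation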
style_prompt = """
Generate speech for a professional presentation.
Tone: Confident, authoritative, measured.
Speaking style: Formal yet engaging, with natural pauses.
Text: "Today, we'll explore the fundamental principles of machine learning,
and how they're transforming industries worldwide."
"""
Example 3: Dramatic Storytelling
style_prompt = """
Generate speech for dramatic storytelling.
Tone: Emotional, dramatic, engaging.
Speaking style: Expressive, with dynamic energy shifts.
Pacing: Variable - slower for emotional moments, faster for action.
Text: "The sun set over the horizon as she realized everything had changed.
In that moment, she understood what she had to do."
"""
Fine-tuning with Audio Parameters
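Pace, pitch, and volume can be tuned as well. The Gemini API has no numeric request fields for them, so describe the delivery you want in the prompt:
- Pace: from slow and deliberate to fast
- Pitch: from low to high
- Volume: from quiet to loud
from google import genai
from google.genai import types

narration = "Your narration text here"
# Describe the delivery in words, then give the text to read
prompt = f"Read at a relaxed pace, in a slightly lower pitch, at moderate volume: {narration}"
client = genai.Client()  # reads GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=prompt,
    config=types.GenerateContentConfig(response_modalities=["AUDIO"]),
)
Multi-Speaker TTS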
For conversational content (podcasts, radio dramas), switching between speakers is essential.
Setting Speaker Labels
multi_speaker_text = """
Host: Welcome to AI Talk Podcast! Today's guest is an AI researcher.
Guest: Thank you for having me! I'm excited to share what we've learned.
Host: Let's dive in. What's the biggest misconception about AI?
Guest: People often think AI can reason like humans, but...
"""
# generate_tts and merge_audio are placeholder helpers to implement yourself,
# not library functions
# Generate audio for each speaker separately
host_audio = generate_tts(multi_speaker_text, speaker="host", style="warm")
guest_audio = generate_tts(multi_speaker_text, speaker="guest", style="authoritative")
# Mix audio tracks
merge_audio(host_audio, guest_audio, output_file="podcast.mp3")
Practical Use Cases
Use Case 1: Podcast Production
Automatically narrate scripts with multiple speakers, then add background music and sound effects for complete podcast automation.
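For scripts with multiple speakers, a useful first step is turning the labeled script into ordered turns, so each line can be voiced with the right speaker and stitched back together in sequence. The sketch below assumes every non-blank line starts with a `Name:` label, as in the Host and Guest script shown earlier; `parse_turns` is a hypothetical helper, not a library function.

```python
def parse_turns(script: str) -> list[tuple[str, str]]:
    """Split a 'Speaker: text' script into ordered (speaker, text) turns."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first colon only, so colons in the spoken text survive
        speaker, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"Missing speaker label: {line!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


script = """
Host: Welcome to AI Talk Podcast!
Guest: Thank you for having me!
Host: Let's dive in.
"""
turns = parse_turns(script)
speakers = sorted({speaker for speaker, _ in turns})
print(speakers)  # ['Guest', 'Host']
print(turns[0])  # ('Host', 'Welcome to AI Talk Podcast!')
```

Keeping the turns in order, rather than grouping them by speaker, preserves the information needed to reassemble the conversation after each line has been synthesized.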
Use Case 2: Video Narration
Auto-narrate YouTube and TikTok content with professional-quality voices, which can cut video production time substantially.
Use Case 3: Game Development
Generate character dialogue in real time, supporting dynamic storyline branching with natural voice acting.
Common Errors and Solutions
Error 1: "API quota exceeded"
Cause: Monthly request limit reached
Solution: Upgrade from Free Tier to a paid plan, or distribute usage across months
Error 2: "Invalid audio format"
Cause: Output format not MP3/WAV, or incorrect bitrate
Solution: Use PCM or MP3 (128 kbps or higher); convert with ffmpeg if needed
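If the audio arrives as raw PCM, one option is to hand it to ffmpeg. The sketch below only builds and runs the command line. It assumes ffmpeg is installed and that the input is headerless 16-bit little-endian mono PCM at 24 kHz, which you should confirm for your own output. `pcm_to_mp3_command` and `convert` are hypothetical helper names.

```python
import subprocess


def pcm_to_mp3_command(src: str, dst: str, sample_rate: int = 24000,
                       channels: int = 1, bitrate: str = "128k") -> list[str]:
    """Build an ffmpeg command that encodes raw 16-bit PCM to MP3."""
    return [
        "ffmpeg", "-y",           # overwrite dst if it already exists
        "-f", "s16le",            # input is headerless signed 16-bit little-endian
        "-ar", str(sample_rate),  # input sample rate in Hz
        "-ac", str(channels),     # input channel count
        "-i", src,
        "-b:a", bitrate,          # target audio bitrate
        dst,
    ]


def convert(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(pcm_to_mp3_command(src, dst), check=True)


print(" ".join(pcm_to_mp3_command("output.pcm", "output.mp3")))
# ffmpeg -y -f s16le -ar 24000 -ac 1 -i output.pcm -b:a 128k output.mp3
```

Because raw PCM has no header, the format, sample rate, and channel count must be stated before `-i`; if they do not match the data, the result will play at the wrong speed or as noise.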
Error 3: Long text generation failure
Cause: Single request exceeds 5,000 characters
Solution: Split text into chunks and make multiple API calls
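The chunking step can be sketched as follows. This is a minimal illustration, assuming the 5,000-character limit mentioned above and that breaking at sentence boundaries is acceptable; `split_text` is a made-up helper name, and a sentence longer than the limit is simply cut at the limit.

```python
import re


def split_text(text: str, max_chars: int = 5000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself too long
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", max_chars=10))  # ['One. Two.', 'Three.']
```

Each chunk can then be sent as its own request and the resulting audio segments joined in order. Splitting at sentence ends keeps the joins at natural pauses.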
Wrapping up
Gemini TTS API is a powerful tool for converting text into human-sounding speech. Its key strengths are flexible prompt-based expression control, multi-speaker support, and affordable pricing. Whether you're creating podcasts, narrating videos, or developing games, Gemini TTS enables new possibilities for audio content creation.
Get your API key today and start with "Hello, world!"