Setup and context — The Era of Emotion-Aware Voice Applications
With the arrival of Gemini 3.1 Flash Live, building voice applications that understand and respond to user emotions has become a practical reality. Whereas earlier voice AI mostly analyzed the transcribed text of speech, Flash Live estimates emotion from acoustic features such as pitch, speaking rate, and intonation.
Prerequisites and Setup
Requirements
- Python 3.11 or later
- Google AI Studio API key (with Gemini Live API preview access)
- pyaudio or sounddevice (for microphone input)
- WebSocket-compatible library
Installation
# Install required libraries
pip install google-generativeai pyaudio websockets numpy
# Set environment variable
export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"
Understanding the Live API Architecture
Communication Model
The Gemini Live API uses WebSocket-based bidirectional real-time communication. Unlike HTTP request/response, you can stream audio data continuously while simultaneously receiving responses.
[Client]  ←── WebSocket ──→  [Gemini Live API]
   │                              │
   ├─ Send audio chunk ─────────→ │
   │                              ├─ Real-time analysis
   ├─ Send audio chunk ─────────→ │
   │                              ├─ Emotion estimation + transcription
   │ ←── Response (audio+text) ───┤
   │                              │
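The full-duplex pattern in the diagram can be tried out without any network access. The sketch below is a local simulation, not the Live API: `fake_server`, `send_chunks`, `receive_replies`, and `run_duplex` are hypothetical names, and a pair of `asyncio.Queue` objects stands in for the WebSocket connection. It only illustrates that sending and receiving run concurrently instead of alternating as in HTTP request/response.

```python
import asyncio


async def fake_server(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Hypothetical stand-in for the remote end: replies to each chunk as it arrives."""
    while True:
        chunk = await inbox.get()
        if chunk is None:  # sentinel: the client has finished sending
            await outbox.put(None)
            return
        await outbox.put(f"ack:{len(chunk)} bytes")


async def send_chunks(inbox: asyncio.Queue, chunks: list) -> None:
    """Client uplink: stream chunks without waiting for replies."""
    for chunk in chunks:
        await inbox.put(chunk)
        await asyncio.sleep(0)  # yield control so the other tasks can run
    await inbox.put(None)


async def receive_replies(outbox: asyncio.Queue) -> list:
    """Client downlink: collect replies while sending may still be in progress."""
    replies = []
    while True:
        reply = await outbox.get()
        if reply is None:
            return replies
        replies.append(reply)


async def run_duplex(chunks: list) -> list:
    """Run uplink, downlink, and the simulated server concurrently."""
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    server = asyncio.create_task(fake_server(inbox, outbox))
    _, replies = await asyncio.gather(
        send_chunks(inbox, chunks), receive_replies(outbox)
    )
    await server
    return replies


if __name__ == "__main__":
    print(asyncio.run(run_duplex([b"\x00" * 2048, b"\x00" * 1024])))
```

In a real client the two queues would be replaced by the send and receive sides of one WebSocket connection; the task structure stays the same.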
Session Lifecycle
A Live API session moves through four stages:
- Connect: Establish WebSocket connection, send model and session configuration
- Setup: Configure audio format, response modalities, voice presets
- Stream: Continuously send and receive audio chunks
- Disconnect: End session and release resources
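One way to keep application code honest about this lifecycle is a small state machine that rejects out-of-order calls, such as streaming before setup. The sketch below is an illustration only: `SessionState`, `ALLOWED`, and `SessionLifecycle` are hypothetical names and are not part of any Gemini SDK.

```python
from enum import Enum, auto


class SessionState(Enum):
    IDLE = auto()
    CONNECTED = auto()
    CONFIGURED = auto()
    STREAMING = auto()
    CLOSED = auto()


# Allowed transitions, mirroring Connect -> Setup -> Stream -> Disconnect.
# Disconnecting early (before streaming starts) is also permitted.
ALLOWED = {
    SessionState.IDLE: {SessionState.CONNECTED},
    SessionState.CONNECTED: {SessionState.CONFIGURED, SessionState.CLOSED},
    SessionState.CONFIGURED: {SessionState.STREAMING, SessionState.CLOSED},
    SessionState.STREAMING: {SessionState.CLOSED},
    SessionState.CLOSED: set(),
}


class SessionLifecycle:
    """Tracks the current stage and refuses illegal transitions."""

    def __init__(self) -> None:
        self.state = SessionState.IDLE

    def advance(self, new_state: SessionState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise RuntimeError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
```

Calling `advance(SessionState.STREAMING)` on a fresh `SessionLifecycle` raises `RuntimeError`, which surfaces ordering bugs early instead of as obscure protocol errors.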
Core Implementation: Real-Time Voice Conversation
Step 1: Establishing a Session
import google.generativeai as genai
import os
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-3.1-flash-live")
# Session configuration
session_config = {
    "response_modalities": ["AUDIO", "TEXT"],
    "speech_config": {
        "voice_config": {
            "prebuilt_voice_config": {
                "voice_name": "Aoede"
            }
        }
    },
    "system_instruction": """
        You are an emotionally aware conversational assistant.
        Estimate the user's emotions from their vocal tone and
        speaking style, then adjust your response accordingly.
        If they sound rushed, be concise and reassuring.
        If they sound down, show empathy before providing information.
    """
}
session = model.start_live_session(config=session_config)
print("Session established")
# Output: Session established
Step 2: Audio Input/Output Handling
import pyaudio
import numpy as np
import threading
# Audio configuration
RATE = 16000 # Sample rate (16kHz)
CHUNK = 1024 # Chunk size
FORMAT = pyaudio.paInt16
CHANNELS = 1
audio = pyaudio.PyAudio()
# Microphone input stream
input_stream = audio.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK
)
# Speaker output stream
output_stream = audio.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    output=True,
    frames_per_buffer=CHUNK
)
def send_audio(session):
    """Send microphone audio to the Live API"""
    while True:
        data = input_stream.read(CHUNK, exception_on_overflow=False)
        audio_array = np.frombuffer(data, dtype=np.int16)
        # Silence detection (skip below threshold)
        if np.abs(audio_array).mean() < 100:
            continue
        session.send_audio(data)
def receive_response(session):
    """Receive and play responses from the Live API"""
    for response in session.receive():
        if response.audio:
            output_stream.write(response.audio)
        if response.text:
            print(f"[Gemini] {response.text}")
# Run send/receive on separate threads
send_thread = threading.Thread(target=send_audio, args=(session,))
recv_thread = threading.Thread(target=receive_response, args=(session,))
send_thread.start()
recv_thread.start()
Emotion Recognition Layer Design Patterns
Leveraging Flash Live's emotion recognition requires careful design at two layers: system prompt design and response control.
Pattern 1: Prompt-Driven
The simplest approach. Encode emotion-handling rules in the system prompt and let the model's internal judgment handle the rest.
EMOTION_AWARE_PROMPT = """
You are a highly empathetic conversational assistant.
Follow these rules:
1. If the user sounds rushed or anxious:
   - Respond in a calm tone
   - Deliver key points concisely
   - Use reassuring language like "No worries, let me help"
2. If the user sounds sad or discouraged:
   - Show empathy before providing information
   - Use accepting language like "I understand that's frustrating"
   - Present solutions as options, not directives
3. If the user sounds happy or enthusiastic:
   - Match their energy with an upbeat response
   - Feel free to include light humor
4. If the user sounds angry or frustrated:
   - Lead with acknowledgment and empathy
   - Focus on problem-solving with concrete steps
   - Avoid defensiveness; offer actionable solutions
"""
Pattern 2: Metadata Extraction
Have the model return structured emotion metadata, giving your application layer fine-grained control over response logic.
METADATA_EXTRACTION_PROMPT = """
For each user voice input, return two things:
1. JSON emotion metadata, on a single line at the very start of your text response, with these fields:
   {
     "emotion": "happy|sad|angry|anxious|neutral",
     "confidence": 0.0-1.0,
     "energy_level": "high|medium|low"
   }
2. Then your normal response text
Example:
{"emotion": "anxious", "confidence": 0.8, "energy_level": "high"}
You seem to be in a hurry. Let me get right to your question.
"""
import json
def process_response_with_emotion(response_text):
    """Extract emotion metadata from response text"""
    # The first line is expected to hold the JSON metadata; the rest is the reply
    lines = response_text.strip().split("\n", 1)
    try:
        emotion_data = json.loads(lines[0])
        reply_text = lines[1] if len(lines) > 1 else ""
        return emotion_data, reply_text
    except json.JSONDecodeError:
        # No parseable metadata: fall back to neutral and keep the full text
        return {"emotion": "neutral", "confidence": 0.5}, response_text
# Usage example
emotion, reply = process_response_with_emotion(response.text)
print(f"Detected emotion: {emotion['emotion']} ({emotion['confidence']:.0%})")
print(f"Reply: {reply}")
# Example output:
# Detected emotion: anxious (80%)
# Reply: You seem to be in a hurry. Let me get right to your question.
Pattern 3: Multi-Turn Emotion Tracking
Track emotion transitions across the entire conversation and adjust responses based on long-term patterns.
from collections import deque
from datetime import datetime
class EmotionTracker:
    """Tracks emotion transitions during a conversation"""
    def __init__(self, window_size=10):
        self.history = deque(maxlen=window_size)
    def add(self, emotion_data):
        self.history.append({
            "timestamp": datetime.now().isoformat(),
            **emotion_data
        })
    def get_trend(self):
        """Analyze emotion trend"""
        if len(self.history) < 3:
            return "insufficient_data"
        recent = list(self.history)[-3:]
        emotions = [e["emotion"] for e in recent]
        # Consistently negative: may need special handling
        if all(e in ("sad", "angry", "anxious") for e in emotions):
            return "persistent_negative"
        # Improving trajectory
        if emotions[-1] in ("happy", "neutral") and emotions[0] in ("sad", "angry"):
            return "improving"
        return "stable"
    def should_escalate(self):
        """Determine if escalation is needed"""
        trend = self.get_trend()
        return trend == "persistent_negative"
# Usage example
tracker = EmotionTracker()
tracker.add({"emotion": "anxious", "confidence": 0.8, "energy_level": "high"})
tracker.add({"emotion": "angry", "confidence": 0.7, "energy_level": "high"})
tracker.add({"emotion": "angry", "confidence": 0.9, "energy_level": "high"})
if tracker.should_escalate():
    print("Escalation recommended: persistent negative emotion detected")
# Output: Escalation recommended: persistent negative emotion detected
Use Case Architectures
Use Case 1: Customer Support
[Customer voice] → [Flash Live API]
                         │
                         ├── Emotion analysis → [EmotionTracker]
                         │                             │
                         │                     Escalation check
                         │                             │
                         │                             ├─ Normal → AI responds
                         │                             └─ Escalate → Transfer to human agent
                         │
                         └── Transcription → [Knowledge base search]
                                                        │
                                     [Answer generation + speech synthesis]
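The escalation branch in this diagram can be reduced to a small routing function. The sketch below assumes the trend labels produced by the `EmotionTracker` from Pattern 3; the function name `route_turn`, the return labels, and the `min_confidence` threshold of 0.6 are hypothetical choices for illustration, not values from any API.

```python
def route_turn(trend: str, latest_confidence: float, min_confidence: float = 0.6) -> str:
    """Decide who handles the current turn.

    trend: a label such as "persistent_negative", "improving", or "stable".
    latest_confidence: the model-reported confidence for the latest emotion.
    min_confidence: assumed cutoff below which the estimate is not trusted
        enough to trigger a handoff.
    """
    if trend == "persistent_negative" and latest_confidence >= min_confidence:
        return "human_agent"
    return "ai_response"


if __name__ == "__main__":
    print(route_turn("persistent_negative", 0.9))
    print(route_turn("stable", 0.9))
```

Gating on confidence is a design choice: it avoids transferring a customer on the strength of a low-confidence estimate, at the cost of occasionally escalating one turn later.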
Use Case 2: Education App
[Learner's voice] → [Flash Live API]
                           │
                           ├── Comprehension estimation (from vocal tone)
                           │        │
                           │        ├─ Understanding → Advance to next topic
                           │        ├─ Confused → Try different explanation approach
                           │        └─ Bored → Switch to interactive exercises
                           │
                           └── Question analysis → [Course material database]
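A single noisy estimate should probably not change the lesson plan. As a sketch of one possible safeguard, the router below only acts once the same comprehension state has been seen on consecutive turns. The class name `ComprehensionRouter`, the state and action labels, and the default of two repeats are all assumptions made for this example.

```python
from typing import Optional

# Hypothetical mapping from estimated state to teaching action,
# following the three branches in the diagram.
ACTIONS = {
    "understanding": "advance_to_next_topic",
    "confused": "try_different_explanation",
    "bored": "switch_to_interactive_exercises",
}


class ComprehensionRouter:
    """Emits an action only after a state repeats on consecutive turns."""

    def __init__(self, required_repeats: int = 2) -> None:
        self.required_repeats = required_repeats
        self._last: Optional[str] = None
        self._count = 0

    def update(self, state: str) -> Optional[str]:
        """Record the latest estimate; return an action or None (keep going)."""
        if state == self._last:
            self._count += 1
        else:
            self._last, self._count = state, 1
        if self._count >= self.required_repeats:
            return ACTIONS.get(state)
        return None
```

With the default settings, the first "confused" estimate returns `None` and the second returns the re-explanation action, so a one-off misreading never interrupts the lesson.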
Use Case 3: Healthcare (Wellness Check)
[User's voice] → [Flash Live API]
                        │
                        ├── Emotion trend analysis
                        │        │
                        │        ├─ Stable → Continue normal conversation
                        │        ├─ Improving → Provide positive feedback
                        │        └─ Persistent negative → Suggest professional consultation
                        │
                        └── Daily report generation → [Dashboard]
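The daily report step can be as simple as aggregating the per-turn emotion records into counts. The following sketch assumes records shaped like the metadata from Pattern 2 (a dict with an "emotion" key); the function name `daily_report` and the report fields are hypothetical, and only aggregates are kept so that no raw audio or transcript reaches the dashboard.

```python
from collections import Counter

NEGATIVE_EMOTIONS = ("sad", "angry", "anxious")


def daily_report(records: list) -> dict:
    """Summarize one day's emotion records into aggregate statistics."""
    counts = Counter(record["emotion"] for record in records)
    total = sum(counts.values())
    negative = sum(counts[emotion] for emotion in NEGATIVE_EMOTIONS)
    return {
        "turns": total,
        "emotion_counts": dict(counts),
        # Share of turns labeled with a negative emotion (0.0 for an empty day)
        "negative_ratio": round(negative / total, 2) if total else 0.0,
    }


if __name__ == "__main__":
    sample = [
        {"emotion": "anxious"},
        {"emotion": "sad"},
        {"emotion": "neutral"},
        {"emotion": "sad"},
    ]
    print(daily_report(sample))
```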
Performance Optimization
Minimizing Latency
For a natural conversational experience, aim to keep end-to-end latency below 200ms.
- Audio chunk size: Send small chunks (512–1024 samples) for responsive recognition
- Buffering strategy: Minimize input buffering; absorb network jitter with output buffering
- Region selection: Choose the API endpoint geographically closest to your users
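Chunk size translates directly into latency, because a chunk cannot be sent until it has been fully captured. The sketch below works through that arithmetic for the 16 kHz sample rate used earlier. The helper names are hypothetical, and the round-trip and output-buffer figures in the usage example are placeholders for illustration, not measurements.

```python
def chunk_duration_ms(samples: int, sample_rate: int = 16000) -> float:
    """Time needed to capture one chunk, in milliseconds."""
    return samples * 1000 / sample_rate


def remaining_budget_ms(
    chunk_samples: int,
    network_rtt_ms: float,
    output_buffer_ms: float,
    target_ms: float = 200.0,
    sample_rate: int = 16000,
) -> float:
    """Latency left for server-side processing after fixed client-side costs."""
    capture_ms = chunk_duration_ms(chunk_samples, sample_rate)
    return target_ms - capture_ms - network_rtt_ms - output_buffer_ms


if __name__ == "__main__":
    print(chunk_duration_ms(512))   # 32.0 ms per chunk
    print(chunk_duration_ms(1024))  # 64.0 ms per chunk
    # Placeholder network and buffer figures, for illustration only
    print(remaining_budget_ms(1024, network_rtt_ms=60.0, output_buffer_ms=40.0))
```

At 16 kHz, 512 samples take 32 ms to capture and 1024 samples take 64 ms, so the larger chunk alone consumes roughly a third of a 200 ms budget.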
Ensuring Audio Quality
# Noise gate: Remove ambient noise
def apply_noise_gate(audio_chunk, threshold=200):
    """Replace sub-threshold audio with silence"""
    # np.frombuffer returns a read-only view of the bytes, so copy before editing
    audio_array = np.frombuffer(audio_chunk, dtype=np.int16).copy()
    # Widen to int32 so abs(-32768) does not overflow int16
    mask = np.abs(audio_array.astype(np.int32)) < threshold
    audio_array[mask] = 0
    return audio_array.tobytes()
# Volume normalization: Keep input levels consistent
def normalize_volume(audio_chunk, target_rms=3000):
    """Normalize volume based on RMS value"""
    audio_array = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float32)
    current_rms = np.sqrt(np.mean(audio_array ** 2))
    if current_rms > 0:
        gain = target_rms / current_rms
        # Clip to the int16 range to avoid wrap-around after scaling
        audio_array = np.clip(audio_array * gain, -32768, 32767)
    return audio_array.astype(np.int16).tobytes()
Security and Privacy Considerations
Voice data is among the most sensitive categories of personal information. Keep these points in mind during implementation.
- Voice data storage: Design the system so that raw audio is never persisted server-side; process it as a stream and discard it
- Emotion data handling: Anonymize emotion analysis results and avoid linking them to individual identities
- User disclosure: Clearly inform users that emotion analysis is active and obtain consent
- Data encryption: Use TLS (wss://) for all WebSocket communication
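Two of these points lend themselves to small guard functions. The sketch below uses only the Python standard library; the helper names (`pseudonymize`, `require_wss`, `to_storable_record`) and the record layout are hypothetical. Note that a keyed hash gives pseudonymization, which is weaker than full anonymization: anyone holding the key can recompute the mapping, so the key needs the same protection as the identifiers themselves.

```python
import hashlib
import hmac
from urllib.parse import urlparse


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Replace a user identifier with a keyed SHA-256 digest."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def require_wss(url: str) -> str:
    """Reject WebSocket URLs that are not TLS-protected."""
    if urlparse(url).scheme != "wss":
        raise ValueError(f"refusing non-TLS WebSocket URL: {url}")
    return url


def to_storable_record(user_id: str, emotion_data: dict, secret_key: bytes) -> dict:
    """Keep only the fields needed for analysis, keyed by a pseudonym."""
    return {
        "subject": pseudonymize(user_id, secret_key),
        "emotion": emotion_data["emotion"],
        "confidence": emotion_data["confidence"],
    }


if __name__ == "__main__":
    key = b"example-key-loaded-from-a-secret-store"  # placeholder value
    record = to_storable_record(
        "user-123", {"emotion": "anxious", "confidence": 0.8}, key
    )
    print(record["subject"][:12], record["emotion"])
    print(require_wss("wss://example.invalid/live"))
```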
Summary
The Gemini 3.1 Flash Live API adds a new dimension to voice application development. Combining acoustic emotion understanding, 128,000-token context retention, and real-time streaming enables emotion-responsive voice applications across customer support, education, healthcare, and beyond.