Building Emotion-Aware Voice Apps with the Gemini Live API — A

Setup and context — The Era of Emotion-Aware Voice Applications

With the arrival of Gemini 3.1 Flash Live, building voice applications that understand and respond to user emotions has become a practical reality. Earlier voice AI mostly analyzed the transcribed text of what was said, whereas Flash Live estimates emotions from acoustic features such as pitch, speaking rate, and intonation.

Prerequisites and Setup

Requirements

Python 3.11 or later
Google AI Studio API key (with Gemini Live API preview access)
pyaudio or sounddevice (for microphone input)
WebSocket-compatible library

Installation

# Install required libraries
pip install google-generativeai pyaudio websockets numpy
 
# Set environment variable
export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"

Understanding the Live API Architecture

Communication Model

The Gemini Live API uses WebSocket-based bidirectional real-time communication. Unlike HTTP request/response, you can stream audio data continuously while simultaneously receiving responses.

[Client] ←WebSocket→ [Gemini Live API]
    │                           │
    ├─ Send audio chunk ──────→ │
    │                           ├─ Real-time analysis
    ├─ Send audio chunk ──────→ │
    │                           ├─ Emotion estimation + transcription
    │ ←───── Response (audio+text) ┤
    │                           │
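The difference from request/response can be seen in a minimal sketch that needs no network. Two asyncio queues stand in for the two directions of the WebSocket, and `fake_live_endpoint` is a hypothetical stand-in for the server, not part of any SDK. Sending and receiving run as separate tasks, so a reply can arrive while later chunks are still being sent.

```python
import asyncio

async def fake_live_endpoint(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Hypothetical server stand-in: acknowledges each chunk as it arrives."""
    while True:
        chunk = await inbound.get()
        if chunk is None:              # sentinel: the client closed its stream
            await outbound.put(None)
            return
        await outbound.put(f"ack:{len(chunk)} bytes")

async def send_chunks(inbound: asyncio.Queue, chunks: list[bytes]) -> None:
    """Client upload side: pushes chunks without waiting for replies."""
    for chunk in chunks:
        await inbound.put(chunk)
        await asyncio.sleep(0)         # yield so the other tasks can run
    await inbound.put(None)

async def receive_replies(outbound: asyncio.Queue) -> list[str]:
    """Client download side: collects replies until the stream ends."""
    replies = []
    while (reply := await outbound.get()) is not None:
        replies.append(reply)
    return replies

async def run_demo() -> list[str]:
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    chunks = [b"\x00" * 2048, b"\x00" * 2048, b"\x00" * 1024]
    server = asyncio.create_task(fake_live_endpoint(inbound, outbound))
    # Send and receive concurrently, as on a bidirectional connection
    _, replies = await asyncio.gather(
        send_chunks(inbound, chunks), receive_replies(outbound)
    )
    await server
    return replies

replies = asyncio.run(run_demo())
print(replies)
# ['ack:2048 bytes', 'ack:2048 bytes', 'ack:1024 bytes']
```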

Session Lifecycle

A Live API session moves through four stages:

Connect: Establish WebSocket connection, send model and session configuration
Setup: Configure audio format, response modalities, voice presets
Stream: Continuously send and receive audio chunks
Disconnect: End session and release resources
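The four stages can be modeled as a small state machine that rejects out-of-order calls, such as streaming before setup. This is only a sketch: `Stage`, `NEXT_STAGES`, and `SessionLifecycle` are hypothetical names for illustration and do not correspond to SDK classes. The context manager guarantees that the Disconnect stage is reached even if an error interrupts streaming.

```python
from enum import Enum, auto

class Stage(Enum):
    CONNECT = auto()
    SETUP = auto()
    STREAM = auto()
    DISCONNECT = auto()

# Allowed transitions; disconnecting is possible from any earlier stage
NEXT_STAGES = {
    Stage.CONNECT: {Stage.SETUP, Stage.DISCONNECT},
    Stage.SETUP: {Stage.STREAM, Stage.DISCONNECT},
    Stage.STREAM: {Stage.DISCONNECT},
    Stage.DISCONNECT: set(),
}

class SessionLifecycle:
    """Tracks the current stage and records every transition."""

    def __init__(self):
        self.stage = Stage.CONNECT
        self.log = [Stage.CONNECT.name]

    def advance(self, target: Stage) -> None:
        if target not in NEXT_STAGES[self.stage]:
            raise RuntimeError(
                f"cannot go from {self.stage.name} to {target.name}"
            )
        self.stage = target
        self.log.append(target.name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always end in DISCONNECT so resources are released
        if self.stage is not Stage.DISCONNECT:
            self.advance(Stage.DISCONNECT)
        return False  # do not swallow exceptions

with SessionLifecycle() as lifecycle:
    lifecycle.advance(Stage.SETUP)
    lifecycle.advance(Stage.STREAM)

print(lifecycle.log)
# ['CONNECT', 'SETUP', 'STREAM', 'DISCONNECT']
```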

Core Implementation: Real-Time Voice Conversation

Step 1: Establishing a Session

import google.generativeai as genai
import os
 
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
 
model = genai.GenerativeModel("gemini-3.1-flash-live")
 
# Session configuration
session_config = {
    "response_modalities": ["AUDIO", "TEXT"],
    "speech_config": {
        "voice_config": {
            "prebuilt_voice_config": {
                "voice_name": "Aoede"
            }
        }
    },
    "system_instruction": """
    You are an emotionally aware conversational assistant.
    Estimate the user's emotions from their vocal tone and
    speaking style, then adjust your response accordingly.
    If they sound rushed, be concise and reassuring.
    If they sound down, show empathy before providing information.
    """
}
 
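# Entry point shown as used throughout this guide; method names differ between
# SDK releases, so confirm against the reference for your installed version.
session = model.start_live_session(config=session_config)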
print("Session established")
# Output: Session established

Step 2: Audio Input/Output Handling

import pyaudio
import numpy as np
import threading
 
# Audio configuration
RATE = 16000       # Sample rate (16kHz)
CHUNK = 1024       # Chunk size
FORMAT = pyaudio.paInt16
CHANNELS = 1
 
audio = pyaudio.PyAudio()
 
# Microphone input stream
input_stream = audio.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK
)
 
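# Speaker output stream
# The Live API documentation specifies 24 kHz PCM for audio output, while
# input is 16 kHz. Playing 24 kHz audio at 16 kHz sounds slow and low-pitched.
# Verify the output rate for the model you use.
OUTPUT_RATE = 24000
output_stream = audio.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=OUTPUT_RATE,
    output=True,
    frames_per_buffer=CHUNK
)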
 
def send_audio(session):
    """Send microphone audio to the Live API"""
    while True:
        data = input_stream.read(CHUNK, exception_on_overflow=False)
        audio_array = np.frombuffer(data, dtype=np.int16)
 
        # Silence detection (skip below threshold)
        if np.abs(audio_array).mean() < 100:
            continue
 
        session.send_audio(data)
 
def receive_response(session):
    """Receive and play responses from the Live API"""
    for response in session.receive():
        if response.audio:
            output_stream.write(response.audio)
        if response.text:
            print(f"[Gemini] {response.text}")
 
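# Run send/receive on separate threads.
# daemon=True lets Ctrl+C end the program even though send_audio loops forever.
send_thread = threading.Thread(target=send_audio, args=(session,), daemon=True)
recv_thread = threading.Thread(target=receive_response, args=(session,), daemon=True)
send_thread.start()
recv_thread.start()

try:
    recv_thread.join()  # Block until the response stream ends
except KeyboardInterrupt:
    pass
finally:
    # Release the audio devices
    input_stream.close()
    output_stream.close()
    audio.terminate()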
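The silence check in send_audio drops every chunk whose level falls below the threshold, which can clip soft word endings and short pauses inside a sentence. One common remedy is a "hangover": keep sending for a few chunks after the last loud one. The sketch below assumes 16-bit mono PCM as configured above. `HangoverGate` and `mean_abs_level` are hypothetical helper names, and the threshold and hangover length are illustrative values to tune per microphone.

```python
import numpy as np

def mean_abs_level(chunk: bytes) -> float:
    """Mean absolute amplitude of a 16-bit PCM chunk."""
    # Widen to int32 first: abs(-32768) overflows in int16
    samples = np.frombuffer(chunk, dtype=np.int16).astype(np.int32)
    return float(np.abs(samples).mean()) if samples.size else 0.0

class HangoverGate:
    """Passes loud chunks plus a few trailing quiet ones."""

    def __init__(self, threshold: float = 100.0, hangover_chunks: int = 5):
        self.threshold = threshold
        self.hangover_chunks = hangover_chunks
        self._remaining = 0

    def should_send(self, chunk: bytes) -> bool:
        if mean_abs_level(chunk) >= self.threshold:
            self._remaining = self.hangover_chunks  # reset the tail counter
            return True
        if self._remaining > 0:
            self._remaining -= 1                    # still inside the tail
            return True
        return False

# Synthetic chunks: constant amplitude 1000 versus digital silence
loud = (np.ones(1024, dtype=np.int16) * 1000).tobytes()
quiet = np.zeros(1024, dtype=np.int16).tobytes()

gate = HangoverGate(threshold=100, hangover_chunks=2)
decisions = [gate.should_send(c) for c in [quiet, loud, quiet, quiet, quiet]]
print(decisions)
# [False, True, True, True, False]
```

Inside send_audio, the check would become `if not gate.should_send(data): continue`.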

Emotion Recognition Layer Design Patterns

Leveraging Flash Live's emotion recognition requires careful design at two layers: system prompt design and response control.

Pattern 1: Prompt-Driven

The simplest approach. Encode emotion-handling rules in the system prompt and let the model's internal judgment handle the rest.

EMOTION_AWARE_PROMPT = """
You are a highly empathetic conversational assistant.
Follow these rules:
 
1. If the user sounds rushed or anxious:
   - Respond in a calm tone
   - Deliver key points concisely
   - Use reassuring language like "No worries, let me help"
 
2. If the user sounds sad or discouraged:
   - Show empathy before providing information
   - Use accepting language like "I understand that's frustrating"
   - Present solutions as options, not directives
 
3. If the user sounds happy or enthusiastic:
   - Match their energy with an upbeat response
   - Feel free to include light humor
 
4. If the user sounds angry or frustrated:
   - Lead with acknowledgment and empathy
   - Focus on problem-solving with concrete steps
   - Avoid defensiveness; offer actionable solutions
"""

Pattern 2: Metadata Extraction

Have the model return structured emotion metadata, giving your application layer fine-grained control over response logic.

METADATA_EXTRACTION_PROMPT = """
For each user voice input, return two things:
 
1. JSON emotion metadata (at the start of your text response):
{
  "emotion": "happy|sad|angry|anxious|neutral",
  "confidence": 0.0-1.0,
  "energy_level": "high|medium|low"
}
 
2. Then your normal response text
 
Example:
{"emotion": "anxious", "confidence": 0.8, "energy_level": "high"}
You seem to be in a hurry. Let me get right to your question.
"""
 
import json
 
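def process_response_with_emotion(response_text):
    """Split a response into (emotion metadata, reply text).

    Falls back to a neutral, zero-confidence result when the first line
    is not a JSON object containing an "emotion" key.
    """
    fallback = {"emotion": "neutral", "confidence": 0.0, "energy_level": "medium"}
    lines = response_text.strip().split("\n", 1)

    try:
        emotion_data = json.loads(lines[0])
    except json.JSONDecodeError:
        return fallback, response_text

    # json.loads also accepts bare numbers and strings; require an object
    if not isinstance(emotion_data, dict) or "emotion" not in emotion_data:
        return fallback, response_text

    reply_text = lines[1].strip() if len(lines) > 1 else ""
    return emotion_data, reply_text

# Usage example (sample text standing in for response.text)
sample_text = (
    '{"emotion": "anxious", "confidence": 0.8, "energy_level": "high"}\n'
    "You seem to be in a hurry. Let me get right to your question."
)
emotion, reply = process_response_with_emotion(sample_text)
print(f"Detected emotion: {emotion['emotion']} ({emotion['confidence']:.0%})")
print(f"Reply: {reply}")
# Example output:
# Detected emotion: anxious (80%)
# Reply: You seem to be in a hurry. Let me get right to your question.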
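Because the metadata is produced by the model, the application layer cannot assume it follows the requested schema: labels may fall outside the allowed set and confidence may be out of range or not a number. A minimal validation sketch follows. The allowed values are taken from the prompt above, while `validate_emotion_metadata` and the fallback values are assumptions made for illustration.

```python
ALLOWED_EMOTIONS = {"happy", "sad", "angry", "anxious", "neutral"}
ALLOWED_ENERGY = {"high", "medium", "low"}
# Assumed fallback: neutral with zero confidence, so downstream logic ignores it
FALLBACK = {"emotion": "neutral", "confidence": 0.0, "energy_level": "medium"}

def validate_emotion_metadata(raw) -> dict:
    """Return metadata that is guaranteed to match the prompt's schema."""
    if not isinstance(raw, dict):
        return dict(FALLBACK)

    emotion = raw.get("emotion")
    if not isinstance(emotion, str) or emotion not in ALLOWED_EMOTIONS:
        return dict(FALLBACK)

    try:
        confidence = float(raw.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(1.0, max(0.0, confidence))  # clamp into [0, 1]

    energy = raw.get("energy_level")
    if not isinstance(energy, str) or energy not in ALLOWED_ENERGY:
        energy = "medium"

    return {"emotion": emotion, "confidence": confidence, "energy_level": energy}

clean = validate_emotion_metadata(
    {"emotion": "anxious", "confidence": 1.7, "energy_level": "HIGH"}
)
print(clean)
# {'emotion': 'anxious', 'confidence': 1.0, 'energy_level': 'medium'}
```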

Pattern 3: Multi-Turn Emotion Tracking

Track emotion transitions across the entire conversation and adjust responses based on long-term patterns.

from collections import deque
from datetime import datetime
 
class EmotionTracker:
    """Tracks emotion transitions during a conversation"""
 
    def __init__(self, window_size=10):
        self.history = deque(maxlen=window_size)
 
    def add(self, emotion_data):
        self.history.append({
            "timestamp": datetime.now().isoformat(),
            **emotion_data
        })
 
    def get_trend(self):
        """Analyze emotion trend"""
        if len(self.history) < 3:
            return "insufficient_data"
 
        recent = list(self.history)[-3:]
        emotions = [e["emotion"] for e in recent]
 
        negative = ("sad", "angry", "anxious")

        # Consistently negative: may need special handling
        if all(e in negative for e in emotions):
            return "persistent_negative"

        # Improving trajectory: started negative, ended positive or neutral
        if emotions[-1] in ("happy", "neutral") and emotions[0] in negative:
            return "improving"
 
        return "stable"
 
    def should_escalate(self):
        """Determine if escalation is needed"""
        trend = self.get_trend()
        return trend == "persistent_negative"
 
# Usage example
tracker = EmotionTracker()
tracker.add({"emotion": "anxious", "confidence": 0.8, "energy_level": "high"})
tracker.add({"emotion": "angry", "confidence": 0.7, "energy_level": "high"})
tracker.add({"emotion": "angry", "confidence": 0.9, "energy_level": "high"})
 
if tracker.should_escalate():
    print("Escalation recommended: persistent negative emotion detected")
# Output: Escalation recommended: persistent negative emotion detected
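The tracker above treats every detection equally, so three low-confidence "angry" guesses trigger escalation just as three confident ones do. One possible refinement, sketched here as an assumption rather than as part of the original design, is an exponentially weighted score in which each negative turn contributes its confidence. `negative_score`, `alpha`, and the 0.6 threshold are hypothetical names and values.

```python
NEGATIVE = {"sad", "angry", "anxious"}

def negative_score(history, alpha: float = 0.5) -> float:
    """Exponentially weighted negativity in [0, 1]; recent turns count more."""
    score = 0.0
    for entry in history:
        is_negative = entry.get("emotion") in NEGATIVE
        signal = entry.get("confidence", 0.0) if is_negative else 0.0
        score = alpha * signal + (1 - alpha) * score
    return score

confident = [
    {"emotion": "anxious", "confidence": 0.8},
    {"emotion": "angry", "confidence": 0.7},
    {"emotion": "angry", "confidence": 0.9},
]
uncertain = [{"emotion": "angry", "confidence": 0.2}] * 3

ESCALATION_THRESHOLD = 0.6  # illustrative value; tune on real conversations

print(round(negative_score(confident), 3))   # 0.725
print(round(negative_score(uncertain), 3))   # 0.175
print(negative_score(confident) >= ESCALATION_THRESHOLD)  # True
print(negative_score(uncertain) >= ESCALATION_THRESHOLD)  # False
```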

Use Case Architectures

Use Case 1: Customer Support

[Customer voice] → [Flash Live API]
                    │
                    ├── Emotion analysis → [EmotionTracker]
                    │                           │
                    │                    Escalation check
                    │                           │
                    │                    ├─ Normal → AI responds
                    │                    └─ Escalate → Transfer to human agent
                    │
                    └── Transcription → [Knowledge base search]
                                           │
                                    [Answer generation + speech synthesis]
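The routing in this diagram can be sketched as a single function per conversation turn. Everything below is a hypothetical illustration: `KNOWLEDGE_BASE` is an in-memory stand-in for a real search backend, and `handle_turn` expects the escalation decision to come from something like `EmotionTracker.should_escalate()`.

```python
# Hypothetical in-memory knowledge base; a real system would query a search index
KNOWLEDGE_BASE = {
    "reset password": "You can reset your password from the account settings page.",
    "refund": "Refunds are processed within a few business days.",
}

def search_knowledge_base(transcript: str):
    """Return the first answer whose key phrase appears in the transcript."""
    lowered = transcript.lower()
    for phrase, answer in KNOWLEDGE_BASE.items():
        if phrase in lowered:
            return answer
    return None

def handle_turn(transcript: str, should_escalate: bool) -> dict:
    """Route one turn: hand off to a human, or let the AI answer."""
    if should_escalate:
        return {"route": "human_agent", "reply": None}
    answer = search_knowledge_base(transcript)
    if answer is None:
        answer = "Could you tell me a bit more about the issue?"
    return {"route": "ai", "reply": answer}

print(handle_turn("I need to reset password for my account", should_escalate=False))
print(handle_turn("This is the third time I am calling!", should_escalate=True))
```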

Use Case 2: Education App

[Learner's voice] → [Flash Live API]
                    │
                    ├── Comprehension estimation (from vocal tone)
                    │       │
                    │       ├─ Understanding → Advance to next topic
                    │       ├─ Confused → Try different explanation approach
                    │       └─ Bored → Switch to interactive exercises
                    │
                    └── Question analysis → [Course material database]
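The branching on learner state reduces to a lookup table. In this sketch the three states and their actions come from the diagram above. The confidence floor and the `ask_clarifying_question` fallback are assumptions added for illustration, on the reasoning that an uncertain estimate should not change the lesson plan.

```python
# State-to-action mapping taken from the diagram
NEXT_ACTION = {
    "understanding": "advance_to_next_topic",
    "confused": "try_different_explanation",
    "bored": "switch_to_interactive_exercise",
}

def choose_next_action(
    learner_state: str, confidence: float, min_confidence: float = 0.6
) -> str:
    """Pick the next teaching step; fall back to asking when unsure."""
    if confidence < min_confidence:
        return "ask_clarifying_question"   # assumed fallback, not in the diagram
    return NEXT_ACTION.get(learner_state, "ask_clarifying_question")

print(choose_next_action("confused", 0.9))   # try_different_explanation
print(choose_next_action("bored", 0.3))      # ask_clarifying_question
```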

Use Case 3: Healthcare (Wellness Check)

[User's voice] → [Flash Live API]
                    │
                    ├── Emotion trend analysis
                    │       │
                    │       ├─ Stable → Continue normal conversation
                    │       ├─ Improving → Provide positive feedback
                    │       └─ Persistent negative → Suggest professional consultation
                    │
                    └── Daily report generation → [Dashboard]
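For the daily report branch, a minimal aggregation might count emotion labels per day and report the share of negative turns. The report fields below are hypothetical, since the dashboard's schema is not specified here, and `build_daily_report` is an illustrative name.

```python
from collections import Counter

NEGATIVE = ("sad", "angry", "anxious")

def build_daily_report(entries) -> dict:
    """Summarize one day of emotion records for a dashboard."""
    counts = Counter(entry["emotion"] for entry in entries)
    total = sum(counts.values())
    negative = sum(counts[name] for name in NEGATIVE)
    return {
        "turns": total,
        "emotion_counts": dict(counts),
        # Guard against division by zero on days without conversations
        "negative_share": round(negative / total, 2) if total else 0.0,
    }

day = [
    {"emotion": "neutral"},
    {"emotion": "sad"},
    {"emotion": "anxious"},
    {"emotion": "happy"},
]
report = build_daily_report(day)
print(report)
```

Only aggregated counts leave this function, which keeps individual utterances out of the dashboard.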

Performance Optimization

Minimizing Latency

For a natural conversational experience, aim to keep end-to-end latency below 200ms.

Audio chunk size: Send small chunks (512–1024 samples) for responsive recognition
Buffering strategy: Minimize input buffering; absorb network jitter with output buffering
Region selection: Choose the API endpoint geographically closest to your users
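Chunk size translates directly into latency, because a chunk cannot be sent until it has been fully recorded. The arithmetic for the 16 kHz, 16-bit mono format used earlier is shown below; `chunk_stats` is a hypothetical helper name.

```python
def chunk_stats(samples: int, sample_rate: int = 16000,
                bytes_per_sample: int = 2, channels: int = 1):
    """Return (duration in ms, size in bytes) for one PCM chunk."""
    duration_ms = samples / sample_rate * 1000
    size_bytes = samples * bytes_per_sample * channels
    return duration_ms, size_bytes

for samples in (512, 1024):
    duration_ms, size_bytes = chunk_stats(samples)
    print(f"{samples} samples -> {duration_ms} ms, {size_bytes} bytes")
# 512 samples -> 32.0 ms, 1024 bytes
# 1024 samples -> 64.0 ms, 2048 bytes
```

At 1024 samples, capture alone consumes 64 ms of a 200 ms budget before any network or model time, which is why smaller chunks help responsiveness.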

Ensuring Audio Quality

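# Noise gate: Remove ambient noise
def apply_noise_gate(audio_chunk, threshold=200):
    """Replace sub-threshold audio with silence"""
    # np.frombuffer returns a read-only view of the bytes, so copy before editing
    audio_array = np.frombuffer(audio_chunk, dtype=np.int16).copy()
    # Compare in int32: abs(-32768) overflows in int16
    mask = np.abs(audio_array.astype(np.int32)) < threshold
    audio_array[mask] = 0
    return audio_array.tobytes()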
 
# Volume normalization: Keep input levels consistent
def normalize_volume(audio_chunk, target_rms=3000, max_gain=10.0):
    """Normalize volume based on RMS value"""
    audio_array = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float32)
    current_rms = np.sqrt(np.mean(audio_array ** 2))
    if current_rms > 0:
        # Cap the gain so near-silent chunks are not boosted into loud noise
        gain = min(target_rms / current_rms, max_gain)
        audio_array = np.clip(audio_array * gain, -32768, 32767)
    return audio_array.astype(np.int16).tobytes()

Security and Privacy Considerations

Voice data is among the most sensitive categories of personal information. Keep these points in mind during implementation.

Voice data storage: Avoid persisting raw audio server-side; process it as a stream and discard each chunk once it has been handled
Emotion data handling: Anonymize emotion analysis results and avoid linking them to individual identities
User disclosure: Clearly inform users that emotion analysis is active and obtain consent
Data encryption: Use TLS (wss://) for all WebSocket communication
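One way to keep emotion results from being tied to a named individual is to store them under a keyed hash instead of the user ID. The sketch below uses HMAC-SHA256 from the standard library. The record layout and function names are hypothetical, and the secret is hardcoded only for the demo; in practice it would come from a secrets manager. Strictly speaking this is pseudonymization rather than full anonymization, because whoever holds the key can recompute the mapping.

```python
import hashlib
import hmac

def pseudonymize(user_id: str, secret: bytes) -> str:
    """Derive a stable, non-reversible token from a user ID."""
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def to_stored_record(user_id: str, emotion_data: dict, secret: bytes) -> dict:
    """Keep only what analytics needs; no raw ID, audio, or transcript."""
    return {
        "subject": pseudonymize(user_id, secret),
        "emotion": emotion_data["emotion"],
        # Coarsen confidence to one decimal to reduce detail in stored data
        "confidence": round(emotion_data["confidence"], 1),
    }

DEMO_SECRET = b"demo-only-secret"  # assumption: load from secure storage in practice

record = to_stored_record(
    "user-42", {"emotion": "anxious", "confidence": 0.83}, DEMO_SECRET
)
print(record["emotion"], record["confidence"], len(record["subject"]))
# anxious 0.8 16
```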

Summary

The Gemini 3.1 Flash Live API adds a new dimension to voice application development. Combining acoustic emotion understanding, 128,000-token context retention, and real-time streaming enables emotion-responsive voice applications across customer support, education, healthcare, and beyond.