API / SDK · 2026-03-29 · Advanced

Building Emotion-Aware Voice Apps with the Gemini Live API — A

Build voice applications that analyze user emotions in real time using the Gemini 3.1 Flash Live API. Covers architecture design patterns, implementation code, and optimization strategies.

Tags: gemini-api, live-api, voice-app, emotion-recognition, real-time, python

Setup and context — The Era of Emotion-Aware Voice Applications

With the arrival of Gemini 3.1 Flash Live, building voice applications that understand and respond to user emotions has become a practical reality. Earlier voice AI analyzed only the transcribed text of what was said; Flash Live also estimates emotion from acoustic features such as pitch, speed, and intonation.

Prerequisites and Setup

Requirements

  • Python 3.11 or later
  • Google AI Studio API key (with Gemini Live API preview access)
  • pyaudio or sounddevice (for microphone input)
  • WebSocket-compatible library

Installation

# Install required libraries (the Live API client ships in the google-genai SDK)
pip install google-genai pyaudio websockets numpy
 
# Set environment variable
export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"

Understanding the Live API Architecture

Communication Model

The Gemini Live API uses WebSocket-based bidirectional real-time communication. Unlike HTTP request/response, you can stream audio data continuously while simultaneously receiving responses.

[Client] ←WebSocket→ [Gemini Live API]
    │                           │
    ├─ Send audio chunk ──────→ │
    │                           ├─ Real-time analysis
    ├─ Send audio chunk ──────→ │
    │                           ├─ Emotion estimation + transcription
    │ ←───── Response (audio+text) ┤
    │                           │
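
To make the full-duplex model concrete, here is a minimal local simulation in which two asyncio queues stand in for the WebSocket. Nothing in it touches the Gemini API: `fake_server`, `client_send`, `client_receive`, and `run_demo` are hypothetical names, and the "analysis" is a placeholder that only reports chunk sizes. The point is the shape of the code: sending and receiving run as separate tasks on one connection.

```python
import asyncio


async def fake_server(uplink: asyncio.Queue, downlink: asyncio.Queue) -> None:
    """Stand-in for the remote end: answers every chunk as soon as it arrives."""
    while True:
        chunk = await uplink.get()
        if chunk is None:              # sentinel: the client finished sending
            await downlink.put(None)
            return
        await downlink.put(f"ack {len(chunk)} bytes")


async def client_send(uplink: asyncio.Queue, chunks: list[bytes]) -> None:
    """Push audio chunks without waiting for any reply."""
    for chunk in chunks:
        await uplink.put(chunk)
        await asyncio.sleep(0)         # yield so the other tasks can run
    await uplink.put(None)


async def client_receive(downlink: asyncio.Queue) -> list[str]:
    """Collect responses while sending may still be in progress."""
    received = []
    while (message := await downlink.get()) is not None:
        received.append(message)
    return received


async def run_demo() -> list[str]:
    uplink, downlink = asyncio.Queue(), asyncio.Queue()
    chunks = [b"\x00" * 2048, b"\x00" * 2048, b"\x00" * 1024]
    # All three coroutines run concurrently on the same event loop
    _, _, received = await asyncio.gather(
        fake_server(uplink, downlink),
        client_send(uplink, chunks),
        client_receive(downlink),
    )
    return received
```

Calling `asyncio.run(run_demo())` returns one acknowledgement per chunk, in the order the chunks were sent.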

Session Lifecycle

A Live API session moves through four phases, in order:

  • Connect: Establish WebSocket connection, send model and session configuration
  • Setup: Configure audio format, response modalities, voice presets
  • Stream: Continuously send and receive audio chunks
  • Disconnect: End session and release resources
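
One way to keep application code honest about this ordering is a small guard object that rejects out-of-order transitions. The sketch below is an illustration only: `Phase` and `SessionLifecycle` are hypothetical names, not part of any Gemini SDK, and allowing a jump to Disconnect from any phase is an assumption made so that errors can still release resources.

```python
from enum import Enum, auto


class Phase(Enum):
    CONNECT = auto()
    SETUP = auto()
    STREAM = auto()
    DISCONNECT = auto()


# Each phase may only advance to the next one in the list above.
_NEXT = {
    Phase.CONNECT: Phase.SETUP,
    Phase.SETUP: Phase.STREAM,
    Phase.STREAM: Phase.DISCONNECT,
}


class SessionLifecycle:
    """Hypothetical guard that rejects out-of-order phase changes."""

    def __init__(self) -> None:
        self.phase = Phase.CONNECT

    def advance(self, target: Phase) -> None:
        # Any live phase may also jump straight to DISCONNECT (error path)
        allowed = {_NEXT.get(self.phase), Phase.DISCONNECT}
        if self.phase is Phase.DISCONNECT or target not in allowed:
            raise RuntimeError(
                f"cannot move from {self.phase.name} to {target.name}"
            )
        self.phase = target
```

A caller would invoke `advance(Phase.SETUP)` after the socket opens, `advance(Phase.STREAM)` once configuration is acknowledged, and `advance(Phase.DISCONNECT)` in a `finally` block.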

Core Implementation: Real-Time Voice Conversation

Step 1: Establishing a Session

import os

from google import genai            # provided by the google-genai package
from google.genai import types

# The Live API is exposed through the google-genai SDK's async client
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

MODEL = "gemini-3.1-flash-live"

# Session configuration
session_config = types.LiveConnectConfig(
    # Some Live models accept only one response modality per session;
    # if the connection is rejected, request "AUDIO" alone
    response_modalities=["AUDIO", "TEXT"],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Aoede")
        )
    ),
    system_instruction="""
    You are an emotionally aware conversational assistant.
    Estimate the user's emotions from their vocal tone and
    speaking style, then adjust your response accordingly.
    If they sound rushed, be concise and reassuring.
    If they sound down, show empathy before providing information.
    """,
)

# A Live session is an async context manager, so the connection itself
# is opened in main() at the end of Step 2 and closed automatically on exit.

Step 2: Audio Input/Output Handling

import asyncio

import numpy as np
import pyaudio

# Audio configuration
SEND_RATE = 16000      # Microphone sample rate (16 kHz)
RECEIVE_RATE = 24000   # Live API audio output is documented as 24 kHz PCM
CHUNK = 1024           # Samples per chunk
FORMAT = pyaudio.paInt16
CHANNELS = 1

audio = pyaudio.PyAudio()

# Microphone input stream
input_stream = audio.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=SEND_RATE,
    input=True,
    frames_per_buffer=CHUNK
)

# Speaker output stream
output_stream = audio.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RECEIVE_RATE,
    output=True,
    frames_per_buffer=CHUNK
)

async def send_audio(session):
    """Send microphone audio to the Live API"""
    while True:
        # Run the blocking read in a worker thread so the event loop stays free
        data = await asyncio.to_thread(
            input_stream.read, CHUNK, exception_on_overflow=False
        )
        audio_array = np.frombuffer(data, dtype=np.int16)

        # Silence detection (skip below threshold); widen to int32 so
        # abs(-32768) does not overflow
        if np.abs(audio_array.astype(np.int32)).mean() < 100:
            continue

        await session.send_realtime_input(
            audio=types.Blob(data=data, mime_type=f"audio/pcm;rate={SEND_RATE}")
        )

async def receive_response(session):
    """Receive and play responses from the Live API"""
    while True:
        # receive() yields the messages of one model turn, then ends
        async for response in session.receive():
            if response.data:
                await asyncio.to_thread(output_stream.write, response.data)
            if response.text:
                print(f"[Gemini] {response.text}")

async def main():
    async with client.aio.live.connect(model=MODEL, config=session_config) as session:
        print("Session established")
        try:
            # Send and receive concurrently on the same session
            await asyncio.gather(send_audio(session), receive_response(session))
        finally:
            # Release audio devices when the session ends
            input_stream.close()
            output_stream.close()
            audio.terminate()

asyncio.run(main())

Emotion Recognition Layer Design Patterns

Leveraging Flash Live's emotion recognition requires careful design at two layers: system prompt design and response control.

Pattern 1: Prompt-Driven

The simplest approach. Encode emotion-handling rules in the system prompt and let the model's internal judgment handle the rest.

EMOTION_AWARE_PROMPT = """
You are a highly empathetic conversational assistant.
Follow these rules:
 
1. If the user sounds rushed or anxious:
   - Respond in a calm tone
   - Deliver key points concisely
   - Use reassuring language like "No worries, let me help"
 
2. If the user sounds sad or discouraged:
   - Show empathy before providing information
   - Use accepting language like "I understand that's frustrating"
   - Present solutions as options, not directives
 
3. If the user sounds happy or enthusiastic:
   - Match their energy with an upbeat response
   - Feel free to include light humor
 
4. If the user sounds angry or frustrated:
   - Lead with acknowledgment and empathy
   - Focus on problem-solving with concrete steps
   - Avoid defensiveness; offer actionable solutions
"""

Pattern 2: Metadata Extraction

Have the model return structured emotion metadata, giving your application layer fine-grained control over response logic.

METADATA_EXTRACTION_PROMPT = """
For each user voice input, return two things:
 
1. JSON emotion metadata (at the start of your text response):
{
  "emotion": "happy|sad|angry|anxious|neutral",
  "confidence": 0.0-1.0,
  "energy_level": "high|medium|low"
}
 
2. Then your normal response text
 
Example:
{"emotion": "anxious", "confidence": 0.8, "energy_level": "high"}
You seem to be in a hurry. Let me get right to your question.
"""
 
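import json

def process_response_with_emotion(response_text):
    """Split a response into (emotion metadata, reply text)"""
    fallback = {"emotion": "neutral", "confidence": 0.5, "energy_level": "medium"}
    lines = response_text.strip().split("\n", 1)

    try:
        emotion_data = json.loads(lines[0])
    except json.JSONDecodeError:
        # No JSON header: treat the whole text as the reply
        return fallback, response_text

    if not isinstance(emotion_data, dict):
        # Valid JSON but not an object (e.g. a bare number)
        return fallback, response_text

    reply_text = lines[1].strip() if len(lines) > 1 else ""
    return emotion_data, reply_text

# Usage example (in the app, pass the text received from the session)
sample_text = (
    '{"emotion": "anxious", "confidence": 0.8, "energy_level": "high"}\n'
    "You seem to be in a hurry. Let me get right to your question."
)
emotion, reply = process_response_with_emotion(sample_text)
print(f"Detected emotion: {emotion['emotion']} ({emotion['confidence']:.0%})")
print(f"Reply: {reply}")
# Example output:
# Detected emotion: anxious (80%)
# Reply: You seem to be in a hurry. Let me get right to your question.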
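
Model output is not guaranteed to follow the requested schema, so it can help to normalize the parsed metadata before the application acts on it. The following is a minimal sketch under that assumption: `normalize_emotion_metadata` is a hypothetical helper, and the fallback values ("neutral", 0.0, "medium") are arbitrary choices rather than anything the API prescribes.

```python
ALLOWED_EMOTIONS = {"happy", "sad", "angry", "anxious", "neutral"}
ALLOWED_ENERGY = {"high", "medium", "low"}


def normalize_emotion_metadata(raw: object) -> dict:
    """Coerce parsed JSON into the schema the prompt asks for."""
    data = raw if isinstance(raw, dict) else {}

    # Unknown or missing labels fall back to "neutral"
    emotion = str(data.get("emotion", "")).lower()
    if emotion not in ALLOWED_EMOTIONS:
        emotion = "neutral"

    # Non-numeric confidence becomes 0.0; numeric values are clamped to [0, 1]
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(1.0, max(0.0, confidence))

    energy = str(data.get("energy_level", "")).lower()
    if energy not in ALLOWED_ENERGY:
        energy = "medium"

    return {"emotion": emotion, "confidence": confidence, "energy_level": energy}
```

Running the parsed header through a function like this means downstream code can index `emotion`, `confidence`, and `energy_level` without further checks.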

Pattern 3: Multi-Turn Emotion Tracking

Track emotion transitions across the entire conversation and adjust responses based on long-term patterns.

from collections import deque
from datetime import datetime
 
class EmotionTracker:
    """Tracks emotion transitions during a conversation"""
 
    def __init__(self, window_size=10):
        self.history = deque(maxlen=window_size)
 
    def add(self, emotion_data):
        self.history.append({
            "timestamp": datetime.now().isoformat(),
            **emotion_data
        })
 
    def get_trend(self):
        """Analyze emotion trend"""
        if len(self.history) < 3:
            return "insufficient_data"
 
        recent = list(self.history)[-3:]
        emotions = [e["emotion"] for e in recent]
 
        negative = ("sad", "angry", "anxious")

        # Consistently negative: may need special handling
        if all(e in negative for e in emotions):
            return "persistent_negative"

        # Improving trajectory: started negative, ended positive or neutral
        if emotions[-1] in ("happy", "neutral") and emotions[0] in negative:
            return "improving"
 
        return "stable"
 
    def should_escalate(self):
        """Determine if escalation is needed"""
        trend = self.get_trend()
        return trend == "persistent_negative"
 
# Usage example
tracker = EmotionTracker()
tracker.add({"emotion": "anxious", "confidence": 0.8, "energy_level": "high"})
tracker.add({"emotion": "angry", "confidence": 0.7, "energy_level": "high"})
tracker.add({"emotion": "angry", "confidence": 0.9, "energy_level": "high"})
 
if tracker.should_escalate():
    print("Escalation recommended: persistent negative emotion detected")
# Output: Escalation recommended: persistent negative emotion detected
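
The tracker above treats every reading equally, even though each one carries a confidence score. One possible refinement, sketched here as an assumption rather than as part of the original design, is to ignore low-confidence readings and weight the rest by confidence. `dominant_emotion` is a hypothetical helper and the 0.6 cutoff is an arbitrary starting point to tune.

```python
from collections import defaultdict


def dominant_emotion(history, min_confidence=0.6):
    """Return the emotion with the highest summed confidence, or None."""
    totals = defaultdict(float)
    for entry in history:
        confidence = entry.get("confidence", 0.0)
        if confidence >= min_confidence:       # ignore uncertain readings
            totals[entry["emotion"]] += confidence
    if not totals:
        return None                            # nothing confident enough
    return max(totals, key=totals.get)
```

It accepts any iterable of dictionaries with `emotion` and `confidence` keys, so it can be called as `dominant_emotion(tracker.history)`.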

Use Case Architectures

Use Case 1: Customer Support

[Customer voice] → [Flash Live API]
                    │
                    ├── Emotion analysis → [EmotionTracker]
                    │                           │
                    │                    Escalation check
                    │                           │
                    │                    ├─ Normal → AI responds
                    │                    └─ Escalate → Transfer to human agent
                    │
                    └── Transcription → [Knowledge base search]
                                           │
                                    [Answer generation + speech synthesis]

Use Case 2: Education App

[Learner's voice] → [Flash Live API]
                    │
                    ├── Comprehension estimation (from vocal tone)
                    │       │
                    │       ├─ Understanding → Advance to next topic
                    │       ├─ Confused → Try different explanation approach
                    │       └─ Bored → Switch to interactive exercises
                    │
                    └── Question analysis → [Course material database]

Use Case 3: Healthcare (Wellness Check)

[User's voice] → [Flash Live API]
                    │
                    ├── Emotion trend analysis
                    │       │
                    │       ├─ Stable → Continue normal conversation
                    │       ├─ Improving → Provide positive feedback
                    │       └─ Persistent negative → Suggest professional consultation
                    │
                    └── Daily report generation → [Dashboard]
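
The daily report branch could be as simple as aggregating the day's readings into counts for the dashboard. This is a sketch under assumptions: `build_daily_report` is a hypothetical helper, the report fields are invented for illustration, and a real wellness product would need clinical input on what to surface.

```python
from collections import Counter


def build_daily_report(readings: list[dict]) -> dict:
    """Summarize one day's emotion readings for a dashboard."""
    counts = Counter(r["emotion"] for r in readings)
    total = sum(counts.values())
    negative = sum(counts[e] for e in ("sad", "angry", "anxious"))
    return {
        "total_readings": total,
        "emotion_counts": dict(counts),
        # Share of readings that were negative; 0.0 when there is no data
        "negative_share": round(negative / total, 2) if total else 0.0,
    }
```

Storing only these aggregates, rather than per-utterance records, also reduces how much sensitive data the dashboard has to hold.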

Performance Optimization

Minimizing Latency

For a natural conversational experience, aim to keep end-to-end latency below 200 ms.

  • Audio chunk size: Send small chunks (512–1024 samples, or 32–64 ms of audio at 16 kHz) for responsive recognition
  • Buffering strategy: Minimize input buffering; absorb network jitter with output buffering
  • Region selection: Choose the API endpoint geographically closest to your users
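
Chunk size feeds directly into the latency budget, because a chunk cannot be sent until it has been fully captured. The arithmetic can be sketched as follows; the helper names are hypothetical, and the 200 ms default is simply the target stated above.

```python
def chunk_duration_ms(samples: int, sample_rate: int = 16000) -> float:
    """Time span of audio covered by one chunk, in milliseconds."""
    return samples * 1000 / sample_rate


def remaining_budget_ms(samples: int, budget_ms: float = 200.0,
                        sample_rate: int = 16000) -> float:
    """Budget left for network and model time after capturing one chunk."""
    return budget_ms - chunk_duration_ms(samples, sample_rate)
```

At 16 kHz, a 1024-sample chunk covers 64 ms of audio, which leaves 136 ms of a 200 ms budget for the network round trip, model processing, and playback buffering. Halving the chunk to 512 samples frees another 32 ms at the cost of twice as many messages.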

Ensuring Audio Quality

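import numpy as np

# Noise gate: Remove ambient noise
def apply_noise_gate(audio_chunk, threshold=200):
    """Replace sub-threshold audio with silence"""
    # frombuffer returns a read-only view, so copy before modifying
    audio_array = np.frombuffer(audio_chunk, dtype=np.int16).copy()
    # Widen to int32 so abs(-32768) does not overflow int16
    mask = np.abs(audio_array.astype(np.int32)) < threshold
    audio_array[mask] = 0
    return audio_array.tobytes()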
 
# Volume normalization: Keep input levels consistent
def normalize_volume(audio_chunk, target_rms=3000):
    """Normalize volume based on RMS value"""
    audio_array = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float32)
    current_rms = np.sqrt(np.mean(audio_array ** 2))
    if current_rms > 0:
        gain = target_rms / current_rms
        audio_array = np.clip(audio_array * gain, -32768, 32767)
    return audio_array.astype(np.int16).tobytes()

Security and Privacy Considerations

Voice data is among the most sensitive categories of personal information. Keep these points in mind during implementation.

  • Voice data storage: Design the system to avoid persisting raw audio server-side; process audio as a stream and discard it once it has been handled
  • Emotion data handling: Anonymize emotion analysis results and avoid linking them to individual identities
  • User disclosure: Clearly inform users that emotion analysis is active and obtain consent
  • Data encryption: Use TLS (wss://) for all WebSocket communication
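
For the emotion data point in particular, one common approach is to store records under a keyed pseudonym instead of the real user ID, and to keep only the fields analytics needs. The sketch below assumes that approach; `pseudonymize_user_id` and `to_stored_record` are hypothetical helpers, and the key would have to be stored separately from the data. This is pseudonymization rather than full anonymization, so regulatory requirements may still apply.

```python
import hashlib
import hmac


def pseudonymize_user_id(user_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym that cannot be linked to the ID without the key."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def to_stored_record(user_id: str, emotion_data: dict, secret_key: bytes) -> dict:
    """Keep only the fields needed for analytics; drop transcripts and raw IDs."""
    return {
        "subject": pseudonymize_user_id(user_id, secret_key),
        "emotion": emotion_data.get("emotion", "neutral"),
        "confidence": emotion_data.get("confidence", 0.0),
    }
```

The same user always maps to the same pseudonym under a given key, so trends can still be tracked per subject, while rotating or destroying the key breaks the link to the original identity.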

Summary

The Gemini 3.1 Flash Live API adds a new dimension to voice application development. Combining acoustic emotion understanding, 128,000-token context retention, and real-time streaming enables emotion-responsive voice applications across customer support, education, healthcare, and beyond.
