Gemini 2.0 Multimodal and Live API: The Complete Guide

Gemini 2.0 was designed from the ground up as a multimodal model rather than a language model with vision added later. It accepts images, video, and audio in the same request, and with the Live API it supports real-time, bidirectional streaming conversations. This opens up a category of applications that was not practical with earlier-generation models.

1. Multimodal Processing in Gemini 2.0

Supported Input Modalities

Gemini 2.0 Flash (and Gemini 2.0 Pro) support the following input types:

Text: Natural language, code, structured data
Images: JPEG, PNG, WebP, GIF (static)
Video: MP4, MOV, AVI, and other major formats (up to 1 hour)
Audio: MP3, WAV, FLAC, AAC, and more
PDF: Document understanding including text extraction

These can be sent individually or in any combination in a single request.
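
To make the "any combination" point concrete, here is a minimal sketch that builds the `{"mime_type": ..., "data": ...}` part dict used in the examples that follow. The helper name `make_inline_part` and the extension table are illustrative assumptions, not part of the SDK; the table covers only some of the formats listed above and should be extended for whatever you actually send.

```python
from pathlib import Path

# Hypothetical lookup table covering a subset of the formats listed above.
MIME_BY_EXTENSION = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".gif": "image/gif",
    ".mp4": "video/mp4",
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
    ".flac": "audio/flac",
    ".pdf": "application/pdf",
}


def make_inline_part(filename: str, data: bytes) -> dict:
    """Return an inline part dict, inferring the MIME type from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MIME_BY_EXTENSION:
        raise ValueError(f"Unsupported file type: {suffix or filename}")
    return {"mime_type": MIME_BY_EXTENSION[suffix], "data": data}


# Text and media parts can be mixed freely in one contents list.
contents = [
    make_inline_part("chart.PNG", b"placeholder image bytes"),
    "Describe this chart.",
]
```

Raising on an unknown extension is a deliberate choice here: a wrong MIME type tends to produce confusing errors later, so failing early is usually easier to debug.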

Image Understanding

import google.generativeai as genai
from pathlib import Path
 
genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")
 
# Read a local image file
image_path = Path("screenshot.png")
image_data = image_path.read_bytes()
 
response = model.generate_content([
    {
        "mime_type": "image/png",
        "data": image_data
    },
    "Analyze this screenshot's UI and identify five specific UX improvements."
])
 
print(response.text)

Using Base64 encoding

import base64
 
with open("chart.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")
 
response = model.generate_content([
    {
        "inline_data": {
            "mime_type": "image/jpeg",
            "data": image_data
        }
    },
    "Extract the data from this chart in tabular format, then analyze the key trends."
])

Video Understanding

Gemini 2.0 can process video files directly — summarizing content, finding specific scenes, or answering questions about what's in the video.

# Upload and process a video using the File API
import google.generativeai as genai
import time
 
genai.configure(api_key="YOUR_GEMINI_API_KEY")
 
# Upload the video file
print("Uploading video...")
video_file = genai.upload_file("tutorial_video.mp4")
 
# Wait for processing to complete
while video_file.state.name == "PROCESSING":
    print("Processing...")
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
 
if video_file.state.name == "FAILED":
    raise ValueError("Video upload failed")
 
# Query the video
model = genai.GenerativeModel("gemini-2.0-flash")
 
response = model.generate_content([
    video_file,
    "Please summarize this video in the following format:\n"
    "1. Main theme\n"
    "2. Key points (3–5)\n"
    "3. Conclusion or next actions"
])
 
print(response.text)
 
# Clean up the uploaded file
genai.delete_file(video_file.name)
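
The polling loop above waits indefinitely if a file never leaves the PROCESSING state. One way to bound the wait is sketched below. The helper `wait_until_ready` is hypothetical, and it takes a `get_state` callable so it can be exercised without the API; in real use you might pass something like `lambda: genai.get_file(video_file.name).state.name`.

```python
import time
from typing import Callable


def wait_until_ready(
    get_state: Callable[[], str],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until it stops returning "PROCESSING" or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state != "PROCESSING":
            return state
        if clock() >= deadline:
            raise TimeoutError(f"Still processing after {timeout_s} seconds")
        sleep(poll_interval_s)


# Stand-in for the real state lookup: two PROCESSING polls, then ACTIVE.
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
final_state = wait_until_ready(lambda: next(states), sleep=lambda _: None)
print(final_state)  # ACTIVE
```

The caller still needs to check whether the returned state is "FAILED", as the original example does.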

Audio Processing

# Transcribe and analyze an audio recording
with open("meeting_recording.mp3", "rb") as f:
    audio_data = f.read()
 
response = model.generate_content([
    {
        "inline_data": {
            "mime_type": "audio/mp3",
            "data": base64.b64encode(audio_data).decode("utf-8")
        }
    },
    "Transcribe this meeting recording and extract:\n"
    "1. Decisions made\n"
    "2. Action items (with owner and deadline)\n"
    "3. Open questions"
])

2. Combined Multimodal Requests

Processing Multiple Images in One Request

Gemini 2.0 can receive and reason about multiple images in a single request.

# Compare two images
before_image = Path("before.png").read_bytes()
after_image = Path("after.png").read_bytes()
 
response = model.generate_content([
    "Before image:",
    {"mime_type": "image/png", "data": before_image},
    "After image:",
    {"mime_type": "image/png", "data": after_image},
    "Compare these two images in detail. List every design change, addition, and removal you can identify."
])
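
The same label-then-image pattern extends to more than two images. The sketch below, built around a hypothetical helper `build_comparison_contents`, assembles the contents list from a dict of labels to image bytes. It assumes every image shares one MIME type, which is a simplification.

```python
def build_comparison_contents(
    labeled_images: dict[str, bytes],
    question: str,
    mime_type: str = "image/png",
) -> list:
    """Put a text label before each image, then append the question at the end."""
    contents: list = []
    for label, image_bytes in labeled_images.items():
        contents.append(f"{label}:")
        contents.append({"mime_type": mime_type, "data": image_bytes})
    contents.append(question)
    return contents


contents = build_comparison_contents(
    {"Before image": b"placeholder bytes", "After image": b"placeholder bytes"},
    "List every design change between these images.",
)
print(len(contents))  # 5
```

Placing a label directly before each image gives the prompt a name to refer to, which can help when asking about "the second screenshot" or similar.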

Image + Text + Structured Output

import json
 
# Extract structured data from a product image
with open("product.jpg", "rb") as f:
    product_image = base64.b64encode(f.read()).decode("utf-8")
 
response = model.generate_content(
    contents=[
        {
            "inline_data": {
                "mime_type": "image/jpeg",
                "data": product_image
            }
        },
        """Extract information from this product image in the following JSON format:
        {
          "product_name": "product name",
          "category": "category",
          "colors": ["color1", "color2"],
          "estimated_price_range": "price range (e.g., $10–$50)",
          "features": ["feature1", "feature2", "feature3"],
          "target_audience": "target customer",
          "condition": "new or used"
        }
        Output JSON only."""
    ],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json"
    )
)
 
product_data = json.loads(response.text)
print(product_data)
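
Requesting `application/json` makes the reply parseable, but it does not by itself guarantee that every field from the prompt is present. A minimal sketch of a defensive check follows; the function name `parse_product_json` is an assumption, and the key set is copied from the prompt above.

```python
import json

# Keys taken from the JSON template in the prompt above.
EXPECTED_KEYS = {
    "product_name", "category", "colors", "estimated_price_range",
    "features", "target_audience", "condition",
}


def parse_product_json(raw_text: str) -> dict:
    """Parse the model's JSON reply and fail loudly if expected keys are missing."""
    data = json.loads(raw_text)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data


# Made-up reply used only to exercise the function.
sample_reply = json.dumps({
    "product_name": "Example backpack",
    "category": "bags",
    "colors": ["black"],
    "estimated_price_range": "$10-$50",
    "features": ["padded straps"],
    "target_audience": "students",
    "condition": "new",
})
product = parse_product_json(sample_reply)
print(product["product_name"])  # Example backpack
```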

3. Real-Time Streaming with the Live API

What Is the Live API?

The Live API enables bidirectional real-time streaming with Gemini 2.0. It supports streaming audio, text, and images with low latency, making it possible to build truly conversational, real-time multimodal applications.

Key characteristics:

WebSocket-based bidirectional communication
Real-time voice input and voice output
Real-time processing of screen share or camera feeds
Support for user interruptions

Text Streaming

Start with the simplest case: streaming text responses.

# Standard response streaming via generate_content (this is not the Live API)
model = genai.GenerativeModel("gemini-2.0-flash")
 
response = model.generate_content(
    "Explain how quantum computers work in detail.",
    stream=True
)
 
for chunk in response:
    if chunk.text:
        print(chunk.text, end="", flush=True)
print()

Live API (WebSocket)

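The Live API provides more advanced real-time capabilities. The examples below use the newer google-genai SDK (`from google import genai`), which is a separate package from the google-generativeai SDK used in the earlier sections, so keep the two in separate modules to avoid clashing on the name `genai`.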

import asyncio
from google import genai
 
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
async def live_conversation():
    """Real-time conversation using the Live API."""
    config = {
        "response_modalities": ["TEXT"],
        "system_instruction": "You are a helpful and friendly assistant."
    }
 
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=config
    ) as session:
        print("Live API session started")
 
        # Send a text message
        await session.send(input="Hello! Can you tell me about today's weather?", end_of_turn=True)
 
        # Receive the response; print the speaker label once, then stream the chunks
        print("Gemini: ", end="", flush=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="", flush=True)
            if response.server_content and response.server_content.turn_complete:
                print()
                break
 
asyncio.run(live_conversation())

Voice-In, Voice-Out with the Live API

import asyncio
import pyaudio
from google import genai
from google.genai import types
 
SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK = 1024
 
async def voice_conversation():
    """Real-time voice conversation using the Live API."""
 
    client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"
                )
            )
        )
    )
 
    audio = pyaudio.PyAudio()
 
    # Microphone input stream
    input_stream = audio.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK
    )
 
    # Speaker output stream
    output_stream = audio.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=24000,
        output=True
    )
 
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=config
    ) as session:
 
        async def send_audio():
            """Stream microphone audio to Gemini."""
            while True:
                # read() blocks until a chunk is captured, so run it off the event loop
                data = await asyncio.to_thread(
                    input_stream.read, CHUNK, exception_on_overflow=False
                )
                await session.send(input={"data": data, "mime_type": "audio/pcm"})
 
        async def receive_audio():
            """Play back audio from Gemini."""
            # receive() yields the messages of one model turn, so loop to keep listening
            while True:
                async for response in session.receive():
                    if response.data:
                        output_stream.write(response.data)
 
        # Run send and receive in parallel
        await asyncio.gather(
            send_audio(),
            receive_audio()
        )
 
asyncio.run(voice_conversation())
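
The constants in the voice example determine how much audio each microphone read carries. The short calculation below works that out from the values used above (16 kHz, mono, 1024 frames, 16-bit samples); the two helper functions are illustrative names only.

```python
SAMPLE_RATE = 16000   # Hz, microphone input rate from the example above
CHANNELS = 1
CHUNK = 1024          # frames per read
BYTES_PER_SAMPLE = 2  # paInt16 means 16-bit PCM


def chunk_duration_ms(frames: int, sample_rate: int) -> float:
    """Milliseconds of audio contained in one chunk."""
    return frames * 1000 / sample_rate


def chunk_size_bytes(frames: int, channels: int, bytes_per_sample: int) -> int:
    """Size in bytes of one chunk of raw PCM."""
    return frames * channels * bytes_per_sample


print(chunk_duration_ms(CHUNK, SAMPLE_RATE))                # 64.0
print(chunk_size_bytes(CHUNK, CHANNELS, BYTES_PER_SAMPLE))  # 2048
```

So each message sent to the session carries 64 ms of audio in 2 KB. A smaller CHUNK lowers latency at the cost of more messages per second.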

4. Combining Real-Time Image Processing

Real-Time Camera Analysis

Combining the Live API with camera input enables real-time visual understanding.

import asyncio
import cv2
import base64
from google import genai
from google.genai import types
 
async def realtime_camera_analysis():
    """Continuously analyze camera frames with Gemini."""
 
    client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
    config = types.LiveConnectConfig(
        response_modalities=["TEXT"],
        system_instruction="Continuously analyze what you see in the camera feed. "
                          "Report any significant changes."
    )
 
    cap = cv2.VideoCapture(0)
 
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=config
    ) as session:
 
        frame_count = 0
 
        while True:
            ret, frame = cap.read()
            if not ret:
                break
 
            # Send roughly one frame per second (assumes the camera delivers about 30 fps)
            if frame_count % 30 == 0:
                _, buffer = cv2.imencode(".jpg", frame)
                image_data = base64.b64encode(buffer).decode("utf-8")
 
                await session.send(
                    input=[
                        {"mime_type": "image/jpeg", "data": image_data},
                        "What do you see in this frame?"
                    ],
                    end_of_turn=True
                )
 
                async for response in session.receive():
                    if response.text:
                        print(f"Analysis: {response.text}")
                    if response.server_content and response.server_content.turn_complete:
                        break
 
            frame_count += 1
 
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
 
    cap.release()
 
asyncio.run(realtime_camera_analysis())
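
Counting frames only approximates a one-second interval, because it depends on the camera's actual frame rate and on how long each API round trip takes. A time-based throttle is one alternative. The `FrameThrottle` class below is a hypothetical sketch; the clock is injectable so the behavior can be checked without a camera.

```python
import time
from typing import Callable, Optional


class FrameThrottle:
    """Allow at most one frame per `interval_s` seconds, measured by elapsed time."""

    def __init__(self, interval_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self.interval_s = interval_s
        self.clock = clock
        self._last_sent: Optional[float] = None

    def should_send(self) -> bool:
        """Return True (and record the time) if enough time has passed since the last send."""
        now = self.clock()
        if self._last_sent is None or now - self._last_sent >= self.interval_s:
            self._last_sent = now
            return True
        return False


# Simulated timestamps in seconds for five consecutive frames.
fake_times = iter([0.0, 0.4, 1.0, 1.5, 2.1])
throttle = FrameThrottle(interval_s=1.0, clock=lambda: next(fake_times))
decisions = [throttle.should_send() for _ in range(5)]
print(decisions)  # [True, False, True, False, True]
```

In the camera loop, `if throttle.should_send():` would take the place of the frame-count check.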

5. Best Practices and Implementation Notes

File Size and Rate Limits

The Gemini API has limits on file processing.

Inline data: Up to 20 MB per request
File API: Up to 2 GB per file (stored for 48 hours)
Video token cost: roughly 263 tokens per second of video at default resolution, which is about 16,000 tokens per minute (approximate)
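
The size limits above suggest a simple decision rule: send small files inline and route larger ones through the File API. The sketch below encodes that rule. The function name is hypothetical, and treating "MB" and "GB" as binary units (1024-based) is an assumption; check the current documentation before relying on the exact thresholds.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024          # 20 MB per request (assumed binary units)
FILE_API_LIMIT_BYTES = 2 * 1024 * 1024 * 1024  # 2 GB per file (assumed binary units)


def choose_upload_strategy(file_size_bytes: int, prompt_overhead_bytes: int = 0) -> str:
    """Return "inline" or "file_api"; raise if the file is too large for either."""
    if file_size_bytes > FILE_API_LIMIT_BYTES:
        raise ValueError("File exceeds the 2 GB File API limit; split or compress it")
    # The inline limit applies to the whole request, so count the prompt as well.
    if file_size_bytes + prompt_overhead_bytes <= INLINE_LIMIT_BYTES:
        return "inline"
    return "file_api"


print(choose_upload_strategy(500_000))            # inline
print(choose_upload_strategy(150 * 1024 * 1024))  # file_api
```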

Managing Live API Connections

Live API sessions can drop. Implement reconnection logic for production use.

import asyncio
from google import genai
 
MAX_RETRIES = 3
RETRY_DELAY = 2.0
 
async def robust_live_session(prompt: str) -> str:
    """Live API session with reconnection logic."""
    client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
    for attempt in range(MAX_RETRIES):
        try:
            async with client.aio.live.connect(
                model="gemini-2.0-flash-live-001",
                config={"response_modalities": ["TEXT"]}
            ) as session:
                await session.send(input=prompt, end_of_turn=True)
 
                # Collect every chunk of the turn instead of returning only the first one
                chunks = []
                async for response in session.receive():
                    if response.text:
                        chunks.append(response.text)
                    if response.server_content and response.server_content.turn_complete:
                        break
                return "".join(chunks)
 
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < MAX_RETRIES - 1:
                # Exponential backoff: 2 s, then 4 s, and so on
                await asyncio.sleep(RETRY_DELAY * (2 ** attempt))
            else:
                raise
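
The retry loop sleeps for `RETRY_DELAY * (2 ** attempt)` between attempts. The sketch below prints that schedule so the total wait is easy to see. The `cap` parameter is an addition that is not in the original code; it is included because uncapped exponential delays grow quickly if MAX_RETRIES is raised.

```python
def backoff_delays(max_retries: int, base_delay: float, cap: float = 60.0) -> list[float]:
    """Delays slept between attempts: base_delay * 2**attempt, capped at `cap` seconds."""
    # No sleep follows the final attempt, hence max_retries - 1 entries.
    return [min(base_delay * (2 ** attempt), cap) for attempt in range(max_retries - 1)]


print(backoff_delays(3, 2.0))  # [2.0, 4.0]
print(backoff_delays(8, 2.0))  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With the values used above (3 retries, 2-second base) the worst case adds 6 seconds of waiting before the final error is raised.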

Cost Management

Multimodal processing and Live API sessions consume more tokens than text-only requests. A few habits that help manage costs:

Trim video to only the relevant segments before sending
Compress images appropriately before API calls (1 MB is often more than enough)
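Close Live API sessions you are not actively using: audio or video that keeps streaming into an open session still counts toward input tokens, even when nobody is speaking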
Use gemini-2.0-flash during development; choose your production model based on actual requirements

Wrapping up

Gemini 2.0's multimodal capabilities and the Live API unlock a category of applications that text-only models simply can't deliver. Whether it's analyzing images and documents, understanding the contents of a video, or holding a real-time voice conversation — these tools let you build experiences that feel natural and intuitive to users.

Start simple: a basic image understanding endpoint is a great first step. Then gradually expand toward live interaction as your application matures.

Setup and context: What Multimodal AI Changes for Developers

From text-only AI to multimodal AI that combines images, audio, and video — the Gemini API sits at the frontier of this evolution, offering developers a fundamentally new level of expressive power.

Chapter 1: Gemini's Multimodal Architecture

What "Natively Multimodal" Means

Most earlier AI systems processed text, images, and audio through separate specialized models. Gemini is built from the ground up as a natively multimodal model — all modalities are processed in an integrated way.

Traditional architecture:
Input image → Image recognition model → Text description → LLM → Answer

Gemini architecture:
Input image ─┐
Input text  ─┼─→ Gemini (integrated processing) ─→ Answer
Input audio ─┘

This difference is more significant than it might appear. Native integration enables genuine cross-modal understanding. For example, "Quote what the smiling person in this image is saying" can be handled by Gemini by naturally combining visual and linguistic information in a single pass.

Supported Input Types

Input Type	Supported Models	Max Size/Duration
Text	All models	1M+ tokens
Images (JPEG/PNG/WEBP/GIF)	Gemini 1.5+, 2.0	3,600 images/request
Audio (MP3/WAV/FLAC etc.)	Gemini 1.5+, 2.0	9.5 hours
Video (MP4/MOV etc.)	Gemini 1.5+, 2.0	1 hour
PDF / Documents	Gemini 1.5+, 2.0	1,000 pages
Code	All models	Up to context limit

Chapter 2: Image Processing in Practice

Basic Image Input

import google.generativeai as genai
from PIL import Image
import base64
 
genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")
 
# Method 1: PIL Image object
image = Image.open("product.jpg")
response = model.generate_content([
    "Analyze this product image and write a 100-word description for an e-commerce listing.",
    image
])
print(response.text)
 
# Method 2: Raw bytes (for files received via API)
with open("product.jpg", "rb") as f:
    image_bytes = f.read()
 
response = model.generate_content([
    "List every object visible in this image.",
    {
        "mime_type": "image/jpeg",
        "data": image_bytes  # pass the raw bytes directly; no Base64 step is needed
    }
])

Comparative Analysis of Multiple Images

# Process multiple images in a single call — where multimodal shines
before_image = Image.open("before.jpg")
after_image = Image.open("after.jpg")
 
response = model.generate_content([
    "Compare these two images and provide a detailed analysis of what changed.",
    "Before:", before_image,
    "After:", after_image,
    "Analyze specifically:\n1. Elements added\n2. Elements removed\n3. Elements modified"
])

Extracting Structured Data from Images

import json
 
# Extract structured data from receipts or forms
receipt_image = Image.open("receipt.jpg")
 
response = model.generate_content([
    """Extract data from this receipt in the following JSON format:
    {
      "store_name": "store name",
      "date": "YYYY-MM-DD",
      "items": [
        {"name": "item name", "quantity": quantity, "price": price}
      ],
      "subtotal": subtotal,
      "tax": tax_amount,
      "total": total
    }
    Output JSON only — no surrounding text.""",
    receipt_image
])
 
try:
    data = json.loads(response.text)
    print(f"Total: ${data['total']:.2f}")
except json.JSONDecodeError:
    print(response.text)
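
Even with an instruction to output JSON only, a reply may arrive wrapped in a Markdown code fence, which makes `json.loads` fail. The sketch below strips such a fence before parsing. The helper name `parse_json_reply` is an assumption, and the pattern handles only the simple case of a single fence around the whole reply.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built indirectly so this example embeds cleanly
FENCE_PATTERN = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n{FENCE}\s*$", re.DOTALL)


def parse_json_reply(raw_text: str):
    """Parse JSON from a model reply, tolerating one surrounding code fence."""
    text = raw_text.strip()
    match = FENCE_PATTERN.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)


fenced_reply = FENCE + 'json\n{"total": 12.5}\n' + FENCE
print(parse_json_reply(fenced_reply)["total"])     # 12.5
print(parse_json_reply('{"total": 3}')["total"])   # 3
```

Setting `response_mime_type="application/json"` in the generation config, as shown earlier in this guide, is the more direct fix where it is available; this helper is a fallback for plain-text replies.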

Chapter 3: Audio Processing

Transcription and Analysis

import google.generativeai as genai
 
model = genai.GenerativeModel("gemini-2.0-flash")
 
# Upload and process an audio file
audio_file = genai.upload_file("meeting_recording.mp3")
 
response = model.generate_content([
    audio_file,
    """Process this audio in the following format:
    1. Full transcript (distinguish speakers as [Speaker A]:, [Speaker B]:)
    2. Meeting summary (5 points or fewer)
    3. Action items (with owner and deadline)
    4. Suggested agenda items for the next meeting"""
])

Streaming Audio Processing Pattern

async def process_audio_stream(audio_bytes: bytes):
    """Yield transcript text as the model streams it back."""
    model = genai.GenerativeModel("gemini-2.0-flash")
 
    # Use the async variant: the synchronous response cannot be used with "async for"
    response = await model.generate_content_async(
        [
            {"mime_type": "audio/mp3", "data": base64.b64encode(audio_bytes).decode()},
            "Transcribe this audio in real time."
        ],
        stream=True
    )
 
    async for chunk in response:
        if chunk.text:
            yield chunk.text  # Stream output as it arrives

Chapter 4: Video Analysis in Practice

Analyzing Video Content

import time
 
# Upload video via Files API (for larger files)
video_file = genai.upload_file("product_demo.mp4")
 
# Wait for processing to complete
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
 
if video_file.state.name == "FAILED":
    raise ValueError("Video upload failed")
 
# Analyze the video
model = genai.GenerativeModel("gemini-1.5-pro")  # Pro recommended for long videos
response = model.generate_content([
    video_file,
    """Analyze this product demo video:
    1. List of key features demonstrated (with timestamps)
    2. User interface evaluation
    3. Top 3 improvement suggestions
    4. Standout scenes suitable for marketing materials (with timestamps)"""
])

Combined Video + Text Queries

# Direct attention to a specific scene
response = model.generate_content([
    video_file,
    """Focus on the scene around the 0:45 mark:
    - Describe in detail what's happening
    - Infer the user's emotional state at that moment
    - Provide a concrete improvement suggestion for this scene"""
])

Chapter 5: Designing Integrated Multimodal Systems

Use Case: Automated Product Quality Inspection

class ProductQualityInspector:
    """Quality inspection system integrating product images, spec sheets, and demo video"""
 
    def __init__(self):
        self.model = genai.GenerativeModel("gemini-2.0-flash")
 
    def inspect(
        self,
        product_images: list,
        spec_pdf: str,
        demo_video: str = None
    ) -> dict:
        # Upload spec sheet
        spec_file = genai.upload_file(spec_pdf)
 
        inputs = [spec_file, "Inspect this product against the specification document.\n\n"]
 
        # Add product images
        for i, img_path in enumerate(product_images):
            img = Image.open(img_path)
            inputs.extend([f"\nProduct image {i+1}:", img])
 
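        # Add demo video if available
        if demo_video:
            video_file = genai.upload_file(demo_video)
            # Uploaded videos must finish processing before they can be referenced
            while video_file.state.name == "PROCESSING":
                time.sleep(5)
                video_file = genai.get_file(video_file.name)
            inputs.extend(["\nOperation demo video:", video_file])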
 
        inputs.append("""
        Output inspection results in this JSON format:
        {
          "overall_status": "pass|fail|review_required",
          "score": 0-100,
          "defects": [{"description": "...", "severity": "critical|major|minor", "location": "..."}],
          "spec_compliance": {"compliant": true/false, "deviations": ["deviation items"]},
          "recommendations": ["recommendation 1", ...]
        }
        """)
 
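        # Request JSON output so json.loads does not trip over surrounding prose
        response = self.model.generate_content(
            inputs,
            generation_config=genai.GenerationConfig(
                response_mime_type="application/json"
            )
        )
        return json.loads(response.text)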

Use Case: Multimedia Content Generation Pipeline

import asyncio
 
async def generate_content_from_media(
    source_video: str,
    target_formats: list[str]
) -> dict:
    """
    Auto-generate multiple content types from a single video:
    - Blog article
    - Social media posts
    - YouTube description
    - Subtitle file (SRT)
    """
    model = genai.GenerativeModel("gemini-1.5-pro")
    video = genai.upload_file(source_video)
 
    # Wait until the uploaded video has finished processing
    while video.state.name == "PROCESSING":
        await asyncio.sleep(5)
        video = genai.get_file(video.name)
 
    results = {}
    prompts = {
        "blog": "Write a ~500-word blog article based on this video's content",
        "sns": "Write 3 variations of a social media post (280 chars max each) capturing this video's appeal",
        "youtube_desc": "Write a YouTube description with chapters, approximately 200 words",
        "srt": "Generate a subtitle file for this video in SRT format"
    }
 
    for format_type in target_formats:
        # Await the async variant so the coroutine does not block the event loop
        response = await model.generate_content_async([video, prompts[format_type]])
        results[format_type] = response.text
 
    return results
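
Generated subtitle text is worth a structural check before it is saved, since a reply may include commentary or malformed timing lines. The sketch below is a minimal check, not a full SRT parser: the function name `looks_like_srt` is hypothetical, and it only verifies that each block has a sequential index, a timing line, and at least one line of text.

```python
import re

TIMING_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def looks_like_srt(text: str) -> bool:
    """Return True if every blank-line-separated block is: index, timing line, text."""
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    if not blocks:
        return False
    for expected_index, block in enumerate(blocks, start=1):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            return False
        if lines[0].strip() != str(expected_index):
            return False
        if not TIMING_LINE.match(lines[1].strip()):
            return False
    return True


sample = (
    "1\n00:00:00,000 --> 00:00:02,500\nWelcome.\n\n"
    "2\n00:00:02,500 --> 00:00:05,000\nLet's begin."
)
print(looks_like_srt(sample))               # True
print(looks_like_srt("Here is your file"))  # False
```

It does not confirm that the timestamps line up with the video, which still needs a manual spot check.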

Chapter 6: Cost Optimization

Model Selection Guide

By task:
- Fast & lightweight (Gemini 2.0 Flash Lite):
  → Simple text generation, basic classification
  Est. cost: ~$0.00001/1K tokens

- Standard (Gemini 2.0 Flash):
  → Everyday document processing, image captioning
  Est. cost: ~$0.00015/1K tokens (text)

- High-precision (Gemini 1.5 Pro):
  → Complex reasoning, long documents, video analysis
  Est. cost: ~$0.00125/1K tokens (up to 128K)
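
To compare these tiers for a concrete workload, multiply the token count by the per-1K rate. The sketch below does that with the estimates listed above. Those figures are this guide's approximations and may not match current pricing, and the dictionary keys are labels chosen for this example rather than verified model identifiers.

```python
# Per-1K-token estimates quoted in the list above; treat them as placeholders.
EST_COST_PER_1K_TOKENS = {
    "gemini-2.0-flash-lite": 0.00001,
    "gemini-2.0-flash": 0.00015,
    "gemini-1.5-pro": 0.00125,
}


def estimate_cost(model: str, tokens: int) -> float:
    """Estimated USD cost for `tokens` tokens at the quoted per-1K rate."""
    return tokens / 1000 * EST_COST_PER_1K_TOKENS[model]


for name in EST_COST_PER_1K_TOKENS:
    print(f"{name}: ${estimate_cost(name, 100_000):.4f} per 100K tokens")
```

At these rates the gap between tiers is more than two orders of magnitude, which is why routing simple tasks to the lightest model that handles them tends to dominate other savings.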

Context Caching Strategy

# When reusing large context repeatedly, caching pays off significantly
cache = genai.caching.CachedContent.create(
    model="gemini-1.5-pro",
    contents=[
        "The following is the full product specification you should use for analysis:\n"
        + open("large_spec.txt").read()
    ],
    ttl=datetime.timedelta(hours=1)
)
 
# Use the cache for multiple queries efficiently
model_with_cache = genai.GenerativeModel.from_cached_content(cache)
for question in questions:
    response = model_with_cache.generate_content(question)
    # Cached context tokens are billed at a lower rate

Batch Processing for Efficiency

async def batch_process_images(image_paths: list[str], task: str) -> list[str]:
    """Process multiple images in parallel"""
    model = genai.GenerativeModel("gemini-2.0-flash")
 
    async def process_one(path: str) -> str:
        img = Image.open(path)
        # Await the async variant; the blocking call would run the requests one by one
        response = await model.generate_content_async([task, img])
        return response.text
 
    # Parallel execution (max 10 concurrent recommended)
    semaphore = asyncio.Semaphore(10)
    async def limited_process(path):
        async with semaphore:
            return await process_one(path)
 
    results = await asyncio.gather(*[limited_process(p) for p in image_paths])
    return results
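
A semaphore caps concurrency only when the work inside it is actually awaited. The self-contained sketch below demonstrates the pattern with a stand-in coroutine in place of the API call and records the peak number of calls in flight. `run_limited` and `fake_call` are hypothetical names used for illustration.

```python
import asyncio


async def run_limited(items: list[str], limit: int) -> tuple[list[str], int]:
    """Process items concurrently with at most `limit` in flight; return results and peak."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_call(item: str) -> str:
        # Stand-in for an awaited API call such as generate_content_async.
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)
            in_flight -= 1
            return item.upper()

    results = await asyncio.gather(*(fake_call(i) for i in items))
    return list(results), peak


results, peak = asyncio.run(run_limited([f"img{i}" for i in range(25)], limit=10))
print(len(results), peak)  # 25 10
```

`asyncio.gather` returns results in input order, so the output list lines up with the list of image paths even though the calls finish at different times.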

Chapter 7: Error Handling and Production Operation

Robust API Client Implementation

import google.generativeai as genai
from google.api_core import retry, exceptions
 
class RobustGeminiClient:
    def __init__(self, api_key: str, model_name: str = "gemini-2.0-flash"):
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel(model_name)
 
    @retry.Retry(
        predicate=retry.if_exception_type(
            exceptions.ResourceExhausted,   # Rate limit
            exceptions.ServiceUnavailable,  # Transient errors
        ),
        initial=1.0,
        maximum=60.0,
        multiplier=2.0,
        deadline=300.0  # Max 5 minutes of retries
    )
    def generate(self, inputs: list, **kwargs) -> str:
        response = self.model.generate_content(inputs, **kwargs)
        return response.text
 
    def generate_with_fallback(self, inputs: list, fallback_model: str = "gemini-1.5-flash") -> str:
        try:
            return self.generate(inputs)
        except exceptions.InvalidArgument:
            fallback = genai.GenerativeModel(fallback_model)
            return fallback.generate_content(inputs).text
