Gemini 2.0 Multimodal and Live API: The Complete Guide
Gemini 2.0 was designed from the ground up as a genuinely multimodal AI — not just a language model with vision tacked on. It understands images, video, and audio simultaneously, and with the Live API, it supports real-time, bidirectional streaming conversations. This opens up a category of applications that simply wasn't possible with earlier generation models.
1. Multimodal Processing in Gemini 2.0
Supported Input Modalities
Gemini 2.0 Flash (and Gemini 2.0 Pro) support the following input types:
- Text: Natural language, code, structured data
- Images: JPEG, PNG, WebP, GIF (static)
- Video: MP4, MOV, AVI, and other major formats (up to 1 hour)
- Audio: MP3, WAV, FLAC, AAC, and more
- PDF: Document understanding including text extraction
These can be sent individually or in any combination in a single request.
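As a minimal sketch of how such a mixed request might be assembled, the helper below pairs raw file bytes with a MIME type and appends the text prompt. The result has the same list-of-parts shape that the examples in this guide pass to generate_content. The names MIME_BY_SUFFIX, inline_part, and build_contents are illustrative assumptions rather than SDK functions, and the mapping covers only a few of the formats listed above.

```python
from pathlib import Path

# Illustrative subset of MIME types (an assumption); extend it for the formats you send.
MIME_BY_SUFFIX = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".gif": "image/gif",
    ".mp4": "video/mp4",
    ".pdf": "application/pdf",
}


def inline_part(filename: str, data: bytes) -> dict:
    """Return a {"mime_type", "data"} dict for one attachment (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MIME_BY_SUFFIX:
        raise ValueError(f"Unsupported file type: {suffix or filename}")
    return {"mime_type": MIME_BY_SUFFIX[suffix], "data": data}


def build_contents(prompt: str, attachments: dict[str, bytes]) -> list:
    """Combine any number of attachments with a single text prompt."""
    parts = [inline_part(name, data) for name, data in attachments.items()]
    parts.append(prompt)  # the text instruction goes last, after the media parts
    return parts
```

The returned list could then be handed to model.generate_content(...) in place of a hand-written list.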
Image Understanding
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")
# Read a local image file
image_path = Path("screenshot.png")
image_data = image_path.read_bytes()
response = model.generate_content([
{
"mime_type": "image/png",
"data": image_data
},
"Analyze this screenshot's UI and identify five specific UX improvements."
])
print(response.text)
Using Base64 encoding
import base64
with open("chart.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")
response = model.generate_content([
    {
        "inline_data": {
            "mime_type": "image/jpeg",
            "data": image_data
        }
    },
    "Extract the data from this chart in tabular format, then analyze the key trends."
])
Video Understanding
Gemini 2.0 can process video files directly — summarizing content, finding specific scenes, or answering questions about what's in the video.
# Upload and process a video using the File API
import google.generativeai as genai
import time
genai.configure(api_key="YOUR_GEMINI_API_KEY")
# Upload the video file
print("Uploading video...")
video_file = genai.upload_file("tutorial_video.mp4")
# Wait for processing to complete
while video_file.state.name == "PROCESSING":
    print("Processing...")
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")
# Query the video
model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content([
    video_file,
    "Please summarize this video in the following format:\n"
    "1. Main theme\n"
    "2. Key points (3–5)\n"
    "3. Conclusion or next actions"
])
print(response.text)
# Clean up the uploaded file
genai.delete_file(video_file.name)
Audio Processing
# Transcribe and analyze an audio recording
with open("meeting_recording.mp3", "rb") as f:
    audio_data = f.read()
response = model.generate_content([
    {
        "inline_data": {
            "mime_type": "audio/mp3",
            "data": base64.b64encode(audio_data).decode("utf-8")
        }
    },
    "Transcribe this meeting recording and extract:\n"
    "1. Decisions made\n"
    "2. Action items (with owner and deadline)\n"
    "3. Open questions"
])
2. Combined Multimodal Requests
Processing Multiple Images in One Request
Gemini 2.0 can receive and reason about multiple images in a single request.
# Compare two images
before_image = Path("before.png").read_bytes()
after_image = Path("after.png").read_bytes()
response = model.generate_content([
"Before image:",
{"mime_type": "image/png", "data": before_image},
"After image:",
{"mime_type": "image/png", "data": after_image},
"Compare these two images in detail. List every design change, addition, and removal you can identify."
])
Image + Text + Structured Output
import json
# Extract structured data from a product image
with open("product.jpg", "rb") as f:
    product_image = base64.b64encode(f.read()).decode("utf-8")
response = model.generate_content(
    contents=[
        {
            "inline_data": {
                "mime_type": "image/jpeg",
                "data": product_image
            }
        },
        """Extract information from this product image in the following JSON format:
{
"product_name": "product name",
"category": "category",
"colors": ["color1", "color2"],
"estimated_price_range": "price range (e.g., $10–$50)",
"features": ["feature1", "feature2", "feature3"],
"target_audience": "target customer",
"condition": "new or used"
}
Output JSON only."""
    ],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json"
    )
)
product_data = json.loads(response.text)
print(product_data)
3. Real-Time Streaming with the Live API
What Is the Live API?
The Live API enables bidirectional real-time streaming with Gemini 2.0. It supports streaming audio, text, and images with low latency, making it possible to build truly conversational, real-time multimodal applications.
Key characteristics:
- WebSocket-based bidirectional communication
- Real-time voice input and voice output
- Real-time processing of screen share or camera feeds
- Support for user interruptions
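To make "bidirectional" concrete, here is a small self-contained simulation of the pattern: one loop sends, another receives, and both run at the same time over a shared channel. It does not use the Live API at all. The two asyncio queues stand in for the WebSocket, and duplex_demo and fake_server are hypothetical names chosen for this sketch.

```python
import asyncio
from typing import Callable


async def duplex_demo(user_turns: list[str], server_reply: Callable[[str], str]) -> list[str]:
    """Simulate a duplex channel with independent send and receive loops."""
    uplink: asyncio.Queue = asyncio.Queue()    # client -> server
    downlink: asyncio.Queue = asyncio.Queue()  # server -> client
    received: list[str] = []

    async def client_send() -> None:
        for turn in user_turns:
            await uplink.put(turn)
        await uplink.put(None)  # sentinel: the client has nothing more to send

    async def fake_server() -> None:
        # Stand-in for the remote model: reply to each message as it arrives.
        while (message := await uplink.get()) is not None:
            await downlink.put(server_reply(message))
        await downlink.put(None)  # sentinel: the server is done

    async def client_receive() -> None:
        while (message := await downlink.get()) is not None:
            received.append(message)

    # Sending and receiving proceed concurrently, as they would over a real socket.
    await asyncio.gather(client_send(), fake_server(), client_receive())
    return received
```

The real Live API examples later in this section follow the same shape: a sender coroutine and a receiver coroutine combined with asyncio.gather.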
Text Streaming
Start with the simplest case: streaming text responses.
# Standard streaming via generate_content (no Live API session required)
model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    "Explain how quantum computers work in detail.",
    stream=True
)
for chunk in response:
    if chunk.text:
        print(chunk.text, end="", flush=True)
print()
Live API (WebSocket)
The Live API provides more advanced real-time capabilities.
import asyncio
# Note: the Live API examples use the google-genai SDK (pip install google-genai)
from google import genai
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
async def live_conversation():
    """Real-time conversation using the Live API."""
    config = {
        "response_modalities": ["TEXT"],
        "system_instruction": "You are a helpful and friendly assistant."
    }
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=config
    ) as session:
        print("Live API session started")
        # Send a text message
        await session.send(input="Hello! Can you tell me about today's weather?", end_of_turn=True)
        # Receive the response, printing the label once and then each chunk as it arrives
        print("Gemini: ", end="")
        async for response in session.receive():
            if response.text:
                print(response.text, end="", flush=True)
            if response.server_content and response.server_content.turn_complete:
                print()
                break
asyncio.run(live_conversation())
Voice-In, Voice-Out with the Live API
import asyncio
import pyaudio
from google import genai
from google.genai import types
SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK = 1024
async def voice_conversation():
    """Real-time voice conversation using the Live API."""
    client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"
                )
            )
        )
    )
    audio = pyaudio.PyAudio()
    # Microphone input stream (16 kHz, 16-bit mono PCM)
    input_stream = audio.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK
    )
    # Speaker output stream (played back at 24 kHz)
    output_stream = audio.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=24000,
        output=True
    )
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=config
    ) as session:
        async def send_audio():
            """Stream microphone audio to Gemini."""
            while True:
                # Read in a worker thread so the blocking call does not stall the event loop
                data = await asyncio.to_thread(
                    input_stream.read, CHUNK, exception_on_overflow=False
                )
                await session.send(input={"data": data, "mime_type": "audio/pcm"})
        async def receive_audio():
            """Play back audio from Gemini."""
            # Assumption: receive() finishes after each model turn, so loop to keep listening
            while True:
                async for response in session.receive():
                    if response.data:
                        await asyncio.to_thread(output_stream.write, response.data)
        # Run send and receive concurrently
        await asyncio.gather(
            send_audio(),
            receive_audio()
        )
asyncio.run(voice_conversation())
4. Combining Real-Time Image Processing
Real-Time Camera Analysis
Combining the Live API with camera input enables real-time visual understanding.
import asyncio
import cv2
import base64
from google import genai
from google.genai import types
async def realtime_camera_analysis():
    """Continuously analyze camera frames with Gemini."""
    client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
    config = types.LiveConnectConfig(
        response_modalities=["TEXT"],
        system_instruction="Continuously analyze what you see in the camera feed. "
        "Report any significant changes."
    )
    cap = cv2.VideoCapture(0)
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=config
    ) as session:
        frame_count = 0
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            # Send roughly one frame per second, assuming a 30 fps camera
            if frame_count % 30 == 0:
                _, buffer = cv2.imencode(".jpg", frame)
                image_data = base64.b64encode(buffer).decode("utf-8")
                await session.send(
                    input=[
                        {"mime_type": "image/jpeg", "data": image_data},
                        "What do you see in this frame?"
                    ],
                    end_of_turn=True
                )
                async for response in session.receive():
                    if response.text:
                        print(f"Analysis: {response.text}")
                    if response.server_content and response.server_content.turn_complete:
                        break
            frame_count += 1
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
    cap.release()
asyncio.run(realtime_camera_analysis())
5. Best Practices and Implementation Notes
File Size and Rate Limits
The Gemini API has limits on file processing.
- Inline data: Up to 20 MB per request
- File API: Up to 2 GB per file (stored for 48 hours)
- Video token cost: roughly 260 to 300 tokens per second of video at default resolution, which works out to about 16,000 to 18,000 tokens per minute (approximate)
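The size limits above suggest a simple routing rule: small payloads go inline, larger ones go through the File API, and anything beyond that has to be split or trimmed first. The sketch below encodes that rule under two assumptions: that "MB" and "GB" mean binary units, and that the payload is the only significant part of the request. The name choose_transport is hypothetical.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024    # 20 MB per request (inline data)
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3     # 2 GB per file (File API)


def choose_transport(size_bytes: int) -> str:
    """Pick how to send a media payload based on its size in bytes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= FILE_API_LIMIT_BYTES:
        return "file_api"
    return "too_large"  # split or trim the file before uploading
```

In practice you might leave some headroom below the inline limit, since the prompt text and any other parts count toward the same request.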
Managing Live API Connections
Live API sessions can drop. Implement reconnection logic for production use.
import asyncio
from google import genai
MAX_RETRIES = 3
RETRY_DELAY = 2.0
async def robust_live_session(prompt: str) -> str:
    """Live API session with reconnection logic."""
    client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
    for attempt in range(MAX_RETRIES):
        try:
            async with client.aio.live.connect(
                model="gemini-2.0-flash-live-001",
                config={"response_modalities": ["TEXT"]}
            ) as session:
                await session.send(input=prompt, end_of_turn=True)
                chunks = []
                async for response in session.receive():
                    if response.text:
                        # Collect every chunk instead of returning after the first one
                        chunks.append(response.text)
                    if response.server_content and response.server_content.turn_complete:
                        break
                return "".join(chunks)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < MAX_RETRIES - 1:
                # Exponential backoff: 2 s after the first failure, 4 s after the second
                await asyncio.sleep(RETRY_DELAY * (2 ** attempt))
            else:
                raise
Cost Management
Multimodal processing and Live API sessions consume more tokens than text-only requests. A few habits that help manage costs:
- Trim video to only the relevant segments before sending
- Compress images appropriately before API calls (1 MB is often more than enough)
- Don't leave Live API sessions idle — idle time also incurs costs
- Use gemini-2.0-flash during development; choose your production model based on actual requirements
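One way to avoid paying for idle sessions is to stop reading from a stream once nothing has arrived for a while, then close the session. The sketch below wraps any async iterator with such an idle timeout. It is a generic asyncio pattern rather than a Live API feature, and with_idle_timeout is a hypothetical helper name.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def with_idle_timeout(stream: AsyncIterator[T], idle_seconds: float) -> AsyncIterator[T]:
    """Yield items from `stream`, stopping once no item arrives for `idle_seconds`."""
    iterator = stream.__aiter__()
    while True:
        try:
            # Wait for the next item, but give up if the stream stays silent too long.
            item = await asyncio.wait_for(iterator.__anext__(), timeout=idle_seconds)
        except asyncio.TimeoutError:
            return  # idle for too long: stop so the caller can close the session
        except StopAsyncIteration:
            return  # the stream ended normally
        yield item
```

Wrapping the receive loop of a session in this helper, for example with a 30-second limit, would let the surrounding async with block exit and release the connection.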
Wrapping up
Gemini 2.0's multimodal capabilities and the Live API unlock a category of applications that text-only models simply can't deliver. Whether it's analyzing images and documents, understanding the contents of a video, or holding a real-time voice conversation — these tools let you build experiences that feel natural and intuitive to users.
Start simple: a basic image understanding endpoint is a great first step. Then gradually expand toward live interaction as your application matures.
Setup and context: What Multimodal AI Changes for Developers
From text-only AI to multimodal AI that combines images, audio, and video — the Gemini API sits at the frontier of this evolution, offering developers a fundamentally new level of expressive power.
Chapter 1: Gemini's Multimodal Architecture
What "Natively Multimodal" Means
Most earlier AI systems processed text, images, and audio through separate specialized models. Gemini is built from the ground up as a natively multimodal model — all modalities are processed in an integrated way.
Traditional architecture:
Input image → Image recognition model → Text description → LLM → Answer
Gemini architecture:
Input image ─┐
Input text ─┼─→ Gemini (integrated processing) ─→ Answer
Input audio ─┘
This difference is more significant than it might appear. Native integration enables genuine cross-modal understanding. For example, "Quote what the smiling person in this image is saying" can be handled by Gemini by naturally combining visual and linguistic information in a single pass.
Supported Input Types
| Input Type | Supported Models | Max Size/Duration |
|-----------|-----------------|------------------|
| Text | All models | 1M+ tokens |
| Images (JPEG/PNG/WEBP/GIF) | Gemini 1.5+, 2.0 | 3,600 images/request |
| Audio (MP3/WAV/FLAC etc.) | Gemini 1.5+, 2.0 | 9.5 hours |
| Video (MP4/MOV etc.) | Gemini 1.5+, 2.0 | 1 hour |
| PDF / Documents | Gemini 1.5+, 2.0 | 1,000 pages |
| Code | All models | Up to context limit |
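Checking a planned request against these limits before uploading anything can save a failed call. The following sketch copies the numeric limits from the table into a dictionary and reports which ones a request would exceed. The key names and the function find_limit_violations are assumptions made for illustration.

```python
# Limits taken from the table above; the key names are illustrative.
LIMITS = {
    "images": 3600,       # images per request
    "audio_hours": 9.5,
    "video_hours": 1.0,
    "pdf_pages": 1000,
}


def find_limit_violations(**requested: float) -> list[str]:
    """Return a message for each requested amount that exceeds its limit."""
    violations = []
    for name, amount in requested.items():
        if name not in LIMITS:
            raise KeyError(f"Unknown input type: {name}")
        if amount > LIMITS[name]:
            violations.append(f"{name}: {amount} exceeds limit of {LIMITS[name]}")
    return violations
```

An empty list means the request fits within the documented limits, for example find_limit_violations(images=12, video_hours=0.5).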
Chapter 2: Image Processing in Practice
Basic Image Input
import google.generativeai as genai
from PIL import Image
import base64
genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")
# Method 1: PIL Image object
image = Image.open("product.jpg")
response = model.generate_content([
    "Analyze this product image and write a 100-word description for an e-commerce listing.",
    image
])
print(response.text)
# Method 2: Raw bytes (for files received via API)
with open("product.jpg", "rb") as f:
    image_bytes = f.read()
response = model.generate_content([
    "List every object visible in this image.",
    {
        "mime_type": "image/jpeg",
        "data": image_bytes  # pass the raw bytes directly; no Base64 step is needed here
    }
])
Comparative Analysis of Multiple Images
# Process multiple images in a single call — where multimodal shines
before_image = Image.open("before.jpg")
after_image = Image.open("after.jpg")
response = model.generate_content([
"Compare these two images and provide a detailed analysis of what changed.",
"Before:", before_image,
"After:", after_image,
"Analyze specifically:\n1. Elements added\n2. Elements removed\n3. Elements modified"
])
Extracting Structured Data from Images
import json
# Extract structured data from receipts or forms
receipt_image = Image.open("receipt.jpg")
response = model.generate_content([
    """Extract data from this receipt in the following JSON format:
{
"store_name": "store name",
"date": "YYYY-MM-DD",
"items": [
{"name": "item name", "quantity": quantity, "price": price}
],
"subtotal": subtotal,
"tax": tax_amount,
"total": total
}
Output JSON only, with no surrounding text.""",
    receipt_image
])
try:
    data = json.loads(response.text)
    print(f"Total: ${data['total']:.2f}")
except json.JSONDecodeError:
    # The model returned something other than bare JSON; show it for inspection
    print(response.text)
Chapter 3: Audio Processing
Transcription and Analysis
import google.generativeai as genai
model = genai.GenerativeModel("gemini-2.0-flash")
# Upload and process an audio file
audio_file = genai.upload_file("meeting_recording.mp3")
response = model.generate_content([
audio_file,
"""Process this audio in the following format:
1. Full transcript (distinguish speakers as [Speaker A]:, [Speaker B]:)
2. Meeting summary (5 points or fewer)
3. Action items (with owner and deadline)
4. Suggested agenda items for the next meeting"""
])
Streaming Audio Processing Pattern
async def process_audio_stream(audio_bytes: bytes):
    """Yield transcript text as the model streams it back."""
    model = genai.GenerativeModel("gemini-2.0-flash")
    # generate_content_async is the awaitable counterpart of generate_content;
    # the synchronous call cannot be consumed with "async for"
    response = await model.generate_content_async(
        [
            {"mime_type": "audio/mp3", "data": audio_bytes},
            "Transcribe this audio in real time."
        ],
        stream=True
    )
    async for chunk in response:
        if chunk.text:
            yield chunk.text  # Stream output as it arrives
Chapter 4: Video Analysis in Practice
Analyzing Video Content
import time
# Upload video via Files API (for larger files)
video_file = genai.upload_file("product_demo.mp4")
# Wait for processing to complete
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")
# Analyze the video
model = genai.GenerativeModel("gemini-1.5-pro")  # Pro recommended for long videos
response = model.generate_content([
    video_file,
    """Analyze this product demo video:
1. List of key features demonstrated (with timestamps)
2. User interface evaluation
3. Top 3 improvement suggestions
4. Standout scenes suitable for marketing materials (with timestamps)"""
])
Combined Video + Text Queries
# Direct attention to a specific scene
response = model.generate_content([
video_file,
"""Focus on the scene around the 0:45 mark:
- Describe in detail what's happening
- Infer the user's emotional state at that moment
- Provide a concrete improvement suggestion for this scene"""
])
Chapter 5: Designing Integrated Multimodal Systems
Use Case: Automated Product Quality Inspection
import json
from typing import Optional
class ProductQualityInspector:
    """Quality inspection system integrating product images, spec sheets, and demo video"""
    def __init__(self):
        self.model = genai.GenerativeModel("gemini-2.0-flash")
    def inspect(
        self,
        product_images: list,
        spec_pdf: str,
        demo_video: Optional[str] = None
    ) -> dict:
        # Upload spec sheet
        spec_file = genai.upload_file(spec_pdf)
        inputs = [spec_file, "Inspect this product against the specification document.\n\n"]
        # Add product images
        for i, img_path in enumerate(product_images):
            img = Image.open(img_path)
            inputs.extend([f"\nProduct image {i+1}:", img])
        # Add demo video if available (wait for processing as in Chapter 4 before querying)
        if demo_video:
            video_file = genai.upload_file(demo_video)
            inputs.extend(["\nOperation demo video:", video_file])
        inputs.append("""
Output inspection results in this JSON format:
{
"overall_status": "pass|fail|review_required",
"score": 0-100,
"defects": [{"description": "...", "severity": "critical|major|minor", "location": "..."}],
"spec_compliance": {"compliant": true/false, "deviations": ["deviation items"]},
"recommendations": ["recommendation 1", ...]
}
""")
        # Request JSON output so that json.loads does not trip over surrounding prose
        response = self.model.generate_content(
            inputs,
            generation_config=genai.GenerationConfig(response_mime_type="application/json")
        )
        return json.loads(response.text)
Use Case: Multimedia Content Generation Pipeline
import asyncio
async def generate_content_from_media(
    source_video: str,
    target_formats: list[str]
) -> dict:
    """
    Auto-generate multiple content types from a single video:
    - Blog article
    - Social media posts
    - YouTube description
    - Subtitle file (SRT)
    """
    model = genai.GenerativeModel("gemini-1.5-pro")
    video = genai.upload_file(source_video)
    # Wait until the uploaded video has finished processing
    while video.state.name == "PROCESSING":
        await asyncio.sleep(5)
        video = genai.get_file(video.name)
    results = {}
    prompts = {
        "blog": "Write a ~500-word blog article based on this video's content",
        "sns": "Write 3 variations of a social media post (280 chars max each) capturing this video's appeal",
        "youtube_desc": "Write a YouTube description with chapters, approximately 200 words",
        "srt": "Generate a subtitle file for this video in SRT format"
    }
    for format_type in target_formats:
        # Use the async call so the coroutine does not block the event loop
        response = await model.generate_content_async([video, prompts[format_type]])
        results[format_type] = response.text
    return results
Chapter 6: Cost Optimization
Model Selection Guide
By task:
- Fast & lightweight (Gemini 2.0 Flash Lite):
→ Simple text generation, basic classification
Est. cost: ~$0.00001/1K tokens
- Standard (Gemini 2.0 Flash):
→ Everyday document processing, image captioning
Est. cost: ~$0.00015/1K tokens (text)
- High-precision (Gemini 1.5 Pro):
→ Complex reasoning, long documents, video analysis
Est. cost: ~$0.00125/1K tokens (up to 128K)
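To compare tiers for a given workload, the estimates above can be turned into a small calculator. The rates below are copied from this guide's estimates and should be treated as placeholders to replace with current pricing. The tier keys and the functions estimate_cost and cheapest_tier are hypothetical names for this sketch.

```python
# Estimated USD per 1K tokens, copied from the guide above (placeholders, not live pricing).
EST_RATE_PER_1K = {
    "flash-lite": 0.00001,
    "flash": 0.00015,
    "pro": 0.00125,
}


def estimate_cost(tier: str, tokens: int) -> float:
    """Estimate the cost in USD of processing `tokens` tokens on a tier."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens / 1000 * EST_RATE_PER_1K[tier]


def cheapest_tier(tokens: int, allowed: list[str]) -> str:
    """Return the lowest-cost tier among those able to handle the task."""
    return min(allowed, key=lambda tier: estimate_cost(tier, tokens))
```

The allowed list is where task requirements enter: for long-video analysis you might pass only the tiers you have verified can handle it, then let cost decide among them.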
Context Caching Strategy
import datetime
from google.generativeai import caching
# When reusing large context repeatedly, caching pays off significantly
with open("large_spec.txt") as f:
    spec_text = f.read()
cache = caching.CachedContent.create(
    # Caching may require an explicitly versioned model name; check the current docs
    model="gemini-1.5-pro",
    contents=[
        "The following is the full product specification you should use for analysis:\n"
        + spec_text
    ],
    ttl=datetime.timedelta(hours=1)
)
# Use the cache for multiple queries efficiently
model_with_cache = genai.GenerativeModel.from_cached_content(cache)
questions = ["Which materials does the spec allow?"]  # placeholder questions
for question in questions:
    response = model_with_cache.generate_content(question)
    # Cached context tokens are billed at a lower rate
Batch Processing for Efficiency
import asyncio
async def batch_process_images(image_paths: list[str], task: str) -> list[str]:
    """Process multiple images concurrently"""
    model = genai.GenerativeModel("gemini-2.0-flash")
    # Cap concurrency (max 10 concurrent recommended)
    semaphore = asyncio.Semaphore(10)
    async def process_one(path: str) -> str:
        async with semaphore:
            img = Image.open(path)
            # The async call lets requests overlap; the sync call would run them one by one
            response = await model.generate_content_async([task, img])
            return response.text
    results = await asyncio.gather(*[process_one(p) for p in image_paths])
    return results
Chapter 7: Error Handling and Production Operation
Robust API Client Implementation
import google.generativeai as genai
from google.api_core import retry, exceptions
class RobustGeminiClient:
    def __init__(self, api_key: str, model_name: str = "gemini-2.0-flash"):
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel(model_name)
    @retry.Retry(
        predicate=retry.if_exception_type(
            exceptions.ResourceExhausted,  # Rate limit
            exceptions.ServiceUnavailable,  # Transient errors
        ),
        initial=1.0,
        maximum=60.0,
        multiplier=2.0,
        deadline=300.0  # Max 5 minutes of retries
    )
    def generate(self, inputs: list, **kwargs) -> str:
        response = self.model.generate_content(inputs, **kwargs)
        return response.text
    def generate_with_fallback(self, inputs: list, fallback_model: str = "gemini-1.5-flash") -> str:
        try:
            return self.generate(inputs)
        except exceptions.InvalidArgument:
            # The primary model rejected the request; retry once on the fallback model
            fallback = genai.GenerativeModel(fallback_model)
            return fallback.generate_content(inputs).text