API / SDK · 2026-03-30 · Advanced

Multimodal RAG with Gemini API — Cross-Format Search over Images, PDFs, and Video

Build a production-grade multimodal RAG pipeline with Gemini 2.5 Pro: unified vector search across text, images, PDFs, and video with cost optimization and scaling patterns.

Knowledge You Cannot Search Might As Well Not Exist

Traditional RAG (Retrieval-Augmented Generation) systems only handle text, but real-world knowledge exists in many formats. Design documents in PDF, whiteboard photos, meeting recordings, spreadsheet charts — if your AI assistant can't search across all of these, its practical utility is limited.

Gemini 2.5 Pro processes text, images, PDFs, video, and audio in a single model. Paired with the Embeddings API, it lets you build a multimodal RAG pipeline: Gemini turns each non-text document into a text description, and those descriptions are embedded alongside ordinary text so that every format is searchable in one vector space.

What follows traces document processing, vector index construction, and the search-generation pipeline in working Python. It assumes familiarity with Function Calling fundamentals — start there if agent tool use is new to you.

Architecture Design

The multimodal RAG pipeline consists of four phases:

Ingest: Accept various file types and split them into processable chunks
Embed: Convert each chunk to a vector using the Gemini Embeddings API
Index: Store vectors in a database for fast retrieval
Query: Search for relevant chunks and generate answers with Gemini

# Pipeline overview
# DocumentProcessor → EmbeddingService → VectorStore → QueryEngine
 
from dataclasses import dataclass
from enum import Enum
 
class DocumentType(Enum):
    TEXT = "text"
    PDF = "pdf"
    IMAGE = "image"
    VIDEO = "video"
 
@dataclass
class DocumentChunk:
    """Processed document chunk"""
    chunk_id: str
    source_file: str
    doc_type: DocumentType
    content_text: str          # Text representation for search
    content_description: str   # Gemini-generated description (for images/video)
    metadata: dict             # Page numbers, timestamps, etc.
    embedding: list[float] | None = None
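
The Ingest phase needs some way to route each incoming file to the matching processor. A minimal sketch, assuming routing by file extension: the `EXTENSION_MAP` table and the `detect_document_type` helper are hypothetical names introduced for illustration, and the extension list is an example rather than a complete inventory.

```python
from enum import Enum
from pathlib import Path


class DocumentType(Enum):
    # Mirrors the enum defined above so this snippet runs on its own
    TEXT = "text"
    PDF = "pdf"
    IMAGE = "image"
    VIDEO = "video"


# Hypothetical extension table (an assumption; extend it for your own corpus)
EXTENSION_MAP = {
    ".txt": DocumentType.TEXT,
    ".md": DocumentType.TEXT,
    ".pdf": DocumentType.PDF,
    ".png": DocumentType.IMAGE,
    ".jpg": DocumentType.IMAGE,
    ".jpeg": DocumentType.IMAGE,
    ".mp4": DocumentType.VIDEO,
    ".mov": DocumentType.VIDEO,
}


def detect_document_type(path: str) -> DocumentType:
    """Map a file path to a DocumentType by its case-insensitive extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_MAP:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return EXTENSION_MAP[suffix]
```

Failing loudly on unknown extensions is a deliberate choice here: silently treating an unsupported file as text would put unreadable content into the index.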

Document Processing Pipeline

Implement document processors for each file format.

Text and PDF Processing

import google.generativeai as genai
from pathlib import Path
 
genai.configure(api_key="YOUR_GEMINI_API_KEY")
 
class DocumentProcessor:
    """Multimodal document processor"""
 
    def __init__(self, model_name: str = "gemini-2.5-pro"):
        self.model = genai.GenerativeModel(model_name)
        self.chunk_size = 1000  # Characters per chunk
        self.chunk_overlap = 200
 
    def process_text(self, text: str, source: str) -> list[DocumentChunk]:
        """Split text into overlapping chunks"""
        chunks = []
        start = 0
        chunk_idx = 0
 
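        # The stride must be positive or the loop below never advances
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")

        while start < len(text):
            end = min(start + self.chunk_size, len(text))
            chunk_text = text[start:end]

            chunks.append(DocumentChunk(
                chunk_id=f"{source}__chunk_{chunk_idx}",
                source_file=source,
                doc_type=DocumentType.TEXT,
                content_text=chunk_text,
                content_description=chunk_text[:200],
                metadata={"chunk_index": chunk_idx, "char_start": start},
            ))

            # Stop once the end of the text is reached; stepping again would
            # emit a trailing chunk that this one already fully covers
            if end == len(text):
                break

            start += self.chunk_size - self.chunk_overlap
            chunk_idx += 1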
 
        return chunks
 
    async def process_pdf(self, pdf_path: str) -> list[DocumentChunk]:
        """Analyze PDF with Gemini Vision and split into page-level chunks"""
        pdf_file = genai.upload_file(pdf_path)
 
        # Extract per-page content using Gemini
        response = await self.model.generate_content_async([
            pdf_file,
            "Transcribe each page of this PDF in detail, including page numbers. "
            "If there are figures or tables, describe their contents. "
            "Separate each page with '--- Page N ---'."
        ])
 
        # Create chunks per page; anything before the first marker is
        # model preamble, not page content, so it is skipped
        pages = response.text.split("--- Page ")
        chunks = []

        for page in pages[1:]:
            # Each piece looks like "N ---\n<page text>"
            header, _, page_text = page.partition("\n")
            page_num = header.replace("---", "").strip()
 
            if page_text.strip():
                chunks.append(DocumentChunk(
                    chunk_id=f"{pdf_path}__page_{page_num}",
                    source_file=pdf_path,
                    doc_type=DocumentType.PDF,
                    content_text=page_text,
                    content_description=page_text[:200],
                    metadata={"page_number": page_num},
                ))
 
        return chunks
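
The overlap arithmetic in `process_text` is easy to get wrong at the tail of a document, so it helps to look at the window boundaries in isolation. The sketch below is a standalone illustration (the `chunk_spans` helper is a hypothetical name, not part of the pipeline) that computes only the `(start, end)` character offsets, using the same 1000-character window and 200-character overlap as the defaults above. It stops as soon as a window reaches the end of the text, so the tail is not emitted twice.

```python
def chunk_spans(
    length: int, chunk_size: int = 1000, overlap: int = 200
) -> list[tuple[int, int]]:
    """Return (start, end) offsets of overlapping windows over `length` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    spans = []
    start = 0
    while start < length:
        end = min(start + chunk_size, length)
        spans.append((start, end))
        if end == length:
            break  # The last window already covers the tail
        start += chunk_size - overlap  # Stride of 800 with the defaults
    return spans


print(chunk_spans(2500))
# → [(0, 1000), (800, 1800), (1600, 2500)]
```

Each window starts 800 characters after the previous one, so consecutive chunks share 200 characters. That shared margin is what keeps a sentence that straddles a boundary retrievable from at least one chunk.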

Image and Video Processing

    async def process_image(self, image_path: str) -> list[DocumentChunk]:
        """Analyze image with Gemini Vision to generate text representation"""
        image_file = genai.upload_file(image_path)
 
        response = await self.model.generate_content_async([
            image_file,
            "Describe this image in detail. If it contains text, include OCR results. "
            "If it's a chart or graph, summarize the data it represents."
        ])
 
        return [DocumentChunk(
            chunk_id=f"{image_path}__full",
            source_file=image_path,
            doc_type=DocumentType.IMAGE,
            content_text=response.text,
            content_description=response.text[:200],
            metadata={"image_path": image_path},
        )]
 
    async def process_video(
        self, video_path: str, interval_seconds: int = 30
    ) -> list[DocumentChunk]:
        """Analyze video in segments"""
        video_file = genai.upload_file(video_path)
 
        response = await self.model.generate_content_async([
            video_file,
            f"Divide this video into {interval_seconds}-second segments "
            "and describe each segment in detail. "
            "Format: '[MM:SS - MM:SS] Description' for each segment."
        ])
 
        chunks = []
        segments = response.text.strip().split("\n")
 
        for idx, segment in enumerate(segments):
            if segment.strip():
                chunks.append(DocumentChunk(
                    chunk_id=f"{video_path}__segment_{idx}",
                    source_file=video_path,
                    doc_type=DocumentType.VIDEO,
                    content_text=segment,
                    content_description=segment[:200],
                    metadata={"segment_index": idx},
                ))
 
        return chunks
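
The video prompt asks for lines in the form `[MM:SS - MM:SS] Description`, but the chunks above store only a segment index. If you want timestamps in `metadata`, one option is to parse them from each line. A minimal sketch, assuming the model follows the requested format closely: the `parse_segment` helper and its regular expression are illustrative assumptions, and lines that do not match (for example a preamble sentence) return `None` so the caller can skip them.

```python
import re
from typing import Optional

# Matches "[MM:SS - MM:SS] description"; minutes may run past 59 for long videos
SEGMENT_RE = re.compile(r"^\[(\d{1,3}):(\d{2})\s*-\s*(\d{1,3}):(\d{2})\]\s*(.*)$")


def parse_segment(line: str) -> Optional[tuple[int, int, str]]:
    """Parse one segment line into (start_seconds, end_seconds, description)."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        return None  # Not a segment line; the caller decides whether to skip it

    start_min, start_sec, end_min, end_sec, description = match.groups()
    start = int(start_min) * 60 + int(start_sec)
    end = int(end_min) * 60 + int(end_sec)
    return start, end, description


print(parse_segment("[00:30 - 01:00] Presenter opens the sales dashboard"))
# → (30, 60, 'Presenter opens the sales dashboard')
```

Storing the start and end offsets in seconds lets the query layer link an answer back to the exact moment in the recording.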

Vector Embedding with Gemini Embeddings API

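Embed every chunk into a shared vector space with the Gemini Embeddings API. Because images, PDFs, and video were already converted to text in the previous step, a single text embedding model covers all four document types.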

class EmbeddingService:
    """Gemini Embeddings API wrapper"""
 
    def __init__(self, model: str = "models/text-embedding-004"):
        self.model = model
        self.batch_size = 100  # API batch limit
 
    async def embed_chunks(
        self, chunks: list[DocumentChunk]
    ) -> list[DocumentChunk]:
        """Add embeddings to a list of chunks"""
        texts = [chunk.content_text for chunk in chunks]
 
        # Batch processing
        all_embeddings = []
        for i in range(0, len(texts), self.batch_size):
            batch = texts[i:i + self.batch_size]
            result = genai.embed_content(
                model=self.model,
                content=batch,
                task_type="retrieval_document",
            )
            all_embeddings.extend(result["embedding"])
 
        for chunk, embedding in zip(chunks, all_embeddings):
            chunk.embedding = embedding
 
        return chunks
 
    async def embed_query(self, query: str) -> list[float]:
        """Convert a search query to an embedding"""
        result = genai.embed_content(
            model=self.model,
            content=query,
            task_type="retrieval_query",  # Query-specific task type
        )
        return result["embedding"]
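
Both methods above are declared `async`, yet `genai.embed_content` is a synchronous call, so it blocks the event loop while the request is in flight. One way around this, sketched below under the assumption that you are on Python 3.9 or later, is to hand the blocking call to a worker thread with `asyncio.to_thread`. The `blocking_embed` function is a hypothetical stand-in that returns a fake vector so the example runs without network access. In the pipeline you would pass `genai.embed_content` and its keyword arguments instead.

```python
import asyncio
import time


def blocking_embed(text: str) -> list[float]:
    """Stand-in for a synchronous SDK call (hypothetical; returns a fake vector)."""
    time.sleep(0.05)  # Simulate network latency
    return [float(len(text)), 0.0]


async def embed_without_blocking(texts: list[str]) -> list[list[float]]:
    """Run each blocking call in a worker thread and await them together."""
    tasks = [asyncio.to_thread(blocking_embed, text) for text in texts]
    # gather() returns results in the same order as the input tasks
    return await asyncio.gather(*tasks)


vectors = asyncio.run(embed_without_blocking(["alpha", "be"]))
print(vectors)
# → [[5.0, 0.0], [2.0, 0.0]]
```

Running requests concurrently also makes it easier to hit API rate limits, so in practice you may want to cap the number of in-flight calls.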

Vector Store and Search Engine

Implement cosine similarity search. In production, use a dedicated vector DB like Pinecone, Weaviate, or pgvector — here we use NumPy to illustrate the principles.

import numpy as np
from typing import Optional
 
class VectorStore:
    """In-memory vector store (use pgvector etc. in production)"""
 
    def __init__(self):
        self.chunks: list[DocumentChunk] = []
        self.embeddings: np.ndarray | None = None
 
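    def add(self, chunks: list[DocumentChunk]):
        """Add chunks to the store, skipping any that have no embedding"""
        # search() indexes self.chunks by embedding row, so the two must stay
        # aligned: never store a chunk that has no row in self.embeddings
        self.chunks.extend(c for c in chunks if c.embedding is not None)
        if self.chunks:
            self.embeddings = np.array([c.embedding for c in self.chunks])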
 
    def search(
        self,
        query_embedding: list[float],
        top_k: int = 5,
        doc_type_filter: Optional[DocumentType] = None,
    ) -> list[tuple[DocumentChunk, float]]:
        """Search by cosine similarity"""
        if self.embeddings is None or len(self.embeddings) == 0:
            return []
 
        query_vec = np.array(query_embedding)
        similarities = np.dot(self.embeddings, query_vec) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_vec)
        )
 
        results = []
        for idx in np.argsort(similarities)[::-1]:
            chunk = self.chunks[idx]
            if doc_type_filter and chunk.doc_type != doc_type_filter:
                continue
            results.append((chunk, float(similarities[idx])))
            if len(results) >= top_k:
                break
 
        return results
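
To make the ranking step concrete, here is a dependency-free version of the same cosine similarity search over three toy 2-dimensional vectors, small enough to check by hand. It is a sketch for illustration only: the file names and vectors are made up, and the `cosine_similarity` and `top_k` helpers are hypothetical names. Unlike the NumPy expression above, it also guards against zero-length vectors, which would otherwise cause a division by zero.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: list[float], vectors: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy index (made-up values): one vector per source document
toy_index = {
    "report.pdf": [1.0, 0.0],
    "whiteboard.png": [1.0, 1.0],
    "meeting.mp4": [0.0, 1.0],
}

ranked = top_k([1.0, 0.0], toy_index)
print([name for name, _ in ranked])
# → ['report.pdf', 'whiteboard.png']
```

The query points in the same direction as `report.pdf` (similarity 1.0), sits at 45 degrees to `whiteboard.png` (about 0.707), and is orthogonal to `meeting.mp4` (0.0), which is why the video is dropped at `k=2`.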

Search and Generation Pipeline (RAG Query Engine)

Pass search results as context to Gemini for answer generation.

class RAGQueryEngine:
    """Multimodal RAG query engine"""
 
    def __init__(
        self,
        embedding_service: EmbeddingService,
        vector_store: VectorStore,
        model_name: str = "gemini-2.5-pro",
    ):
        self.embedding_service = embedding_service
        self.vector_store = vector_store
        self.model = genai.GenerativeModel(model_name)
 
    async def query(
        self,
        question: str,
        top_k: int = 5,
        doc_type_filter: Optional[DocumentType] = None,
    ) -> dict:
        """Execute RAG search + generation for a question"""
 
        # Step 1: Embed the query
        query_embedding = await self.embedding_service.embed_query(question)
 
        # Step 2: Vector search
        results = self.vector_store.search(
            query_embedding, top_k=top_k,
            doc_type_filter=doc_type_filter,
        )
 
        if not results:
            return {
                "answer": "No relevant documents found.",
                "sources": [],
            }
 
        # Step 3: Build context
        context_parts = []
        sources = []
        for chunk, score in results:
            context_parts.append(
                f"[Source: {chunk.source_file} | Type: {chunk.doc_type.value} | "
                f"Score: {score:.3f}]\n{chunk.content_text}"
            )
            sources.append({
                "file": chunk.source_file,
                "type": chunk.doc_type.value,
                "relevance": round(score, 3),
                "excerpt": chunk.content_description,
            })
 
        context = "\n\n---\n\n".join(context_parts)
 
        # Step 4: Generate answer with Gemini
        prompt = f"""Based on the search results below, answer the user's question accurately.
Cite the sources you used. If you need to speculate beyond the search results,
clearly state that you are doing so.
 
[Search Results]
{context}
 
[Question]
{question}"""
 
        response = await self.model.generate_content_async(prompt)
 
        return {
            "answer": response.text,
            "sources": sources,
            "tokens_used": response.usage_metadata.total_token_count,
        }
 
# Usage
# engine = RAGQueryEngine(embedding_service, vector_store)
# result = await engine.query("Which region grew the most in last month's sales report chart?")
# print(result["answer"])
# print(f"Sources: {result['sources']}")

To reduce costs with context caching, see the Context Caching Guide.

Cost Optimization and Scaling

Batch Processing and Caching Strategies

Embedding cache: Persist generated embeddings in your database and only recompute when documents are updated.
Context Caching: Use Gemini's Context Caching API when including large sets of document chunks as context, reducing costs for repeated queries.
Model tiering: Use text-embedding-004 for embeddings, gemini-2.5-flash for initial relevance filtering, and gemini-2.5-pro for final answer generation.
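
The embedding cache in the first item can be sketched as a content-addressed lookup: hash the chunk text and call the embedding API only when that hash has not been seen before. This is a minimal in-memory illustration, not a production design. The `EmbeddingCache` class and `fake_embed` function are hypothetical names, and the dictionary stands in for whatever database table you would use.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Content-addressed cache: identical text is embedded only once."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}  # Swap for a database table in production
        self.misses = 0  # Number of times the embedding function was actually called

    @staticmethod
    def _key(text: str) -> str:
        # Hashing the content means an edited document gets a new key automatically
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    """Stand-in for the real embedding call (hypothetical; returns a fake vector)."""
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.get("quarterly report")     # Miss: computed and stored
cache.get("quarterly report")     # Hit: served from the cache
cache.get("quarterly report v2")  # Content changed, so it is re-embedded
print(cache.misses)
# → 2
```

Keying on the content hash rather than the file name means unchanged chunks of an updated document are not recomputed, while any edited chunk is picked up without extra bookkeeping.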

# Cost-optimized query: Flash for filtering → Pro for final answer
async def cost_optimized_query(engine: RAGQueryEngine, question: str):
    # Step 1: Retrieve a wider candidate set from the vector store
    flash_model = genai.GenerativeModel("gemini-2.5-flash")  # Used for reranking in Step 2
    results = engine.vector_store.search(
        await engine.embedding_service.embed_query(question),
        top_k=10,  # Retrieve more candidates than the final answer needs
    )
 
    # Step 2: Rerank with Flash (cheaper)
    rerank_prompt = f"From the following document fragments, select the top 3 most relevant to the question: '{question}'"
    # ... abbreviated
 
    # Step 3: Final answer with Pro (high quality, higher cost)
    # Use only the filtered 3 results as context
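
The reranking step is abbreviated above. One possible shape for it, offered as an assumption rather than the original implementation, is to number the candidate fragments in the prompt, ask the cheaper model to reply with fragment numbers only, and then parse that reply defensively. The `build_rerank_prompt` and `parse_rerank_reply` helpers below are hypothetical and contain no API calls, so they can be tested without a model. The reply strings in the usage lines are made-up examples.

```python
def build_rerank_prompt(question: str, fragments: list[str]) -> str:
    """Number each fragment so the model can answer with indices only."""
    numbered = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(fragments))
    return (
        "From the following document fragments, select the top 3 most relevant "
        f"to the question: '{question}'. "
        "Reply with the fragment numbers only, comma-separated.\n\n" + numbered
    )


def parse_rerank_reply(reply: str, n_fragments: int, keep: int = 3) -> list[int]:
    """Extract valid, de-duplicated fragment indices from a free-text reply."""
    picked: list[int] = []
    for token in reply.replace(",", " ").split():
        token = token.strip("[].")
        if token.isdecimal():
            idx = int(token)
            if idx < n_fragments and idx not in picked:
                picked.append(idx)
    # Fall back to the original vector-search order if nothing usable came back
    return picked[:keep] or list(range(min(keep, n_fragments)))


print(parse_rerank_reply("4, 0, 7", n_fragments=10))
# → [4, 0, 7]
print(parse_rerank_reply("I could not decide.", n_fragments=10))
# → [0, 1, 2]
```

The fallback matters because a language model may ignore the format instruction. Defaulting to the vector-search order keeps the pipeline answering even when the rerank reply is unusable.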

Key Takeaways

Leveraging Gemini API's multimodal capabilities, you can build a RAG pipeline that searches and analyzes text, images, PDFs, and video in a unified system. The four-phase architecture (Ingest → Embed → Index → Query) combined with model tiering for cost optimization creates a practical and scalable system.

Start with a small pipeline handling just text and PDFs, then incrementally add image and video support. For related concepts, also check out the Gemini 2.5 Pro Extended Thinking Overview.
