Gemini API Multimodal RAG Pipeline Production Guide: Building Cross-Format Search with Images, PDFs, and Video
Build a production-grade multimodal RAG pipeline with Gemini 2.5 Pro: unified vector search across text, images, PDFs, and video with cost optimization and scaling patterns.
Traditional RAG (Retrieval-Augmented Generation) systems only handle text, but real-world knowledge exists in many formats. Design documents in PDF, whiteboard photos, meeting recordings, spreadsheet charts — if your AI assistant can't search across all of these, its practical utility is limited.
Gemini 2.5 Pro provides a multimodal API that processes text, images, PDFs, video, and audio in a single model. Combined with the Embeddings API, it lets you build a multimodal RAG pipeline that searches documents of any format in a unified vector space.
This guide walks through document processing, vector index construction, and the search-generation pipeline with working Python code. We assume familiarity with Function Calling fundamentals — start there if you're new to agent tool use.
Architecture Design
The multimodal RAG pipeline consists of four phases:
Ingest: Accept various file types and split them into processable chunks
Embed: Convert each chunk to a vector using Gemini Embeddings API
Index: Store vectors in a database for fast retrieval
Query: Search for relevant chunks and generate answers with Gemini
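As a rough illustration of how the four phases hand data to one another, here is a minimal, dependency-free sketch. Everything in it is a stand-in: `toy_embed` is a hypothetical letter-count embedding used only so the example runs offline, and the function names are illustrative rather than part of the Gemini SDK.

```python
import math


def ingest(text: str, size: int = 40) -> list[str]:
    """Ingest: split raw text into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(text: str) -> list[float]:
    """Embed: hypothetical stand-in (letter counts), NOT a real embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    """Index: store each chunk next to its vector."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query(index: list[tuple[str, list[float]]], question: str, top_k: int = 1) -> list[str]:
    """Query: embed the question and return the top_k most similar chunks."""
    q_vec = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(item[1], q_vec), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


# Ingest -> Embed -> Index -> Query, end to end on a toy document
index = build_index(ingest("aaaa bbbb", size=5))
print(query(index, "bbb"))
```

In the real pipeline each stand-in is replaced by the matching component from the sections that follow: the document processors, the embedding service, the vector store, and the query engine.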
WHAT YOU'LL LEARN
✦Master multimodal vector search design and implementation using Gemini Embeddings API
✦Build a document processing pipeline that indexes PDFs, images, and video in a unified vector space
✦Learn concrete caching strategies, cost optimization, and scaling patterns for production deployment
Document Processing Pipeline
Implement document processors for each file format.
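The processors in this guide build `DocumentChunk` objects tagged with a `DocumentType`, but this section does not show either definition. The following is a minimal sketch of what they could look like, inferred only from how the fields are used in the code (`chunk_id`, `source_file`, `doc_type`, `content_text`, `content_description`, `metadata`, `embedding`, and `doc_type.value`). The enum values and defaults are assumptions, and the original definitions may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class DocumentType(Enum):
    """Source format of a chunk (string values are assumed)."""
    TEXT = "text"
    PDF = "pdf"
    IMAGE = "image"
    VIDEO = "video"


@dataclass
class DocumentChunk:
    """One searchable unit: a text representation plus provenance."""
    chunk_id: str
    source_file: str
    doc_type: DocumentType
    content_text: str          # Text that gets embedded and passed to the model
    content_description: str   # Short excerpt shown alongside search results
    metadata: dict[str, Any] = field(default_factory=dict)
    embedding: Optional[list[float]] = None  # Filled in by the embedding step
```

Keeping `embedding` optional lets the processors create chunks first and attach vectors in a separate step.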
Text and PDF Processing
import re

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

# DocumentChunk and DocumentType are the pipeline's data-model types;
# they must be defined before this class.


class DocumentProcessor:
    """Multimodal document processor"""

    def __init__(self, model_name: str = "gemini-2.5-pro"):
        self.model = genai.GenerativeModel(model_name)
        self.chunk_size = 1000  # Characters per chunk
        self.chunk_overlap = 200  # Characters shared by adjacent chunks

    def process_text(self, text: str, source: str) -> list[DocumentChunk]:
        """Split text into overlapping chunks"""
        chunks = []
        start = 0
        chunk_idx = 0
        while start < len(text):
            end = min(start + self.chunk_size, len(text))
            chunk_text = text[start:end]
            chunks.append(DocumentChunk(
                chunk_id=f"{source}__chunk_{chunk_idx}",
                source_file=source,
                doc_type=DocumentType.TEXT,
                content_text=chunk_text,
                content_description=chunk_text[:200],
                metadata={"chunk_index": chunk_idx, "char_start": start},
            ))
            if end == len(text):
                break  # Avoid a trailing chunk that only repeats the overlap
            start += self.chunk_size - self.chunk_overlap
            chunk_idx += 1
        return chunks

    async def process_pdf(self, pdf_path: str) -> list[DocumentChunk]:
        """Analyze PDF with Gemini Vision and split into page-level chunks"""
        # Note: upload_file is a blocking call
        pdf_file = genai.upload_file(pdf_path)

        # Extract per-page content using Gemini
        response = await self.model.generate_content_async([
            pdf_file,
            "Transcribe each page of this PDF in detail, including page numbers. "
            "If there are figures or tables, describe their contents. "
            "Separate each page with '--- Page N ---'."
        ])

        # Split on the page markers. The capture group keeps the page number,
        # so parts = [preamble, "1", text_1, "2", text_2, ...]; any preamble
        # before the first marker is skipped.
        parts = re.split(r"--- Page (\d+) ---", response.text)
        chunks = []
        for page_num, page_text in zip(parts[1::2], parts[2::2]):
            page_text = page_text.strip()
            if not page_text:
                continue
            chunks.append(DocumentChunk(
                chunk_id=f"{pdf_path}__page_{page_num}",
                source_file=pdf_path,
                doc_type=DocumentType.PDF,
                content_text=page_text,
                content_description=page_text[:200],
                metadata={"page_number": page_num},
            ))
        return chunks
Image and Video Processing
    # DocumentProcessor methods, continued

    async def process_image(self, image_path: str) -> list[DocumentChunk]:
        """Analyze image with Gemini Vision to generate text representation"""
        image_file = genai.upload_file(image_path)
        response = await self.model.generate_content_async([
            image_file,
            "Describe this image in detail. If it contains text, include OCR results. "
            "If it's a chart or graph, summarize the data it represents."
        ])
        return [DocumentChunk(
            chunk_id=f"{image_path}__full",
            source_file=image_path,
            doc_type=DocumentType.IMAGE,
            content_text=response.text,
            content_description=response.text[:200],
            metadata={"image_path": image_path},
        )]

    async def process_video(
        self, video_path: str, interval_seconds: int = 30
    ) -> list[DocumentChunk]:
        """Analyze video in segments"""
        import asyncio  # Local import keeps this snippet self-contained

        video_file = genai.upload_file(video_path)
        # Uploaded videos are processed server-side; wait until the file is ready
        while video_file.state.name == "PROCESSING":
            await asyncio.sleep(5)
            video_file = genai.get_file(video_file.name)
        if video_file.state.name == "FAILED":
            raise RuntimeError(f"Video processing failed: {video_path}")

        response = await self.model.generate_content_async([
            video_file,
            f"Divide this video into {interval_seconds}-second segments "
            "and describe each segment in detail. "
            "Format: '[MM:SS - MM:SS] Description' for each segment."
        ])

        # One chunk per non-empty line of the response
        chunks = []
        segments = response.text.strip().split("\n")
        for idx, segment in enumerate(segments):
            if segment.strip():
                chunks.append(DocumentChunk(
                    chunk_id=f"{video_path}__segment_{idx}",
                    source_file=video_path,
                    doc_type=DocumentType.VIDEO,
                    content_text=segment,
                    content_description=segment[:200],
                    metadata={"segment_index": idx},
                ))
        return chunks
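The video prompt asks for segments formatted as '[MM:SS - MM:SS] Description', yet the chunk metadata only records a segment index. If you want to jump to the right position in a recording, one option is to parse the timestamps into seconds. The sketch below assumes the model follows the requested format; `parse_segment` is a hypothetical helper, and it returns `None` for any line that does not match, so malformed lines can be skipped or stored without timing data.

```python
import re
from typing import Optional

# Matches lines such as "[00:30 - 01:00] Speaker introduces the roadmap"
SEGMENT_RE = re.compile(r"^\[(\d{1,3}):(\d{2})\s*-\s*(\d{1,3}):(\d{2})\]\s*(.+)$")


def parse_segment(line: str) -> Optional[dict]:
    """Parse one '[MM:SS - MM:SS] Description' line into numeric metadata."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        return None  # Line does not follow the requested format
    start_min, start_sec, end_min, end_sec, description = match.groups()
    return {
        "start_seconds": int(start_min) * 60 + int(start_sec),
        "end_seconds": int(end_min) * 60 + int(end_sec),
        "description": description.strip(),
    }


print(parse_segment("[01:30 - 02:00] Demo of the search UI"))
# {'start_seconds': 90, 'end_seconds': 120, 'description': 'Demo of the search UI'}
```

The resulting dictionary could be merged into the `metadata` field of each video chunk.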
Vector Embedding with Gemini Embeddings API
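Convert all chunks into vectors using the Gemini Embeddings API. Each processor has already reduced its input to text (a transcription or a description), so a single text embedding model places every format in the same vector space.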
class EmbeddingService:
    """Gemini Embeddings API wrapper"""

    def __init__(self, model: str = "models/text-embedding-004"):
        self.model = model
        self.batch_size = 100  # API batch limit

    async def embed_chunks(
        self, chunks: list[DocumentChunk]
    ) -> list[DocumentChunk]:
        """Add embeddings to a list of chunks"""
        texts = [chunk.content_text for chunk in chunks]

        # Batch processing
        all_embeddings = []
        for i in range(0, len(texts), self.batch_size):
            batch = texts[i:i + self.batch_size]
            result = genai.embed_content(
                model=self.model,
                content=batch,
                task_type="retrieval_document",
            )
            # For a list input, result["embedding"] holds one vector per text
            all_embeddings.extend(result["embedding"])

        for chunk, embedding in zip(chunks, all_embeddings):
            chunk.embedding = embedding
        return chunks

    async def embed_query(self, query: str) -> list[float]:
        """Convert a search query to an embedding"""
        result = genai.embed_content(
            model=self.model,
            content=query,
            task_type="retrieval_query",  # Query-specific task type
        )
        return result["embedding"]
Vector Store and Search Engine
Implement cosine similarity search. In production, use a dedicated vector DB like Pinecone, Weaviate, or pgvector — here we use NumPy to illustrate the principles.
import numpy as np
from typing import Optional


class VectorStore:
    """In-memory vector store (use pgvector etc. in production)"""

    def __init__(self):
        self.chunks: list[DocumentChunk] = []
        self.embeddings: np.ndarray | None = None

    def add(self, chunks: list[DocumentChunk]):
        """Add chunks to the store"""
        # Store only chunks that have an embedding, so that row i of
        # self.embeddings always corresponds to self.chunks[i]
        self.chunks.extend(c for c in chunks if c.embedding)
        self.embeddings = np.array([c.embedding for c in self.chunks])

    def search(
        self,
        query_embedding: list[float],
        top_k: int = 5,
        doc_type_filter: Optional[DocumentType] = None,
    ) -> list[tuple[DocumentChunk, float]]:
        """Search by cosine similarity"""
        if self.embeddings is None or len(self.embeddings) == 0:
            return []

        query_vec = np.array(query_embedding)
        norms = np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_vec)
        # Guard against division by zero for all-zero vectors
        similarities = np.dot(self.embeddings, query_vec) / np.maximum(norms, 1e-12)

        # Walk candidates from most to least similar, applying the type filter
        results = []
        for idx in np.argsort(similarities)[::-1]:
            chunk = self.chunks[idx]
            if doc_type_filter is not None and chunk.doc_type != doc_type_filter:
                continue
            results.append((chunk, float(similarities[idx])))
            if len(results) >= top_k:
                break
        return results
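One common refinement of the cosine search above is to normalize the stored vectors once at insert time, so that each query reduces to a single matrix-vector product. The sketch below shows the idea on three hand-made 2-D vectors. `normalize_rows` and `top_k_cosine` are hypothetical helper names, and the vectors are toy values rather than real embeddings.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale every row to unit length (all-zero rows are left as zeros)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, 1e-12)


def top_k_cosine(unit_matrix: np.ndarray, query: np.ndarray, top_k: int = 2):
    """Return (row_index, cosine_similarity) pairs, best match first."""
    unit_query = query / max(np.linalg.norm(query), 1e-12)
    scores = unit_matrix @ unit_query  # Dot product of unit vectors = cosine
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Three toy "document" vectors, normalized once up front
docs = normalize_rows(np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]))
hits = top_k_cosine(docs, np.array([0.0, 1.0]), top_k=2)
print(hits)  # Row 2 points the same way as the query, row 1 is 45 degrees off
```

Vector length does not affect the ranking: row 2 is twice as long as the query but scores a perfect 1.0 because it points in the same direction.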
Search and Generation Pipeline (RAG Query Engine)
Pass search results as context to Gemini for answer generation.
from typing import Optional


class RAGQueryEngine:
    """Multimodal RAG query engine"""

    def __init__(
        self,
        embedding_service: EmbeddingService,
        vector_store: VectorStore,
        model_name: str = "gemini-2.5-pro",
    ):
        self.embedding_service = embedding_service
        self.vector_store = vector_store
        self.model = genai.GenerativeModel(model_name)

    async def query(
        self,
        question: str,
        top_k: int = 5,
        doc_type_filter: Optional[DocumentType] = None,
    ) -> dict:
        """Execute RAG search + generation for a question"""
        # Step 1: Embed the query
        query_embedding = await self.embedding_service.embed_query(question)

        # Step 2: Vector search
        results = self.vector_store.search(
            query_embedding,
            top_k=top_k,
            doc_type_filter=doc_type_filter,
        )
        if not results:
            return {
                "answer": "No relevant documents found.",
                "sources": [],
            }

        # Step 3: Build context
        context_parts = []
        sources = []
        for chunk, score in results:
            context_parts.append(
                f"[Source: {chunk.source_file} | Type: {chunk.doc_type.value} | "
                f"Score: {score:.3f}]\n{chunk.content_text}"
            )
            sources.append({
                "file": chunk.source_file,
                "type": chunk.doc_type.value,
                "relevance": round(score, 3),
                "excerpt": chunk.content_description,
            })
        context = "\n\n---\n\n".join(context_parts)

        # Step 4: Generate answer with Gemini
        prompt = f"""Based on the search results below, answer the user's question accurately.
Cite the sources you used. If you need to speculate beyond the search results,
clearly state that you are doing so.

[Search Results]
{context}

[Question]
{question}"""
        response = await self.model.generate_content_async(prompt)
        return {
            "answer": response.text,
            "sources": sources,
            "tokens_used": response.usage_metadata.total_token_count,
        }


# Usage
# engine = RAGQueryEngine(embedding_service, vector_store)
# result = await engine.query("Which region grew the most in last month's sales report chart?")
# print(result["answer"])
# print(f"Sources: {result['sources']}")
Embedding cache: Persist generated embeddings in your database and only recompute when documents are updated.
Context Caching: Use Gemini's Context Caching API when including large sets of document chunks as context, reducing costs for repeated queries.
Model tiering: Use text-embedding-004 for embeddings, gemini-2.5-flash for initial relevance filtering, and gemini-2.5-pro for final answer generation.
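For the embedding cache item in the list above, a minimal sketch follows. It keys each embedding by a SHA-256 hash of the chunk text, so an unchanged chunk is never embedded twice, while an edited chunk gets a new key and is recomputed. `EmbeddingCache` is a hypothetical class, the in-memory dict stands in for a database table, and `embed_fn` can be any callable that maps text to a vector (for example, a thin wrapper around the embedding service).

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Caches embeddings keyed by the SHA-256 hash of the chunk text."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}  # Stand-in for a DB table
        self.misses = 0  # Number of times the embedding function was called

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        """Return the cached embedding, computing it only on a cache miss."""
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    """Placeholder embedding for the demo; NOT a real model call."""
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.get("hello")
cache.get("hello")         # Served from the cache
cache.get("hello, world")  # Different text, so a new embedding is computed
print(cache.misses)  # 2
```

Hashing the content rather than the file name means a renamed file reuses its cached vectors, and a modified file invalidates only the chunks whose text changed.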
# Cost-optimized query: Flash for filtering → Pro for final answer
async def cost_optimized_query(engine: RAGQueryEngine, question: str):
    # Step 1: Retrieve a wider candidate set by vector search
    flash_model = genai.GenerativeModel("gemini-2.5-flash")
    results = engine.vector_store.search(
        await engine.embedding_service.embed_query(question),
        top_k=10  # Retrieve more candidates
    )
    # Step 2: Rerank with Flash (cheaper)
    rerank_prompt = f"From the following document fragments, select the top 3 most relevant to the question: '{question}'"
    # ... abbreviated
    # Step 3: Final answer with Pro (high quality, higher cost)
    # Use only the filtered 3 results as context
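The snippet above leaves the rerank step abbreviated. One possible way to approach it (an assumption, not the author's implementation) is to number the candidates in the prompt, ask the Flash model to reply with the chosen numbers, and parse that reply defensively. The helper names `build_rerank_prompt` and `parse_rerank_reply` are hypothetical, and the model call itself is left out so the code runs offline.

```python
import re


def build_rerank_prompt(question: str, fragments: list[str], keep: int = 3) -> str:
    """Number each fragment so the model can answer with indices only."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(fragments))
    return (
        f"Question: {question}\n\n"
        f"Document fragments:\n{numbered}\n\n"
        f"Reply with the numbers of the {keep} most relevant fragments, "
        "most relevant first, separated by commas (for example: 2, 0, 5)."
    )


def parse_rerank_reply(reply: str, n_fragments: int, keep: int = 3) -> list[int]:
    """Extract valid, unique fragment indices from the model's reply."""
    picked: list[int] = []
    for token in re.findall(r"\d+", reply):
        idx = int(token)
        if idx < n_fragments and idx not in picked:
            picked.append(idx)
    # If the reply is unusable or too short, fall back to vector-search order
    for idx in range(n_fragments):
        if len(picked) >= keep:
            break
        if idx not in picked:
            picked.append(idx)
    return picked[:keep]


print(parse_rerank_reply("4, 1, 9, 1", n_fragments=5))  # [4, 1, 0]
```

In the example, index 9 is out of range and the second 1 is a duplicate, so the third slot falls back to the best remaining vector-search hit. The selected indices would then pick the chunks passed to the Pro model as context.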
Key Takeaways
Leveraging Gemini API's multimodal capabilities, you can build a RAG pipeline that searches and analyzes text, images, PDFs, and video in a unified system. The four-phase architecture (Ingest → Embed → Index → Query) combined with model tiering for cost optimization creates a practical and scalable system.
Start with a small pipeline handling just text and PDFs, then incrementally add image and video support. For related concepts, also check out the Gemini 2.5 Pro Extended Thinking Overview.