Gemini Lab
API / SDK · 2026-04-03 · Intermediate

Building a Production RAG System with Gemini Embedding API and Pinecone

A step-by-step guide to building a production-ready RAG system using Gemini Embedding API and Pinecone. Covers index design, query optimization, chunking strategies, and cost management with practical Python code.

Tags: gemini, rag, pinecone, embeddings, vector-database, python

Why Gemini + Pinecone?

RAG (Retrieval-Augmented Generation) has become a foundational pattern for building AI applications that need to answer questions grounded in your own data. At the heart of any RAG system is a vector database that stores and retrieves semantically similar content.

Among the many vector database options available today, Pinecone stands out as a fully managed, serverless service: there is no infrastructure to maintain, and it scales automatically with your workload. Pair that with the Gemini Embedding API, which generates high-quality multilingual embeddings with up to 3,072 dimensions (this guide uses 768-dimensional vectors), and you have a solid foundation for production RAG.

Prerequisites and Setup

What You'll Need

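Before getting started, make sure you have:

  • Python 3.9 or later (the examples use built-in generic type hints such as list[str])
  • A Gemini API key, stored as GOOGLE_API_KEY
  • A Pinecone API key, stored as PINECONE_API_KEY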

Installing Dependencies

pip install google-generativeai pinecone python-dotenv

Create a .env file to store your credentials:

# .env
GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
PINECONE_API_KEY=YOUR_PINECONE_API_KEY

Creating a Pinecone Index

When creating a Pinecone index, the dimension must match the output size of your embedding model. The text-embedding-004 model used in this guide outputs 768-dimensional vectors.

import os
from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec
 
load_dotenv()
 
# Initialize the Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
 
INDEX_NAME = "gemini-rag-index"
 
# Only create the index if it doesn't already exist
if INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(
        name=INDEX_NAME,
        dimension=768,          # Matches text-embedding-004 output
        metric="cosine",        # Cosine similarity works well for text search
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"  # Available on the free tier
        )
    )
    print(f"✅ Index '{INDEX_NAME}' created successfully")
else:
    print(f"✅ Index '{INDEX_NAME}' already exists")
 
index = pc.Index(INDEX_NAME)
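
To make the metric="cosine" setting concrete, here is a minimal pure-Python sketch of cosine similarity, the score reported for each match under this metric. The function name cosine_similarity and the toy 3-dimensional vectors are illustrative assumptions; the real embeddings in this guide have 768 dimensions, and Pinecone computes the score on its side.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real embeddings)
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # Same direction as the query, different length
doc_far = [0.0, 1.0, 0.0]    # Orthogonal to the query

print(round(cosine_similarity(query, doc_close), 6))  # 1.0
print(round(cosine_similarity(query, doc_far), 6))    # 0.0
```

Because the score depends only on direction, not vector length, two texts with the same meaning but different lengths can still score close to 1.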

Generating Embeddings and Indexing Documents

Generating Embeddings with Gemini

import os

import google.generativeai as genai
from dotenv import load_dotenv
 
load_dotenv()
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
 
def embed_texts(texts: list[str], task_type: str = "RETRIEVAL_DOCUMENT") -> list[list[float]]:
    """
    Convert a list of texts to embeddings using the Gemini Embedding API.
    task_type:
        RETRIEVAL_DOCUMENT  — use when indexing documents
        RETRIEVAL_QUERY     — use when embedding search queries
    """
    result = genai.embed_content(
        model="models/text-embedding-004",
        content=texts,
        task_type=task_type
    )
    # result["embedding"] returns one vector per input text
    return result["embedding"]

The task_type parameter is an important detail. Using RETRIEVAL_DOCUMENT at index time and RETRIEVAL_QUERY at search time tells Gemini to optimize the embedding for its specific role, which measurably improves retrieval quality.
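
One way to keep the two task types from being mixed up is to wrap the embedding call in two role-specific helpers. The sketch below is only an illustration: make_role_embedders, embed_documents, embed_query, and the fake_embed stand-in are hypothetical names, and in the real pipeline you would pass in a function with the same signature as embed_texts instead of the stand-in.

```python
from typing import Callable

# Signature of an embedding function such as embed_texts(texts, task_type)
EmbedFn = Callable[[list[str], str], list[list[float]]]

def make_role_embedders(embed_fn: EmbedFn):
    """Return (embed_documents, embed_query) with the task type fixed per role."""
    def embed_documents(texts: list[str]) -> list[list[float]]:
        return embed_fn(texts, "RETRIEVAL_DOCUMENT")

    def embed_query(text: str) -> list[float]:
        return embed_fn([text], "RETRIEVAL_QUERY")[0]

    return embed_documents, embed_query

# Stand-in embedder for demonstration only: it records the task type it was
# called with and returns a dummy 2-dimensional vector per text.
calls: list[str] = []

def fake_embed(texts: list[str], task_type: str) -> list[list[float]]:
    calls.append(task_type)
    return [[float(len(t)), 0.0] for t in texts]

embed_documents, embed_query = make_role_embedders(fake_embed)
doc_vectors = embed_documents(["alpha", "be"])
query_vector = embed_query("hi")
print(calls)  # ['RETRIEVAL_DOCUMENT', 'RETRIEVAL_QUERY']
```

With this shape, indexing code only ever sees embed_documents and search code only ever sees embed_query, so the task type cannot be passed incorrectly by accident.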

Uploading Documents to Pinecone

def upsert_documents(docs: list[dict], index, batch_size: int = 50) -> None:
    """
    Index a list of documents into Pinecone in batches.
    Each doc should have: {"id": str, "text": str, "metadata": dict}
    """
    for i in range(0, len(docs), batch_size):
        batch = docs[i : i + batch_size]
        texts  = [d["text"] for d in batch]
        embeds = embed_texts(texts, task_type="RETRIEVAL_DOCUMENT")
 
        vectors = [
            {
                "id":       d["id"],
                "values":   embed,
                "metadata": {**d.get("metadata", {}), "text": d["text"]}
            }
            for d, embed in zip(batch, embeds)
        ]
        index.upsert(vectors=vectors)
        print(f"  Indexed: {i + len(batch)}/{len(docs)} documents")
 
# Example usage
sample_docs = [
    {
        "id": "doc-001",
        "text": "Gemini 2.5 Pro is Google's most capable AI model, released in March 2025.",
        "metadata": {"source": "news", "lang": "en"}
    },
    {
        "id": "doc-002",
        "text": "Pinecone is a serverless vector database designed for large-scale AI applications.",
        "metadata": {"source": "docs", "lang": "en"}
    },
    {
        "id": "doc-003",
        "text": "RAG (Retrieval-Augmented Generation) improves LLM accuracy by grounding responses in retrieved context.",
        "metadata": {"source": "tutorial", "lang": "en"}
    },
]
 
upsert_documents(sample_docs, index)
# Expected output: Indexed: 3/3 documents
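
Since the index dimension is fixed at creation time, it can help to check each batch locally before sending it. The following is a rough sketch under stated assumptions: validate_vectors is a hypothetical helper, the default of 768 mirrors the index created earlier, and the duplicate-ID check is there only because records are keyed by ID, so a repeated ID in one batch usually points to an ID-generation bug.

```python
import math

def validate_vectors(vectors: list[dict], expected_dim: int = 768) -> list[str]:
    """Return human-readable problems; an empty list means the batch looks OK."""
    problems = []
    seen_ids = set()
    for v in vectors:
        vid = v.get("id")
        values = v.get("values", [])
        if not isinstance(vid, str) or not vid:
            problems.append(f"invalid id: {vid!r}")
        elif vid in seen_ids:
            problems.append(f"duplicate id: {vid}")
        else:
            seen_ids.add(vid)
        if len(values) != expected_dim:
            problems.append(f"{vid}: expected {expected_dim} values, got {len(values)}")
        elif not all(math.isfinite(x) for x in values):
            problems.append(f"{vid}: contains NaN or infinity")
    return problems

# Hypothetical batch with two deliberate mistakes
batch = [
    {"id": "doc-001", "values": [0.1] * 768},
    {"id": "doc-002", "values": [0.1] * 3072},  # Wrong dimension
    {"id": "doc-001", "values": [0.2] * 768},   # Duplicate id
]
for problem in validate_vectors(batch):
    print(problem)
# doc-002: expected 768 values, got 3072
# duplicate id: doc-001
```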

Building the RAG Pipeline

Retrieving Relevant Context

def search_context(query: str, index, top_k: int = 5) -> list[dict]:
    """
    Retrieve documents from Pinecone that are semantically similar to the query.
    Returns: [{"text": ..., "score": ..., "metadata": ...}, ...]
    """
    # Embed the query with RETRIEVAL_QUERY task type
    query_embed = embed_texts([query], task_type="RETRIEVAL_QUERY")[0]
 
    results = index.query(
        vector=query_embed,
        top_k=top_k,
        include_metadata=True
    )
 
    contexts = []
    for match in results["matches"]:
        contexts.append({
            "text":     match["metadata"].get("text", ""),
            "score":    match["score"],
            "metadata": match["metadata"]
        })
    return contexts
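
A top_k search always returns the closest matches, even when none of them is actually relevant. A possible refinement, sketched here with a hypothetical filter_by_score helper, is to drop low-scoring matches before they reach the prompt. The 0.5 threshold is an assumption, not a recommended value; tune it against your own data.

```python
def filter_by_score(contexts: list[dict], min_score: float = 0.5) -> list[dict]:
    """Keep only matches whose similarity score is at least min_score, best first."""
    kept = [c for c in contexts if c["score"] >= min_score]
    return sorted(kept, key=lambda c: c["score"], reverse=True)

# Hypothetical results in the shape returned by search_context
contexts = [
    {"text": "RAG grounds answers in retrieved context.", "score": 0.82, "metadata": {}},
    {"text": "Unrelated release note.", "score": 0.31, "metadata": {}},
    {"text": "Pinecone is a serverless vector database.", "score": 0.64, "metadata": {}},
]
relevant = filter_by_score(contexts, min_score=0.5)
print([c["score"] for c in relevant])  # [0.82, 0.64]
```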

Generating Answers with Retrieved Context

model = genai.GenerativeModel("gemini-2.5-flash")
 
def answer_with_rag(question: str, index, top_k: int = 5) -> str:
    """
    Full RAG pipeline: retrieve relevant docs → build context → generate answer.
    """
    # Step 1: Semantic search
    contexts = search_context(question, index, top_k=top_k)
 
    if not contexts:
        return "No relevant documents found for your query."
 
    # Step 2: Format retrieved context into the prompt
    context_text = "\n\n".join(
        f"[Source {i+1} (similarity: {c['score']:.3f})]\n{c['text']}"
        for i, c in enumerate(contexts)
    )
 
    prompt = f"""Answer the question below using only the provided reference information.
If the answer is not contained in the sources, say "I don't have that information."
 
## Reference Information
{context_text}
 
## Question
{question}
"""
 
    response = model.generate_content(prompt)
    return response.text
 
# Example usage
answer = answer_with_rag("What is Gemini 2.5 Pro?", index)
print(answer)
# Expected output (exact wording varies between runs): an answer grounded in doc-001
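
Prompt assembly can also be factored into a pure function, which makes it testable without any API calls and gives a natural place to cap how much context is sent. This is a sketch, not the pipeline's actual implementation: build_prompt and the max_context_chars budget of 4,000 characters are assumptions introduced for illustration.

```python
def build_prompt(question: str, contexts: list[dict], max_context_chars: int = 4000) -> str:
    """Assemble the RAG prompt, adding sources in rank order until the budget is used up."""
    blocks = []
    used = 0
    for i, c in enumerate(contexts):
        block = f"[Source {i+1} (similarity: {c['score']:.3f})]\n{c['text']}"
        if blocks and used + len(block) > max_context_chars:
            break  # Budget exceeded; the top-ranked source is always kept
        blocks.append(block)
        used += len(block)
    context_text = "\n\n".join(blocks)
    return (
        "Answer the question below using only the provided reference information.\n"
        'If the answer is not contained in the sources, say "I don\'t have that information."\n\n'
        f"## Reference Information\n{context_text}\n\n"
        f"## Question\n{question}\n"
    )

# Hypothetical contexts in the shape returned by search_context
contexts = [
    {"text": "RAG grounds LLM answers in retrieved context.", "score": 0.91},
    {"text": "Pinecone is a serverless vector database.", "score": 0.42},
]
print(build_prompt("What is RAG?", contexts))
```

The returned string can be passed to model.generate_content in place of the inline f-string.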

Production Optimization Tips

Metadata Filtering for Precision

Pinecone supports filtering by metadata fields before performing vector search. When you have large document collections, filtering first and then searching within a subset is both faster and more precise.

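# Embed the query once, then search only within English-language documents
query_embed = embed_texts(["What is Pinecone?"], task_type="RETRIEVAL_QUERY")[0]
filtered_results = index.query(
    vector=query_embed,
    top_k=5,
    filter={"lang": {"$eq": "en"}},  # Metadata filter
    include_metadata=True
)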
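
Filters can combine several conditions. As a small sketch, the hypothetical build_filter helper below composes a filter dictionary from optional arguments using the $eq, $in, and $and operators; the lang and source fields match the sample metadata used earlier, and the function returns None when there is nothing to filter on so the filter argument can be left at its default.

```python
from typing import Optional

def build_filter(lang: Optional[str] = None,
                 sources: Optional[list[str]] = None) -> Optional[dict]:
    """Build a metadata filter from optional conditions, or None if there are none."""
    conditions = []
    if lang is not None:
        conditions.append({"lang": {"$eq": lang}})
    if sources:
        conditions.append({"source": {"$in": sources}})
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}

print(build_filter(lang="en"))
# {'lang': {'$eq': 'en'}}
print(build_filter(lang="en", sources=["docs", "tutorial"]))
# {'$and': [{'lang': {'$eq': 'en'}}, {'source': {'$in': ['docs', 'tutorial']}}]}
```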

Chunking Strategy for Better Retrieval

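Embedding an entire long document into a single vector dilutes its information, because many topics are averaged into one point. In production, splitting documents into smaller chunks typically improves recall, since each vector then represents one focused passage.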

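def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """
    Split text into overlapping chunks.
    chunk_size: maximum characters per chunk
    overlap:    overlap between consecutive chunks (preserves context continuity)
    """
    # An overlap >= chunk_size would never advance and loop forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid emitting a redundant tail
        start = end - overlap  # Step back by overlap amount
    return chunks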

A chunk size of 400–600 characters with 50-character overlap tends to work well for most document types.
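
Once a document is split, each chunk needs its own record ID and a way back to its parent. The sketch below assumes a hypothetical chunks_to_docs helper and an ID scheme of the form "<doc_id>#chunk-<n>"; the parent_id and chunk_index metadata fields are illustrative names, not anything required by Pinecone. The output has the {"id", "text", "metadata"} shape that upsert_documents expects.

```python
def chunks_to_docs(doc_id: str, chunks: list[str], metadata: dict) -> list[dict]:
    """Turn the chunks of one source document into upsert-ready records."""
    return [
        {
            # Deterministic ID: re-indexing the same document overwrites its chunks
            "id": f"{doc_id}#chunk-{i:04d}",
            "text": chunk,
            "metadata": {**metadata, "parent_id": doc_id, "chunk_index": i},
        }
        for i, chunk in enumerate(chunks)
    ]

# Hypothetical chunks, e.g. the output of a chunking function
chunks = ["First part of the document.", "Second part of the document."]
docs = chunks_to_docs("doc-004", chunks, {"source": "docs", "lang": "en"})
print([d["id"] for d in docs])  # ['doc-004#chunk-0000', 'doc-004#chunk-0001']
```

Storing parent_id in the metadata also makes it possible to filter on, or later delete, all chunks that belong to one source document.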

Cost Management on Pinecone's Free Tier

Pinecone's free plan includes 5 indexes and 2 GB of storage. Here are a few tips to stay within those limits:

  • Delete stale vectors regularly: index.delete(ids=["old-doc-001"]) removes individual entries
  • Use batched upserts: Sending 50–100 vectors per request is far more efficient than one at a time
  • Monitor usage in the dashboard: Keep an eye on your monthly query count to avoid throttling
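
To get a feel for how far 2 GB goes, a back-of-the-envelope estimate is enough. The sketch below assumes 4 bytes per dimension (32-bit floats) plus rough per-record allowances for the ID and metadata; estimate_vectors_for_budget, the 1,000-byte metadata figure, and the 32-byte ID figure are all assumptions, and Pinecone's actual storage accounting may differ.

```python
def estimate_vectors_for_budget(
    storage_bytes: int,
    dimension: int = 768,
    metadata_bytes: int = 1000,  # Assumed average metadata size, including stored text
    id_bytes: int = 32,          # Assumed average ID length
) -> int:
    """Rough number of records that fit in storage_bytes (float32 vectors assumed)."""
    bytes_per_record = dimension * 4 + metadata_bytes + id_bytes
    return storage_bytes // bytes_per_record

two_gb = 2_000_000_000
print(estimate_vectors_for_budget(two_gb, dimension=768))   # 487329
print(estimate_vectors_for_budget(two_gb, dimension=3072))  # 150150
```

Under these assumptions, 768-dimensional vectors fit roughly three times as many records into the same budget as 3,072-dimensional ones, and large text metadata quickly becomes a significant share of each record.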

For tips on reducing Gemini Embedding API costs, see the Gemini API Context Caching Complete Guide.

Wrapping Up

This guide covered the essential building blocks for a Gemini + Pinecone RAG system:

  • Index creation: 768 dimensions, cosine similarity, serverless on AWS
  • Embedding generation: task_type distinction between documents and queries
  • Batched upserts: 50 vectors per request keeps the API efficient
  • Metadata filtering: narrows the search scope for better precision
  • Chunking: 500-char chunks with 50-char overlap preserve context

If you want to take this further with multimodal RAG — handling images, PDFs, and video alongside text — the Gemini API Multimodal RAG Pipeline Production Guide goes deep on those patterns.

Gemini + Pinecone is one of the fastest paths from idea to production for RAG. Give it a try with your own knowledge base and see the difference grounded, context-aware answers make.
