Building a Production RAG System with Gemini Embedding API and Pinecone

Why Gemini + Pinecone?

RAG (Retrieval-Augmented Generation) has become a foundational pattern for building AI applications that need to answer questions grounded in your own data. At the heart of any RAG system is a vector database that stores and retrieves semantically similar content.

Among the many vector database options available today, Pinecone stands out as a fully managed, serverless service: there is no infrastructure to maintain, and it scales automatically with your workload. Pair that with the Gemini Embedding API, which generates multilingual embeddings with up to 3,072 dimensions, and you have a solid foundation for production RAG.

Prerequisites and Setup

What You'll Need

Before getting started, make sure you have:

A Google AI Studio API key (free at aistudio.google.com)
A Pinecone API key (free tier available at app.pinecone.io)

Installing Dependencies

pip install google-generativeai pinecone python-dotenv

Create a .env file to store your credentials:

# .env
GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
PINECONE_API_KEY=YOUR_PINECONE_API_KEY
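A missing or empty key otherwise surfaces later as a confusing authentication error, so it can help to fail early. The helper below is a minimal sketch; the name require_env is hypothetical and not part of either SDK.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise a clear error if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Typical use, after load_dotenv() has read the .env file:
#   google_key = require_env("GOOGLE_API_KEY")
#   pinecone_key = require_env("PINECONE_API_KEY")
```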

Creating a Pinecone Index

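When creating a Pinecone index, the dimension must match the length of the vectors your embedding model returns. The gemini-embedding-001 model returns 3,072-dimensional vectors by default and accepts an output_dimensionality setting for smaller sizes; this guide requests 768 dimensions to keep the index compact.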

import os
from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec
 
load_dotenv()
 
# Initialize the Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
 
INDEX_NAME = "gemini-rag-index"
 
# Only create the index if it doesn't already exist
if INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(
        name=INDEX_NAME,
        dimension=768,          # Must equal the length of the embedding vectors you upsert
        metric="cosine",        # Cosine similarity works well for text search
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"  # Available on the free tier
        )
    )
    print(f"✅ Index '{INDEX_NAME}' created successfully")
else:
    print(f"✅ Index '{INDEX_NAME}' already exists")
 
index = pc.Index(INDEX_NAME)
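Because a dimension mismatch is only rejected at upsert time, a small guard can catch it earlier. The following is a sketch under the assumption that the index was created with 768 dimensions; check_dimensions is a hypothetical helper, not a Pinecone API.

```python
EXPECTED_DIM = 768  # Assumption: the dimension used when the index was created


def check_dimensions(vectors: list[list[float]], expected_dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError if any vector's length differs from the index dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"Vector {position} has {len(vector)} dimensions, expected {expected_dim}"
            )
```

Calling it on each batch of embeddings before index.upsert keeps the error message close to the cause.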

Generating Embeddings and Indexing Documents

Generating Embeddings with Gemini

import os

import google.generativeai as genai
from dotenv import load_dotenv
 
load_dotenv()
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
 
def embed_texts(texts: list[str], task_type: str = "RETRIEVAL_DOCUMENT") -> list[list[float]]:
    """
    Convert a list of texts to embeddings using the Gemini Embedding API.
    task_type:
        RETRIEVAL_DOCUMENT  — use when indexing documents
        RETRIEVAL_QUERY     — use when embedding search queries
    """
    result = genai.embed_content(
        model="models/gemini-embedding-001",
        content=texts,
        task_type=task_type,
        output_dimensionality=768,  # Must match the Pinecone index dimension
    )
    # For a list input, result["embedding"] holds one vector per input text
    return result["embedding"]

The task_type parameter is an important detail. Using RETRIEVAL_DOCUMENT at index time and RETRIEVAL_QUERY at search time tells Gemini which role each embedding plays, which generally improves retrieval quality compared with embedding both sides the same way.
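One way to make the wrong task type hard to pass by accident is to hide it behind two thin wrappers. This is a sketch only: embed_documents and embed_query are hypothetical names, and embed_fn stands for any callable with the same (texts, task_type) shape as embed_texts above.

```python
from typing import Callable

# Any function that takes (texts, task_type) and returns one vector per text
Embedder = Callable[[list[str], str], list[list[float]]]


def embed_documents(texts: list[str], embed_fn: Embedder) -> list[list[float]]:
    """Embed content that will be stored in the index."""
    return embed_fn(texts, "RETRIEVAL_DOCUMENT")


def embed_query(query: str, embed_fn: Embedder) -> list[float]:
    """Embed a single search query and return its vector."""
    return embed_fn([query], "RETRIEVAL_QUERY")[0]
```

Passing the embedder in as an argument also makes the wrappers easy to exercise with a stub in unit tests, without calling the API.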

Uploading Documents to Pinecone

def upsert_documents(docs: list[dict], index, batch_size: int = 50) -> None:
    """
    Index a list of documents into Pinecone in batches.
    Each doc should have: {"id": str, "text": str, "metadata": dict}
    """
    for i in range(0, len(docs), batch_size):
        batch = docs[i : i + batch_size]
        texts  = [d["text"] for d in batch]
        embeds = embed_texts(texts, task_type="RETRIEVAL_DOCUMENT")
 
        vectors = [
            {
                "id":       d["id"],
                "values":   embed,
                "metadata": {**d.get("metadata", {}), "text": d["text"]}
            }
            for d, embed in zip(batch, embeds)
        ]
        index.upsert(vectors=vectors)
        print(f"  Indexed: {i + len(batch)}/{len(docs)} documents")
 
# Example usage
sample_docs = [
    {
        "id": "doc-001",
        "text": "Gemini 2.5 Pro is a highly capable Google AI model, first released in March 2025.",
        "metadata": {"source": "news", "lang": "en"}
    },
    {
        "id": "doc-002",
        "text": "Pinecone is a serverless vector database designed for large-scale AI applications.",
        "metadata": {"source": "docs", "lang": "en"}
    },
    {
        "id": "doc-003",
        "text": "RAG (Retrieval-Augmented Generation) improves LLM accuracy by grounding responses in retrieved context.",
        "metadata": {"source": "tutorial", "lang": "en"}
    },
]
 
upsert_documents(sample_docs, index)
# Expected output: Indexed: 3/3 documents
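The batching in upsert_documents comes down to slicing the list in fixed-size steps. The generator below isolates that logic as a rough sketch; batched is a hypothetical helper name chosen for illustration.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# 120 items in batches of 50 produce slices of 50, 50, and 20 items
sizes = [len(batch) for batch in batched(list(range(120)), 50)]
print(sizes)  # prints [50, 50, 20]
```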

Building the RAG Pipeline

Retrieving Relevant Context

def search_context(query: str, index, top_k: int = 5) -> list[dict]:
    """
    Retrieve documents from Pinecone that are semantically similar to the query.
    Returns: [{"text": ..., "score": ..., "metadata": ...}, ...]
    """
    # Embed the query with RETRIEVAL_QUERY task type
    query_embed = embed_texts([query], task_type="RETRIEVAL_QUERY")[0]
 
    results = index.query(
        vector=query_embed,
        top_k=top_k,
        include_metadata=True
    )
 
    contexts = []
    for match in results["matches"]:
        contexts.append({
            "text":     match["metadata"].get("text", ""),
            "score":    match["score"],
            "metadata": match["metadata"]
        })
    return contexts
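A top_k query always returns the nearest vectors, even when none of them is actually relevant. A possible refinement is to drop weak matches before they reach the prompt. The function below is a sketch; filter_by_score is a hypothetical helper and the 0.5 default is an arbitrary placeholder that would need tuning against your own data.

```python
def filter_by_score(contexts: list[dict], min_score: float = 0.5) -> list[dict]:
    """Keep only contexts whose similarity score is at least min_score."""
    return [c for c in contexts if c["score"] >= min_score]


# Example with hand-written results in the shape returned by search_context
example_contexts = [
    {"text": "relevant passage", "score": 0.82, "metadata": {}},
    {"text": "weak match", "score": 0.31, "metadata": {}},
]
kept = filter_by_score(example_contexts, min_score=0.5)
print([c["text"] for c in kept])  # prints ['relevant passage']
```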

Generating Answers with Retrieved Context

model = genai.GenerativeModel("gemini-2.5-flash")
 
def answer_with_rag(question: str, index, top_k: int = 5) -> str:
    """
    Full RAG pipeline: retrieve relevant docs → build context → generate answer.
    """
    # Step 1: Semantic search
    contexts = search_context(question, index, top_k=top_k)
 
    if not contexts:
        return "No relevant documents found for your query."
 
    # Step 2: Format retrieved context into the prompt
    context_text = "\n\n".join(
        f"[Source {i+1} (similarity: {c['score']:.3f})]\n{c['text']}"
        for i, c in enumerate(contexts)
    )
 
    prompt = f"""Answer the question below using only the provided reference information.
If the answer is not contained in the sources, say "I don't have that information."
 
## Reference Information
{context_text}
 
## Question
{question}
"""
 
    response = model.generate_content(prompt)
    return response.text
 
# Example usage (freshly upserted vectors can take a few seconds to become searchable)
answer = answer_with_rag("What is Gemini 2.5 Pro?", index)
print(answer)
# Expected output (exact wording varies): a short answer drawn from doc-001
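Prompt construction is the one step of the pipeline that needs no network access, so it can be pulled into a pure function and tested on its own. The sketch below mirrors the prompt used above; build_prompt is a hypothetical name, not something the original pipeline defines.

```python
def build_prompt(question: str, contexts: list[dict]) -> str:
    """Format retrieved contexts and the question into a single grounded prompt."""
    context_text = "\n\n".join(
        f"[Source {i + 1} (similarity: {c['score']:.3f})]\n{c['text']}"
        for i, c in enumerate(contexts)
    )
    lines = [
        "Answer the question below using only the provided reference information.",
        "If the answer is not contained in the sources, say \"I don't have that information.\"",
        "",
        "## Reference Information",
        context_text,
        "",
        "## Question",
        question,
    ]
    return "\n".join(lines) + "\n"
```

With this in place, answer_with_rag would reduce to retrieval, a call to build_prompt, and a call to the model.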

Production Optimization Tips

Metadata Filtering for Precision

Pinecone can restrict a query to vectors whose metadata matches a filter. With large document collections, narrowing the candidate set this way is often both faster and more precise than searching everything and discarding results afterward.

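# Embed the query first (the question text here is only an example)
query_embed = embed_texts(["What is Pinecone?"], task_type="RETRIEVAL_QUERY")[0]
 
# Only search within English-language documents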
filtered_results = index.query(
    vector=query_embed,
    top_k=5,
    filter={"lang": {"$eq": "en"}},  # Metadata filter
    include_metadata=True
)
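Filters can combine several conditions with operators such as $in and $and. The builder below is a minimal sketch that assumes the lang and source metadata fields from the sample documents; build_filter is a hypothetical helper.

```python
from typing import Optional


def build_filter(
    lang: Optional[str] = None,
    sources: Optional[list[str]] = None,
) -> Optional[dict]:
    """Build a Pinecone metadata filter from optional criteria, or None if there are none."""
    clauses = []
    if lang is not None:
        clauses.append({"lang": {"$eq": lang}})
    if sources:
        clauses.append({"source": {"$in": sources}})
    if not clauses:
        return None  # No filter: search the whole index
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}
```

The result can be passed as the filter argument of index.query, for example build_filter(lang="en", sources=["docs", "tutorial"]).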

Chunking Strategy for Better Retrieval

Embedding an entire long document into a single vector blurs its distinct topics together. In production, splitting documents into smaller chunks usually improves retrieval, because each vector then represents one focused passage.

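def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """
    Split text into overlapping chunks.
    chunk_size: maximum characters per chunk
    overlap:    overlap between consecutive chunks (preserves context continuity)
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid emitting a redundant tail chunk
        start = end - overlap  # Step back by overlap amount
    return chunks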

A chunk size of 400–600 characters with 50-character overlap tends to work well for most document types.
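To index chunks rather than whole documents, each chunk needs its own ID and a pointer back to its parent. The sketch below shows one possible convention; split_with_overlap, chunk_document, the "#chunk-N" ID suffix, and the parent_id and chunk metadata fields are all assumptions made for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character chunks that share `overlap` characters with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The final chunk already reaches the end of the text
    return chunks


def chunk_document(doc: dict, chunk_size: int = 500, overlap: int = 50) -> list[dict]:
    """Expand one {"id", "text", "metadata"} document into one record per chunk."""
    pieces = split_with_overlap(doc["text"], chunk_size, overlap)
    return [
        {
            "id": f"{doc['id']}#chunk-{n}",
            "text": piece,
            "metadata": {**doc.get("metadata", {}), "parent_id": doc["id"], "chunk": n},
        }
        for n, piece in enumerate(pieces)
    ]
```

The records have the same shape that upsert_documents expects, so a list of chunked documents can be flattened and passed straight to it.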

Cost Management on Pinecone's Free Tier

Pinecone's free plan includes 5 indexes and 2 GB of storage. Here are a few tips to stay within those limits:

Delete stale vectors regularly: index.delete(ids=["old-doc-001"]) removes individual entries
Use batched upserts: Sending 50–100 vectors per request is far more efficient than one at a time
Monitor usage in the dashboard: Keep an eye on your monthly query count to avoid throttling
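A back-of-the-envelope estimate can show how far the storage allowance stretches. The sketch below assumes each dimension is stored as a 4-byte float and ignores IDs and any index overhead, so treat the result as a rough lower bound rather than Pinecone's actual accounting; estimate_vector_storage_bytes is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # Assumption: one 32-bit float per dimension


def estimate_vector_storage_bytes(
    num_vectors: int,
    dimension: int,
    metadata_bytes_per_vector: int = 0,
) -> int:
    """Rough storage estimate: raw vector bytes plus an optional per-vector metadata allowance."""
    return num_vectors * (dimension * BYTES_PER_FLOAT32 + metadata_bytes_per_vector)


# 100,000 vectors at 768 dimensions, vector data only
print(estimate_vector_storage_bytes(100_000, 768))  # prints 307200000
```

Since this guide stores the chunk text in metadata, passing a realistic metadata_bytes_per_vector (for example, the average chunk length in bytes) gives a more honest figure.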

For tips on reducing Gemini Embedding API costs, see the Gemini API Context Caching Complete Guide.

Wrapping Up

This guide covered the essential building blocks for a Gemini + Pinecone RAG system:

Index creation: 768 dimensions, cosine similarity, serverless on AWS
Embedding generation: task_type distinction between documents and queries
Batched upserts: 50 vectors per request keeps the API efficient
Metadata filtering: narrows the search scope for better precision
Chunking: 500-char chunks with 50-char overlap preserve context

If you want to take this further with multimodal RAG — handling images, PDFs, and video alongside text — the Gemini API Multimodal RAG Pipeline Production Guide goes deep on those patterns.

Gemini + Pinecone is one of the fastest paths from idea to production for RAG. Give it a try with your own knowledge base and see the difference grounded, context-aware answers make.