Why Gemini + Pinecone?
RAG (Retrieval-Augmented Generation) has become a foundational pattern for building AI applications that need to answer questions grounded in your own data. At the heart of any RAG system is a vector database that stores and retrieves semantically similar content.
Among the many vector database options available today, Pinecone stands out as a fully managed, serverless service: there is no infrastructure to maintain, and it scales automatically with your workload. Pair that with the Gemini Embedding API, which generates high-quality multilingual embeddings of up to 3,072 dimensions, and you have a solid foundation for production RAG.
Prerequisites and Setup
What You'll Need
Before getting started, make sure you have:
- A Google AI Studio API key (free at aistudio.google.com)
- A Pinecone API key (free tier available at app.pinecone.io)
Installing Dependencies
pip install google-generativeai pinecone python-dotenv

Create a .env file to store your credentials:

# .env
GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
PINECONE_API_KEY=YOUR_PINECONE_API_KEY

Creating a Pinecone Index
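Before creating the index, it can be worth confirming that both keys from the .env file were actually loaded. The following is a minimal sketch rather than part of either SDK: `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and the check only looks for unset or empty variables, not for valid credentials.

```python
import os

# Variable names taken from the .env file shown in this guide
REQUIRED_KEYS = ("GOOGLE_API_KEY", "PINECONE_API_KEY")


def missing_keys(required=REQUIRED_KEYS, env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Demonstration with an explicit mapping instead of the real environment
print(missing_keys(env={"GOOGLE_API_KEY": "abc"}))  # ['PINECONE_API_KEY']
```

Calling `missing_keys()` with no arguments after `load_dotenv()` checks the real environment; an empty list means both keys are present.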
When creating a Pinecone index, the dimension must match the length of the vectors your embedding model returns. This guide works with 768-dimensional Gemini embeddings, so the index is created with dimension=768.
import os

from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec

load_dotenv()

# Initialize the Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
INDEX_NAME = "gemini-rag-index"

# Only create the index if it doesn't already exist
if INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(
        name=INDEX_NAME,
        dimension=768,    # Must equal the length of the embedding vectors
        metric="cosine",  # Cosine similarity works well for text search
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1",  # Available on the free tier
        ),
    )
    print(f"✅ Index '{INDEX_NAME}' created successfully")
else:
    print(f"✅ Index '{INDEX_NAME}' already exists")

index = pc.Index(INDEX_NAME)

Generating Embeddings and Indexing Documents
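Because the index only accepts vectors whose length equals its configured dimension, a small guard before upserting can catch a mismatch early. The sketch below is illustrative only: `check_dimensions` and `EXPECTED_DIM` are hypothetical names, not part of the Pinecone or Gemini SDKs, and it assumes embeddings arrive as plain lists of floats.

```python
EXPECTED_DIM = 768  # Assumed to equal the dimension the index was created with


def check_dimensions(vectors: list[list[float]], expected_dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError if any vector's length differs from expected_dim."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"Vector {position} has {len(vector)} dimensions, expected {expected_dim}"
            )


# Passes silently when every vector has the expected length
check_dimensions([[0.0] * 768, [0.1] * 768])
```

Failing fast on the client side gives a clearer error message than waiting for the upsert request to be rejected.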
Generating Embeddings with Gemini

import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])


def embed_texts(texts: list[str], task_type: str = "RETRIEVAL_DOCUMENT") -> list[list[float]]:
    """
    Convert a list of texts to embeddings using the Gemini Embedding API.

    task_type:
        RETRIEVAL_DOCUMENT: use when indexing documents
        RETRIEVAL_QUERY: use when embedding search queries
    """
    result = genai.embed_content(
        model="models/gemini-embedding-001",
        content=texts,
        task_type=task_type,
        output_dimensionality=768,  # Must match the Pinecone index dimension
    )
    # For a list input, result["embedding"] holds one vector per input text
    return result["embedding"]

The task_type parameter is an important detail. Using RETRIEVAL_DOCUMENT at index time and RETRIEVAL_QUERY at search time tells Gemini to optimize each embedding for its role, which typically improves retrieval quality.
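One way to avoid mixing up the two task types is to pin each one behind its own wrapper. This is only a sketch of that design idea: `make_embedders` is a hypothetical helper, and it assumes an embedding function with the same `(texts, task_type)` signature as `embed_texts`. A stand-in embedder is used so the example runs without an API key.

```python
from typing import Callable

# Signature assumed to match the embed_texts helper: (texts, task_type) -> vectors
EmbedFn = Callable[[list[str], str], list[list[float]]]


def make_embedders(embed_fn: EmbedFn):
    """Return (embed_documents, embed_query) wrappers that pin the task type."""

    def embed_documents(texts: list[str]) -> list[list[float]]:
        return embed_fn(texts, "RETRIEVAL_DOCUMENT")

    def embed_query(text: str) -> list[float]:
        # A query is a single string, so unwrap the single resulting vector
        return embed_fn([text], "RETRIEVAL_QUERY")[0]

    return embed_documents, embed_query


# Offline demonstration: a fake embedder that records the task type it receives
calls = []


def fake_embed(texts: list[str], task_type: str) -> list[list[float]]:
    calls.append(task_type)
    return [[float(len(t))] for t in texts]


embed_documents, embed_query = make_embedders(fake_embed)
embed_documents(["some passage"])
embed_query("a question")
print(calls)  # ['RETRIEVAL_DOCUMENT', 'RETRIEVAL_QUERY']
```

In real use, passing `embed_texts` instead of `fake_embed` would give indexing code and search code separate entry points, so neither has to remember the task-type string.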
Uploading Documents to Pinecone

def upsert_documents(docs: list[dict], index, batch_size: int = 50) -> None:
    """
    Index a list of documents into Pinecone in batches.
    Each doc should have: {"id": str, "text": str, "metadata": dict}
    """
    for i in range(0, len(docs), batch_size):
        batch = docs[i : i + batch_size]
        texts = [d["text"] for d in batch]
        embeds = embed_texts(texts, task_type="RETRIEVAL_DOCUMENT")
        vectors = [
            {
                "id": d["id"],
                "values": embed,
                # Store the original text alongside the vector so it can be returned at query time
                "metadata": {**d.get("metadata", {}), "text": d["text"]},
            }
            for d, embed in zip(batch, embeds)
        ]
        index.upsert(vectors=vectors)
        print(f"  Indexed: {i + len(batch)}/{len(docs)} documents")


# Example usage
sample_docs = [
    {
        "id": "doc-001",
        "text": "Gemini 2.5 Pro is Google's most capable AI model, released in March 2026.",
        "metadata": {"source": "news", "lang": "en"},
    },
    {
        "id": "doc-002",
        "text": "Pinecone is a serverless vector database designed for large-scale AI applications.",
        "metadata": {"source": "docs", "lang": "en"},
    },
    {
        "id": "doc-003",
        "text": "RAG (Retrieval-Augmented Generation) improves LLM accuracy by grounding responses in retrieved context.",
        "metadata": {"source": "tutorial", "lang": "en"},
    },
]
upsert_documents(sample_docs, index)
# Expected output: Indexed: 3/3 documents

Building the RAG Pipeline
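Conceptually, retrieval ranks stored vectors by their similarity to the query vector and returns the best matches. The pure-Python sketch below illustrates cosine ranking on toy 2-dimensional vectors. It is not how Pinecone is implemented (a managed vector database uses specialized indexing rather than a linear scan), and `cosine_similarity` and `top_k_matches` are hypothetical names chosen for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query: list[float], stored: dict[str, list[float]], top_k: int = 2):
    """Score every stored vector against the query and keep the top_k highest."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "index": three documents pointing in different directions
toy_index = {"doc-a": [1.0, 0.0], "doc-b": [0.7, 0.7], "doc-c": [0.0, 1.0]}
for doc_id, score in top_k_matches([1.0, 0.1], toy_index):
    print(doc_id, round(score, 3))
```

The query points almost along the first axis, so doc-a ranks first and doc-b second, while doc-c falls outside the top two.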
Retrieving Relevant Context

def search_context(query: str, index, top_k: int = 5) -> list[dict]:
    """
    Retrieve documents from Pinecone that are semantically similar to the query.
    Returns: [{"text": ..., "score": ..., "metadata": ...}, ...]
    """
    # Embed the query with RETRIEVAL_QUERY task type
    query_embed = embed_texts([query], task_type="RETRIEVAL_QUERY")[0]
    results = index.query(
        vector=query_embed,
        top_k=top_k,
        include_metadata=True,
    )
    contexts = []
    for match in results["matches"]:
        contexts.append({
            "text": match["metadata"].get("text", ""),
            "score": match["score"],
            "metadata": match["metadata"],
        })
    return contexts

Generating Answers with Retrieved Context
model = genai.GenerativeModel("gemini-2.5-flash")


def answer_with_rag(question: str, index, top_k: int = 5) -> str:
    """
    Full RAG pipeline: retrieve relevant docs → build context → generate answer.
    """
    # Step 1: Semantic search
    contexts = search_context(question, index, top_k=top_k)
    if not contexts:
        return "No relevant documents found for your query."

    # Step 2: Format retrieved context into the prompt
    context_text = "\n\n".join(
        f"[Source {i+1} (similarity: {c['score']:.3f})]\n{c['text']}"
        for i, c in enumerate(contexts)
    )
    prompt = f"""Answer the question below using only the provided reference information.
If the answer is not contained in the sources, say "I don't have that information."
## Reference Information
{context_text}
## Question
{question}
"""
    # Step 3: Generate the answer from the grounded prompt
    response = model.generate_content(prompt)
    return response.text


# Example usage
answer = answer_with_rag("What is Gemini 2.5 Pro?", index)
print(answer)
# Expected output (exact wording may vary): Gemini 2.5 Pro is Google's most capable AI model, released in March 2026.

Production Optimization Tips
Metadata Filtering for Precision
Pinecone supports filtering by metadata fields before performing vector search. When you have large document collections, filtering first and then searching within a subset is both faster and more precise.
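# Embed the query first (reuses embed_texts from earlier in this guide)
query_embed = embed_texts(["What is Pinecone?"], task_type="RETRIEVAL_QUERY")[0]

# Only search within English-language documents
filtered_results = index.query(
    vector=query_embed,
    top_k=5,
    filter={"lang": {"$eq": "en"}},  # Metadata filter
    include_metadata=True,
)

Chunking Strategy for Better Retrieval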
Embedding an entire long document into a single vector dilutes its information. In production, splitting documents into smaller chunks usually improves recall, because each vector then represents one focused passage.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """
    Split text into overlapping chunks.

    chunk_size: maximum characters per chunk
    overlap: characters shared between consecutive chunks (preserves context continuity)
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid a redundant trailing fragment
        start = end - overlap  # Step back by the overlap amount
    return chunks

A chunk size of 400–600 characters with a 50-character overlap is a reasonable starting point for most document types; tune it against your own retrieval results.
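Once a document is split, each chunk needs its own ID and enough metadata to trace it back to its parent. The sketch below shows one possible way to produce records in the `{"id", "text", "metadata"}` shape that `upsert_documents` expects. The `#chunk-N` ID scheme and the `parent_id` and `chunk_index` metadata fields are assumptions made for illustration, and `split_with_overlap` is a compact stand-in splitter included so the example runs on its own.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Character-based splitter with overlapping windows (stand-in for a real chunker)."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The window already reached the end of the text
    return chunks


def chunk_document(doc: dict, chunk_size: int = 500, overlap: int = 50) -> list[dict]:
    """Turn one {"id", "text", "metadata"} document into per-chunk records."""
    records = []
    for n, chunk in enumerate(split_with_overlap(doc["text"], chunk_size, overlap)):
        records.append({
            "id": f"{doc['id']}#chunk-{n}",  # Hypothetical ID scheme
            "text": chunk,
            "metadata": {
                **doc.get("metadata", {}),
                "parent_id": doc["id"],   # Lets you find all chunks of one document
                "chunk_index": n,         # Preserves the original order
            },
        })
    return records


long_doc = {"id": "doc-100", "text": "x" * 1200, "metadata": {"lang": "en"}}
records = chunk_document(long_doc)
print([r["id"] for r in records])
print([len(r["text"]) for r in records])
```

Keeping the parent ID in metadata also makes cleanup easier later, since all chunks of a stale document can be identified together.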
Cost Management on Pinecone's Free Tier
Pinecone's free plan includes 5 indexes and 2 GB of storage. Here are a few tips to stay within those limits:
- Delete stale vectors regularly: index.delete(ids=["old-doc-001"]) removes individual entries
- Use batched upserts: Sending 50–100 vectors per request is far more efficient than one at a time
- Monitor usage in the dashboard: Keep an eye on your monthly query count to avoid throttling
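To get a feel for how a storage limit translates into a vector count, a back-of-the-envelope estimate can help. The sketch below rests on stated assumptions: each dimension is stored as a 4-byte float, and the average metadata size of 600 bytes per vector is a made-up placeholder. Pinecone's actual storage accounting may differ, so treat the numbers as rough orders of magnitude. Both function names are hypothetical.

```python
def estimate_storage_bytes(num_vectors: int, dimension: int = 768,
                           avg_metadata_bytes: int = 600) -> int:
    """Rough size estimate: 4 bytes per dimension plus metadata, per vector."""
    return num_vectors * (dimension * 4 + avg_metadata_bytes)


def max_vectors_for_budget(budget_bytes: int, dimension: int = 768,
                           avg_metadata_bytes: int = 600) -> int:
    """How many vectors fit in budget_bytes under the same assumptions."""
    per_vector = dimension * 4 + avg_metadata_bytes
    return budget_bytes // per_vector


TWO_GB = 2 * 1024 ** 3
print(estimate_storage_bytes(100_000))  # Estimated bytes for 100k chunks
print(max_vectors_for_budget(TWO_GB))   # Approximate vector capacity of 2 GB
```

Because the chunk text is stored in metadata in this guide, longer chunks raise the per-vector cost, which is one more reason to keep chunk sizes moderate.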
For tips on reducing Gemini Embedding API costs, see the Gemini API Context Caching Complete Guide.
Wrapping Up
This guide covered the essential building blocks for a Gemini + Pinecone RAG system:
- Index creation: 768 dimensions, cosine similarity, serverless on AWS
- Embedding generation: task_type distinction between documents and queries
- Batched upserts: 50 vectors per request keeps the API efficient
- Metadata filtering: narrows the search scope for better precision
- Chunking: 500-char chunks with 50-char overlap preserve context
If you want to take this further with multimodal RAG — handling images, PDFs, and video alongside text — the Gemini API Multimodal RAG Pipeline Production Guide goes deep on those patterns.
Gemini + Pinecone is one of the fastest paths from idea to production for RAG. Give it a try with your own knowledge base and see the difference grounded, context-aware answers make.