Every developer who has tried to run a local RAG system eventually hits the same wall: how do you connect a locally running LLM to a vector database efficiently, and what does "production-ready" actually look like in that context?
Gemma 4 changes some of the assumptions underlying conventional RAG design. The 256K context window on the 31B Dense and 26B MoE models means that "retrieve a few chunks and concatenate them" is no longer the only viable pattern. Apache 2.0 licensing means every component in the stack can go into a commercial product without legal friction. This guide builds from architecture to production optimization.
Why Gemma 4 Changes RAG Design Assumptions
Traditional RAG systems were built around a core trade-off: context windows were short (4K to 32K tokens), so documents had to be chunked aggressively. Aggressive chunking causes context loss at chunk boundaries — a chunk might contain the answer, but without the surrounding paragraph the model misses the nuance.
Gemma 4's 256K context window (31B Dense and 26B MoE) makes Long-Context RAG practical. Instead of passing the top 3 chunks to the model, you can pass the top 100. At 1,000 tokens per chunk, that's 100K tokens — well within the budget. The model can see relationships across retrieved documents directly, rather than trying to infer them from isolated chunks.
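The budget arithmetic above can be sketched in a few lines. This is a rough illustration rather than a tokenizer-accurate count: the 256K window and the 1,000-token chunks come from the text, while the reserved answer and prompt-overhead sizes, the estimate of about 4 characters per token, and the function names are all assumptions.

```python
CONTEXT_WINDOW = 256_000  # tokens, per the 31B Dense / 26B MoE figures above


def max_chunks(tokens_per_chunk: int, reserved_for_answer: int = 4_000,
               prompt_overhead: int = 500) -> int:
    """How many equally sized chunks fit once answer and prompt space are reserved."""
    budget = CONTEXT_WINDOW - reserved_for_answer - prompt_overhead
    return max(budget // tokens_per_chunk, 0)


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption: ~4 characters per token); use a real tokenizer in practice."""
    return max(len(text) // 4, 1)


def pack_chunks(ranked_chunks: list[str], token_budget: int) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            break
        packed.append(chunk)
        used += cost
    return packed


print(max_chunks(1_000))  # upper bound on 1,000-token chunks under these assumptions
```

Packing in ranked order keeps the most relevant chunks when the budget runs out, which matters more as the number of retrieved chunks grows.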
The move from Gemma 3's custom license to Apache 2.0 is the other key change. The previous custom terms carried use restrictions that complicated commercial product integration. Gemma 4 removes that barrier: SaaS deployment, OEM bundling, and derivative commercial models are all permitted.
System Architecture
A production RAG stack with Gemma 4 has five layers:
┌──────────────────────────────────────────────┐
│ Layer 1: Document Processing │
│ PDF/HTML/Markdown → Text extraction │
│ → Chunking strategy → Embedding generation │
├──────────────────────────────────────────────┤
│ Layer 2: Vector Store │
│ ChromaDB (dev / mid-scale) │
│ pgvector (PostgreSQL-integrated / large) │
├──────────────────────────────────────────────┤
│ Layer 3: Retrieval │
│ Semantic search + Keyword search (BM25) │
│ Hybrid ranking (Reciprocal Rank Fusion) │
├──────────────────────────────────────────────┤
│ Layer 4: Generation │
│ Gemma 4 (Ollama / Gemini API) │
│ Long-Context RAG or standard chunked RAG │
├──────────────────────────────────────────────┤
│ Layer 5: Caching & Optimization │
│ Redis (query cache) │
│ Batched embedding generation │
└──────────────────────────────────────────────┘
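Layer 1's chunking step is where the boundary loss described earlier is usually mitigated. One common approach, sketched here under the assumption of simple word-based windows (the function name, window size, and overlap are illustrative and not taken from this guide), is to let consecutive chunks share some overlap so that a sentence cut at one boundary still appears whole in the neighboring chunk:

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

With a long context window the chunks can be larger than in a classic setup, so the overlap costs proportionally less of the prompt budget.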
Embedding Model Selection
Gemma 4 is a generative model — it doesn't output embedding vectors directly. RAG requires a dedicated embedding model alongside Gemma 4.
Recommended options:
# Option 1: multilingual-e5-large (local, roughly 100 languages, MIT license)
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("intfloat/multilingual-e5-large")
# 1024 dimensions, strong multilingual performance.
# Note: E5 models expect "query: " / "passage: " prefixes on their inputs.

# Option 2: nomic-embed-text via Ollama (fully local, no external calls)
import requests

def embed_ollama(text: str) -> list:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()  # fail loudly if the Ollama server returns an error
    return resp.json()["embedding"]

# Option 3: Google Gemini Embedding API (same ecosystem as Gemma 4)
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

For a fully offline deployment, pair multilingual-e5-large (sentence-transformers) with Gemma 4 via Ollama. Both are permissively licensed (MIT and Apache 2.0, respectively) and run without internet access.
ChromaDB Integration
ChromaDB is the right choice for development environments and moderate scale (up to a few million documents). Its Python-native API requires no infrastructure setup:
import chromadb
import uuid
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="intfloat/multilingual-e5-large"
)
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}
)

def add_documents(texts: list, metadatas: list = None):
    # One random UUID per document keeps ids unique across ingestion runs
    ids = [str(uuid.uuid4()) for _ in texts]
    collection.add(
        documents=texts,
        # Pass None rather than empty dicts: recent ChromaDB versions reject {} as metadata
        metadatas=metadatas,
        ids=ids
    )

def semantic_search(query: str, n_results: int = 20, filters: dict = None) -> list:
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=filters
    )
    # Results are nested per query; index 0 is the single query sent above
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )
    ]

# Add a document
add_documents(
    texts=["Gemma 4 was released by Google in April 2026 under Apache 2.0."],
    metadatas=[{"source": "gemma4_overview.pdf", "page": 1}]
)

# Retrieve
results = semantic_search("When was Gemma 4 released?")

pgvector Integration (Large Scale, PostgreSQL)
For production systems already running PostgreSQL, pgvector extends the database with vector similarity search. The main advantage: JOIN between vector results and relational data — user profiles, access control lists, document metadata — in a single query.
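To make the JOIN advantage concrete, here is a sketch of a query builder that restricts vector results to documents a given user may read. The `document_acl` table and its `document_id` and `user_id` columns are hypothetical; the `documents` table and the `<=>` cosine-distance operator follow the pgvector usage in this section.

```python
def build_acl_search(query_embedding: list[float], user_id: int, k: int = 20):
    """Return (sql, params) for a permission-filtered similarity search.

    Assumes a hypothetical document_acl(document_id, user_id) table
    alongside the documents table used in this section.
    """
    sql = """
        SELECT d.content, d.metadata,
               1 - (d.embedding <=> %s::vector) AS similarity
        FROM documents AS d
        JOIN document_acl AS acl ON acl.document_id = d.id
        WHERE acl.user_id = %s
        ORDER BY d.embedding <=> %s::vector
        LIMIT %s;
    """
    # Parameter order must match the order of %s placeholders in the query
    params = (query_embedding, user_id, query_embedding, k)
    return sql, params
```

The returned pair can be passed to a psycopg2 cursor as `cur.execute(sql, params)`. Filtering inside the database means documents the user cannot read are never returned to the application or placed in a prompt.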
import psycopg2
import psycopg2.extras
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

class PgVectorStore:
    def __init__(self, conn_string: str):
        self.conn = psycopg2.connect(conn_string)
        self._init_schema()

    def _init_schema(self):
        with self.conn.cursor() as cur:
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
            cur.execute("""
                CREATE TABLE IF NOT EXISTS documents (
                    id SERIAL PRIMARY KEY,
                    content TEXT NOT NULL,
                    embedding vector(1024),
                    metadata JSONB DEFAULT '{}',
                    created_at TIMESTAMP DEFAULT NOW()
                );
                CREATE INDEX IF NOT EXISTS docs_embedding_idx
                ON documents USING ivfflat (embedding vector_cosine_ops)
                WITH (lists = 100);
            """)
        self.conn.commit()

    def upsert(self, content: str, metadata: dict = None):
        embedding = model.encode(content).tolist()
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s, %s)",
                (content, embedding, psycopg2.extras.Json(metadata or {}))
            )
        self.conn.commit()

    def search(self, query: str, k: int = 20, min_sim: float = 0.6) -> list:
        q_emb = model.encode(query).tolist()
        with self.conn.cursor() as cur:
            # <=> is cosine distance, so similarity = 1 - distance
            cur.execute("""
                SELECT content, metadata,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM documents
                WHERE 1 - (embedding <=> %s::vector) > %s
                ORDER BY embedding <=> %s::vector
                LIMIT %s;
            """, (q_emb, q_emb, min_sim, q_emb, k))
            return [
                {"content": r[0], "metadata": r[1], "similarity": float(r[2])}
                for r in cur.fetchall()
            ]

Long-Context Generation with Gemma 4
With retrieved chunks, pass them to Gemma 4. The 256K context enables passing all top results in a single prompt:
import requests

def generate_answer(
    query: str,
    chunks: list,
    model_id: str = "gemma4:27b",  # tag as written in this guide; confirm the exact tag with `ollama list`
    long_context: bool = True
) -> str:
    if long_context and len(chunks) > 5:
        # Long-Context RAG: all chunks in one prompt, each labeled with its source
        parts = []
        for c in chunks:
            source = (c.get("metadata") or {}).get("source", "unknown")
            parts.append(f"[Source: {source}]\n{c['text']}")
        context = "\n---\n".join(parts)
        prompt = (
            f"Using the documents below, answer the question accurately.\n\n"
            f"# Documents\n{context}\n\n"
            f"# Question\n{query}\n\n"
            f"# Answer (cite sources used)"
        )
    else:
        # Standard RAG: top 3 chunks
        context = "\n\n".join([c["text"] for c in chunks[:3]])
        prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

    # Ollama's default context length is far smaller than 256K;
    # set "num_ctx" in options when sending long prompts.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model_id,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0.1, "top_p": 0.9}
        }
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return resp.json()["response"]

Hybrid Search with Reciprocal Rank Fusion
Semantic search misses exact keyword matches; BM25 misses semantic similarity. Combining both with RRF recovers most of the gaps:
from rank_bm25 import BM25Okapi
import re

class HybridRetriever:
    def __init__(self, vector_store, corpus: list):
        self.vs = vector_store
        self.corpus = corpus
        tokenized = [re.findall(r"\w+", d.lower()) for d in corpus]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query: str, k: int = 20, alpha: float = 0.6, rrf_k: int = 60) -> list:
        # Semantic ranking, best first; accept either "text" or "content" as the document key
        sem_docs = [r.get("text") or r.get("content") for r in self.vs.search(query, k=k * 2)]
        sem_rank = {d: pos + 1 for pos, d in enumerate(sem_docs)}

        # BM25 ranking, best first; skip documents with no keyword match
        tokens = re.findall(r"\w+", query.lower())
        scores = self.bm25.get_scores(tokens)
        top = [i for i in scores.argsort()[-k * 2:][::-1] if scores[i] > 0]
        bm25_rank = {self.corpus[i]: pos + 1 for pos, i in enumerate(top)}

        # Weighted RRF: each list contributes weight / (rrf_k + rank),
        # with alpha weighting the semantic list and (1 - alpha) the BM25 list
        fused = {}
        for d, rank in sem_rank.items():
            fused[d] = fused.get(d, 0.0) + alpha / (rrf_k + rank)
        for d, rank in bm25_rank.items():
            fused[d] = fused.get(d, 0.0) + (1 - alpha) / (rrf_k + rank)

        return [
            {"text": d, "score": s}
            for d, s in sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]
        ]

Production Optimization: Batched Embeddings and Caching
For ingesting large document collections, batch embedding reduces wall-clock time significantly:
import redis
import hashlib
import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")
cache = redis.Redis(host="localhost", port=6379, db=0)

def embed_batch_cached(texts: list, ttl: int = 86400) -> list:
    results = [None] * len(texts)
    uncached_idx = []

    # First pass: serve whatever is already in the cache
    for i, text in enumerate(texts):
        key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"
        hit = cache.get(key)
        if hit:
            results[i] = json.loads(hit)
        else:
            uncached_idx.append((i, text, key))

    # Second pass: embed the misses in one batched call and cache them
    if uncached_idx:
        batch_texts = [t for _, t, _ in uncached_idx]
        # batch_size=64 balances GPU memory and throughput
        embeddings = model.encode(batch_texts, batch_size=64).tolist()
        for (i, _, key), emb in zip(uncached_idx, embeddings):
            results[i] = emb
            cache.setex(key, ttl, json.dumps(emb))

    return results

Where This Stack Performs Best
Gemma 4 with ChromaDB or pgvector is at its strongest in three categories of use case:
Internal knowledge bases with privacy constraints: Documents that can't go to external APIs (legal, financial, medical) run entirely on-premises. Apache 2.0 means no per-seat or per-query licensing costs on top of compute.
Multilingual support desks: Gemma 4's 140+ language support with a multilingual embedding model handles mixed-language tickets (Japanese, English, German, Arabic) from a single model instance.
Large codebase search: With 256K context, an entire mid-sized repository fits in a single prompt. Dependency relationships that span files are visible to the model directly — no chunking-induced context loss.
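For the codebase case, a minimal sketch of packing a repository into one prompt might look like the following. The function name, the `# File:` header format, the file-suffix filter, and the 800,000-character cap (assuming roughly 4 characters per token, which leaves headroom under 256K) are illustrative assumptions, not part of any Gemma or Ollama API.

```python
from pathlib import Path


def pack_repository(root: str, suffixes: tuple = (".py",), max_chars: int = 800_000) -> str:
    """Concatenate source files under root, each prefixed with its relative path."""
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        body = path.read_text(encoding="utf-8", errors="ignore")
        block = f"# File: {path.relative_to(root)}\n{body}\n"
        if used + len(block) > max_chars:
            break  # stop before exceeding the character budget
        parts.append(block)
        used += len(block)
    return "\n".join(parts)
```

Keeping the relative path above each file lets the model cite file names and follow imports across files.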
Start with ollama pull gemma4:27b and ChromaDB for local validation. Migrate to pgvector when the document volume exceeds a few million or you need SQL JOINs with relational data.