Building a Production RAG System with Gemma 4: Local LLM + Vector Search Architecture

Every developer who has tried to run a local RAG system eventually hits the same wall: how do you connect a locally running LLM to a vector database efficiently, and what does "production-ready" actually look like in that context?

Gemma 4 changes some of the assumptions underlying conventional RAG design. The 256K context window on the 31B Dense and 26B MoE models means that "retrieve a few chunks and concatenate them" is no longer the only viable pattern. Apache 2.0 licensing means the model itself can go into a commercial product without custom license terms; the other components in the stack carry their own licenses. This guide builds from architecture to production optimization.

Why Gemma 4 Changes RAG Design Assumptions

Traditional RAG systems were built around a core trade-off: context windows were short (4K to 32K tokens), so documents had to be chunked aggressively. Aggressive chunking causes context loss at chunk boundaries — a chunk might contain the answer, but without the surrounding paragraph the model misses the nuance.
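
One common way to soften boundary loss is to let adjacent chunks overlap. The sketch below is a minimal illustration rather than part of the stack described here: `chunk_with_overlap` and its parameters are hypothetical names, and it counts whitespace-separated words as a rough stand-in for tokens.

```python
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into word-based chunks where neighbours share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

A sentence that straddles a boundary then appears whole in at least one chunk, at the cost of storing and embedding the overlapping words twice.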

Gemma 4's 256K context window (31B Dense and 26B MoE) makes Long-Context RAG practical. Instead of passing the top 3 chunks to the model, you can pass the top 100. At 1,000 tokens per chunk, that's 100K tokens — well within the budget. The model can see relationships across retrieved documents directly, rather than trying to infer them from isolated chunks.
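
The budget arithmetic above is easy to check in code. This is a small sketch under two assumptions: that "256K" means 262,144 tokens, and that some tokens are reserved for the instructions and the answer. The function name and the reserve size are illustrative.

```python
def max_chunks_in_budget(context_window: int, tokens_per_chunk: int,
                         reserved_tokens: int = 0) -> int:
    """Number of whole chunks that fit after reserving room for prompt and answer."""
    usable = context_window - reserved_tokens
    return max(usable // tokens_per_chunk, 0)

window = 256 * 1024      # assumption: 256K = 262,144 tokens
used = 100 * 1_000       # 100 chunks at 1,000 tokens each
print(f"used={used}, remaining={window - used}")
print(max_chunks_in_budget(window, 1_000, reserved_tokens=8_000))
```

Real chunk sizes vary, so counting tokens with the model's own tokenizer gives a safer estimate than a fixed per-chunk figure.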

The shift to Apache 2.0 from Gemma 3's license is the other key change. Earlier Gemma releases shipped under a custom license whose use restrictions complicated commercial product integration. Gemma 4 removes that barrier: SaaS deployment, OEM bundling, and derivative commercial models are all permitted under the Apache 2.0 terms.

System Architecture

A production RAG stack with Gemma 4 has five layers:

┌──────────────────────────────────────────────┐
│ Layer 1: Document Processing                  │
│   PDF/HTML/Markdown → Text extraction         │
│   → Chunking strategy → Embedding generation  │
├──────────────────────────────────────────────┤
│ Layer 2: Vector Store                         │
│   ChromaDB (dev / mid-scale)                  │
│   pgvector (PostgreSQL-integrated / large)    │
├──────────────────────────────────────────────┤
│ Layer 3: Retrieval                            │
│   Semantic search + Keyword search (BM25)     │
│   Hybrid ranking (Reciprocal Rank Fusion)     │
├──────────────────────────────────────────────┤
│ Layer 4: Generation                           │
│   Gemma 4 (Ollama / Gemini API)               │
│   Long-Context RAG or standard chunked RAG    │
├──────────────────────────────────────────────┤
│ Layer 5: Caching & Optimization               │
│   Redis (query cache)                         │
│   Batched embedding generation                │
└──────────────────────────────────────────────┘

Embedding Model Selection

Gemma 4 is a generative model — it doesn't output embedding vectors directly. RAG requires a dedicated embedding model alongside Gemma 4.

Recommended options:

# Option 1: multilingual-e5-large (local, ~100 languages, MIT license)
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("intfloat/multilingual-e5-large")
# 1024 dimensions, strong multilingual performance

# Option 2: nomic-embed-text via Ollama (fully local, no external calls)
import requests

def embed_ollama(text: str) -> list:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return resp.json()["embedding"]

# Option 3: Google Gemini Embedding API (same ecosystem as Gemma 4)
import google.generativeai as genai
genai.configure(api_key="YOUR_GEMINI_API_KEY")

def embed_gemini(text: str) -> list:
    # Model name is an assumption; check the current Gemini embedding model list
    result = genai.embed_content(model="models/text-embedding-004", content=text)
    return result["embedding"]

For a fully offline deployment, pair multilingual-e5-large (sentence-transformers) with Gemma 4 via Ollama. Both carry permissive licenses (MIT and Apache 2.0 respectively) and run without internet access.
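
One detail worth knowing when using the E5 family: the model card asks for each input to start with "query: " or "passage: ", depending on whether it is a search query or a document to index. A minimal helper might look like this (the function name is hypothetical):

```python
def with_e5_prefix(texts: list, kind: str) -> list:
    """Prepend the 'query: ' or 'passage: ' prefix that E5 models expect."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {t}" for t in texts]

print(with_e5_prefix(["When was Gemma 4 released?"], "query"))
```

Documents would go through `with_e5_prefix(docs, "passage")` before indexing and user questions through `with_e5_prefix([question], "query")` before search. Skipping the prefixes still produces vectors, but retrieval quality may suffer.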

ChromaDB Integration

ChromaDB is the right choice for development environments and moderate scale (up to a few million documents). Its Python-native API requires no infrastructure setup:

import chromadb
import uuid
from chromadb.utils import embedding_functions
 
client = chromadb.PersistentClient(path="./chroma_db")
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="intfloat/multilingual-e5-large"
)
 
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}
)
 
def add_documents(texts: list, metadatas: list = None):
    ids = [str(uuid.uuid4()) for _ in texts]
    collection.add(
        documents=texts,
        # Pass None when there is no metadata; some ChromaDB versions reject empty dicts
        metadatas=metadatas,
        ids=ids
    )
 
def semantic_search(query: str, n_results: int = 20, filters: dict = None) -> list:
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=filters
    )
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )
    ]
 
# Add a document
add_documents(
    texts=["Gemma 4 was released by Google in April 2026 under Apache 2.0."],
    metadatas=[{"source": "gemma4_overview.pdf", "page": 1}]
)
 
# Retrieve
results = semantic_search("When was Gemma 4 released?")

pgvector Integration (Large Scale, PostgreSQL)

For production systems already running PostgreSQL, pgvector extends the database with vector similarity search. The main advantage is that vector results can be JOINed with relational data (user profiles, access control lists, document metadata) in a single query.

import psycopg2
import psycopg2.extras
import numpy as np
from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("intfloat/multilingual-e5-large")
 
class PgVectorStore:
    def __init__(self, conn_string: str):
        self.conn = psycopg2.connect(conn_string)
        self._init_schema()
 
    def _init_schema(self):
        with self.conn.cursor() as cur:
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
            # IVFFlat clusters are computed when the index is built, so for good
            # recall build (or rebuild) the index after the table has data.
            cur.execute("""
                CREATE TABLE IF NOT EXISTS documents (
                    id SERIAL PRIMARY KEY,
                    content TEXT NOT NULL,
                    embedding vector(1024),
                    metadata JSONB DEFAULT '{}',
                    created_at TIMESTAMP DEFAULT NOW()
                );
                CREATE INDEX IF NOT EXISTS docs_embedding_idx
                ON documents USING ivfflat (embedding vector_cosine_ops)
                WITH (lists = 100);
            """)
            self.conn.commit()
 
    def upsert(self, content: str, metadata: dict = None):
        embedding = model.encode(content).tolist()
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s, %s)",
                (content, embedding, psycopg2.extras.Json(metadata or {}))
            )
            self.conn.commit()
 
    def search(self, query: str, k: int = 20, min_sim: float = 0.6) -> list:
        q_emb = model.encode(query).tolist()
        with self.conn.cursor() as cur:
            cur.execute("""
                SELECT content, metadata,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM documents
                WHERE 1 - (embedding <=> %s::vector) > %s
                ORDER BY embedding <=> %s::vector
                LIMIT %s;
            """, (q_emb, q_emb, min_sim, q_emb, k))
            return [
                {"content": r[0], "metadata": r[1], "similarity": float(r[2])}
                for r in cur.fetchall()
            ]
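
The JOIN advantage mentioned at the start of this section can be sketched as a query that only returns documents a given user may read. The `document_acl` table and its columns are hypothetical; the `documents` table is the one created above.

```python
ACL_SEARCH_SQL = """
    SELECT d.content, d.metadata,
           1 - (d.embedding <=> %s::vector) AS similarity
    FROM documents AS d
    JOIN document_acl AS acl ON acl.document_id = d.id  -- hypothetical ACL table
    WHERE acl.user_id = %s
    ORDER BY d.embedding <=> %s::vector
    LIMIT %s;
"""

def acl_search_params(query_embedding: list, user_id: int, k: int = 20) -> tuple:
    """Return parameters in the order the placeholders appear in ACL_SEARCH_SQL."""
    return (query_embedding, user_id, query_embedding, k)

# Usage with an open psycopg2 cursor (not executed here):
#   cur.execute(ACL_SEARCH_SQL, acl_search_params(q_emb, user_id=42))
```

Filtering inside the database means unauthorized documents never reach the application layer, which is harder to guarantee when the vector store and the permission data live in separate systems.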

Long-Context Generation with Gemma 4

Once the chunks are retrieved, pass them to Gemma 4. The 256K context makes it possible to include all top results in a single prompt:

import requests
 
def generate_answer(
    query: str,
    chunks: list,
    model_id: str = "gemma4:27b",  # confirm the exact tag with `ollama list`
    long_context: bool = True
) -> str:
 
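    if long_context and len(chunks) > 5:
        # Long-Context RAG: all chunks in one prompt
        parts = []
        for c in chunks:
            # Metadata may be missing or None, so fall back to an empty dict
            source = (c.get("metadata") or {}).get("source", "unknown")
            parts.append(f"[Source: {source}]\n{c['text']}")
        context = "\n\n---\n\n".join(parts)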
        prompt = (
            f"Using the documents below, answer the question accurately.\n\n"
            f"# Documents\n{context}\n\n"
            f"# Question\n{query}\n\n"
            f"# Answer (cite sources used)"
        )
    else:
        # Standard RAG: top 3 chunks
        context = "\n\n".join([c["text"] for c in chunks[:3]])
        prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
 
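    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model_id,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.1,
                "top_p": 0.9,
                # Ollama's default context window is only a few thousand tokens;
                # without num_ctx a long prompt is truncated. Both values are
                # examples and must fit in available memory.
                "num_ctx": 131072 if long_context else 8192,
            }
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]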

Hybrid Search with Reciprocal Rank Fusion

Semantic search misses exact keyword matches; BM25 misses semantic similarity. Combining both with RRF recovers most of the gaps:

from rank_bm25 import BM25Okapi
import re
 
class HybridRetriever:
    def __init__(self, vector_store, corpus: list):
        self.vs = vector_store
        self.corpus = corpus
        tokenized = [re.findall(r"\w+", d.lower()) for d in corpus]
        self.bm25 = BM25Okapi(tokenized)
 
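    def search(self, query: str, k: int = 20, alpha: float = 0.6, rrf_k: int = 60) -> list:
        # Semantic ranking, best first. The vector store must expose
        # search(query, k=...); results may use a "text" or "content" key.
        sem_docs = [r.get("text") or r.get("content") for r in self.vs.search(query, k=k * 2)]

        # BM25 ranking, best first
        tokens = re.findall(r"\w+", query.lower())
        scores = self.bm25.get_scores(tokens)
        bm25_docs = [self.corpus[i] for i in scores.argsort()[::-1][:k * 2]]

        # Reciprocal Rank Fusion: each ranker contributes weight / (rrf_k + rank),
        # so only rank positions matter, not the raw score scales.
        # alpha weights the semantic ranker; alpha=0.5 gives the plain RRF ordering.
        fused = {}
        for weight, ranking in ((alpha, sem_docs), (1 - alpha, bm25_docs)):
            for rank, doc in enumerate(ranking, start=1):
                fused[doc] = fused.get(doc, 0.0) + weight / (rrf_k + rank)
        return [
            {"text": d, "score": s}
            for d, s in sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]
        ]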
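
In its usual form, Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every ranker that returned it. The standalone sketch below shows that step on two toy rankings. The function name is illustrative, and 60 is the conventional default for the constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list, rrf_k: int = 60) -> list:
    """Fuse ranked lists of document IDs; returns (doc, score) pairs, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]
keyword = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([semantic, keyword])
print([doc for doc, _ in fused])  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both rankers agree on rise to the top, while a document found by only one ranker still survives with a lower score. Because only ranks are used, cosine similarities and BM25 scores never need to be normalized against each other.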

Production Optimization: Batched Embeddings and Caching

For ingesting large document collections, batch embedding reduces wall-clock time significantly:

import redis
import hashlib
import json
from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("intfloat/multilingual-e5-large")
cache = redis.Redis(host="localhost", port=6379, db=0)
 
def embed_batch_cached(texts: list, ttl: int = 86400) -> list:
    results = [None] * len(texts)
    uncached_idx = []
 
    for i, text in enumerate(texts):
        key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"
        hit = cache.get(key)
        if hit:
            results[i] = json.loads(hit)
        else:
            uncached_idx.append((i, text, key))
 
    if uncached_idx:
        batch_texts = [t for _, t, _ in uncached_idx]
        # batch_size=64 balances GPU memory and throughput
        embeddings = model.encode(batch_texts, batch_size=64).tolist()
        for (i, _, key), emb in zip(uncached_idx, embeddings):
            results[i] = emb
            cache.setex(key, ttl, json.dumps(emb))
 
    return results
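
The architecture diagram also lists a query cache, which the code above does not cover. A minimal sketch of one follows, assuming answers are cached per normalized query and model. `answer_cache_key` and `cached_answer` are hypothetical helpers, and `store` can be the `redis.Redis` client or anything with the same `get` and `setex` methods.

```python
import hashlib
import json

def answer_cache_key(query: str, model_id: str) -> str:
    """Key on the normalised query plus model, so a model change misses the cache."""
    normalised = " ".join(query.lower().split())
    digest = hashlib.sha256(f"{model_id}\n{normalised}".encode("utf-8")).hexdigest()
    return f"ans:{digest}"

def cached_answer(store, query: str, model_id: str, compute, ttl: int = 3600) -> str:
    """Return a cached answer if present; otherwise call compute(query) and store it.

    `store` is anything with redis-style get(key) and setex(key, ttl, value).
    """
    key = answer_cache_key(query, model_id)
    hit = store.get(key)
    if hit is not None:
        return json.loads(hit)
    answer = compute(query)
    store.setex(key, ttl, json.dumps(answer))
    return answer
```

Exact-match caching like this only helps with repeated questions, and cached answers go stale when the underlying documents change, so the TTL should reflect how often the knowledge base is updated.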

Where This Stack Performs Best

Gemma 4 + ChromaDB/pgvector is at its best in three categories of use case:

Internal knowledge bases with privacy constraints: Documents that can't go to external APIs (legal, financial, medical) run entirely on-premises. Apache 2.0 means no per-seat or per-query licensing costs on top of compute.

Multilingual support desks: Gemma 4's 140+ language support with a multilingual embedding model handles mixed-language tickets (Japanese, English, German, Arabic) from a single model instance.

Large codebase search: With 256K context, a small to mid-sized repository can fit in a single prompt, as long as its source stays within the token budget. Dependency relationships that span files are then visible to the model directly, with no chunking-induced context loss.

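Start with ollama pull gemma4:27b (confirm the exact tag in the Ollama model library, since the sizes discussed here are 31B Dense and 26B MoE) and ChromaDB for local validation. Migrate to pgvector when the document volume exceeds a few million or you need SQL JOINs with relational data.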