Advanced · 2026-04-14

Building a Production RAG System with Gemma 4: Local LLM + Vector Search Architecture

A complete guide to building production RAG systems with Gemma 4, ChromaDB, and pgvector. Covers architecture design, chunking strategies, Long-Context RAG using the 256K window, hybrid search, and performance optimization.


Every developer who has tried to run a local RAG system eventually hits the same wall: how do you connect a locally running LLM to a vector database efficiently, and what does "production-ready" actually look like in that context?

Gemma 4 changes some of the assumptions underlying conventional RAG design. The 256K context window on the 31B Dense and 26B MoE models means that "retrieve a few chunks and concatenate them" is no longer the only viable pattern. Apache 2.0 licensing means every component in the stack can go into a commercial product without legal friction. This guide builds from architecture to production optimization.

Why Gemma 4 Changes RAG Design Assumptions

Traditional RAG systems were built around a core trade-off: context windows were short (4K to 32K tokens), so documents had to be chunked aggressively. Aggressive chunking causes context loss at chunk boundaries — a chunk might contain the answer, but without the surrounding paragraph the model misses the nuance.
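
One common way to soften the boundary problem is to let neighbouring chunks overlap, so a sentence cut at the end of one chunk reappears at the start of the next. The sketch below is a minimal word-based version; the function name `chunk_with_overlap` and the default sizes are illustrative assumptions, not settings taken from this guide.

```python
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks; each chunk repeats the last
    `overlap` words of the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_with_overlap(sample, chunk_size=4, overlap=1))
```

Counting words rather than tokens keeps the example dependency-free. A real pipeline would more likely measure chunk size with the embedding model's tokenizer.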

Gemma 4's 256K context window (31B Dense and 26B MoE) makes Long-Context RAG practical. Instead of passing the top 3 chunks to the model, you can pass the top 100. At 1,000 tokens per chunk, that's 100K tokens — well within the budget. The model can see relationships across retrieved documents directly, rather than trying to infer them from isolated chunks.
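
The budget arithmetic can be made explicit. This sketch assumes a 256,000-token window and 1,000-token chunks as stated above. The amounts reserved for instructions and for the answer are illustrative assumptions.

```python
CONTEXT_WINDOW = 256_000   # tokens, using the figure quoted in the text
TOKENS_PER_CHUNK = 1_000   # same per-chunk assumption as the text


def max_chunks(context_window: int, tokens_per_chunk: int,
               reserved_for_prompt: int = 2_000,
               reserved_for_answer: int = 4_000) -> int:
    """How many whole chunks fit once room is set aside for the
    instructions and for the generated answer."""
    usable = context_window - reserved_for_prompt - reserved_for_answer
    return max(usable // tokens_per_chunk, 0)


print(max_chunks(CONTEXT_WINDOW, TOKENS_PER_CHUNK))   # 250
print(100 * TOKENS_PER_CHUNK / CONTEXT_WINDOW)        # 0.390625
```

Under these assumptions, 100 chunks use about 39% of the window, leaving headroom for longer chunks or a longer answer.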

The move to Apache 2.0 from Gemma 3's custom license is the other key change. The earlier Gemma terms allowed commercial use but attached their own usage restrictions, which meant extra legal review before shipping a product. Apache 2.0 removes that friction: SaaS deployment, OEM bundling, and derivative commercial models are all permitted under a license most legal teams already know.

System Architecture

A production RAG stack with Gemma 4 has five layers:

┌──────────────────────────────────────────────┐
│ Layer 1: Document Processing                  │
│   PDF/HTML/Markdown → Text extraction         │
│   → Chunking strategy → Embedding generation  │
├──────────────────────────────────────────────┤
│ Layer 2: Vector Store                         │
│   ChromaDB (dev / mid-scale)                  │
│   pgvector (PostgreSQL-integrated / large)    │
├──────────────────────────────────────────────┤
│ Layer 3: Retrieval                            │
│   Semantic search + Keyword search (BM25)     │
│   Hybrid ranking (Reciprocal Rank Fusion)     │
├──────────────────────────────────────────────┤
│ Layer 4: Generation                           │
│   Gemma 4 (Ollama / Gemini API)               │
│   Long-Context RAG or standard chunked RAG    │
├──────────────────────────────────────────────┤
│ Layer 5: Caching & Optimization               │
│   Redis (query cache)                         │
│   Batched embedding generation                │
└──────────────────────────────────────────────┘

Embedding Model Selection

Gemma 4 is a generative model — it doesn't output embedding vectors directly. RAG requires a dedicated embedding model alongside Gemma 4.

Recommended options:

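# Option 1: multilingual-e5-large (local, about 100 languages, MIT license)
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("intfloat/multilingual-e5-large")
# 1024 dimensions, strong multilingual performance.
# E5 models expect "query: " / "passage: " prefixes on their inputs.
 
# Option 2: nomic-embed-text via Ollama (fully local, no external calls)
import requests
 
def embed_ollama(text: str) -> list:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return resp.json()["embedding"]
 
# Option 3: Google Gemini Embedding API (same ecosystem as Gemma 4)
import google.generativeai as genai
genai.configure(api_key="YOUR_GEMINI_API_KEY")
 
def embed_gemini(text: str) -> list:
    # The model name is an assumption; use whichever embedding model your key can access
    result = genai.embed_content(model="models/text-embedding-004", content=text)
    return result["embedding"]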

For a fully offline deployment, pair multilingual-e5-large (sentence-transformers) with Gemma 4 via Ollama. Both carry permissive licenses (MIT for the E5 weights, Apache 2.0 for Gemma 4) and run without internet access once the weights are downloaded.
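
One detail that is easy to miss with the E5 family: the model card describes inputs prefixed with `query: ` for search queries and `passage: ` for indexed documents. The helper below is a minimal sketch of that convention. The name `format_for_e5` is hypothetical, and the prefix strings should be confirmed against the current model card before relying on them.

```python
def format_for_e5(texts: list[str], kind: str) -> list[str]:
    """Prepend the E5-style role prefix to each input string.

    kind must be "query" (search-time input) or "passage" (index-time input).
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {t.strip()}" for t in texts]


print(format_for_e5(["When was Gemma 4 released?"], "query"))
print(format_for_e5(["Gemma 4 was released under Apache 2.0."], "passage"))
```

The formatted strings are what you would pass to `SentenceTransformer.encode` at index time and query time respectively.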

ChromaDB Integration

ChromaDB is the right choice for development environments and moderate scale (up to a few million documents). Its Python-native API requires no infrastructure setup:

import chromadb
import uuid
from chromadb.utils import embedding_functions
 
client = chromadb.PersistentClient(path="./chroma_db")
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="intfloat/multilingual-e5-large"
)
 
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}
)
 
def add_documents(texts: list, metadatas: list = None):
    ids = [str(uuid.uuid4()) for _ in texts]
    collection.add(
        documents=texts,
        # Pass None when there is no metadata; ChromaDB may reject empty dicts
        metadatas=metadatas,
        ids=ids
    )
 
def semantic_search(query: str, n_results: int = 20, filters: dict = None) -> list:
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=filters
    )
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )
    ]
 
# Add a document
add_documents(
    texts=["Gemma 4 was released by Google in April 2026 under Apache 2.0."],
    metadatas=[{"source": "gemma4_overview.pdf", "page": 1}]
)
 
# Retrieve
results = semantic_search("When was Gemma 4 released?")

pgvector Integration (Large Scale, PostgreSQL)

For production systems already running PostgreSQL, pgvector extends the database with vector similarity search. The main advantage: JOIN between vector results and relational data — user profiles, access control lists, document metadata — in a single query.

import psycopg2
import psycopg2.extras
from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("intfloat/multilingual-e5-large")
 
class PgVectorStore:
    def __init__(self, conn_string: str):
        self.conn = psycopg2.connect(conn_string)
        self._init_schema()
 
    def _init_schema(self):
        with self.conn.cursor() as cur:
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
            cur.execute("""
                CREATE TABLE IF NOT EXISTS documents (
                    id SERIAL PRIMARY KEY,
                    content TEXT NOT NULL,
                    embedding vector(1024),
                    metadata JSONB DEFAULT '{}',
                    created_at TIMESTAMP DEFAULT NOW()
                );
                -- IVFFlat clusters are learned from existing rows, so rebuild
                -- this index (REINDEX) after the initial bulk load.
                CREATE INDEX IF NOT EXISTS docs_embedding_idx
                ON documents USING ivfflat (embedding vector_cosine_ops)
                WITH (lists = 100);
            """)
            self.conn.commit()
 
    def upsert(self, content: str, metadata: dict = None):
        embedding = model.encode(content).tolist()
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s, %s)",
                (content, embedding, psycopg2.extras.Json(metadata or {}))
            )
            self.conn.commit()
 
    def search(self, query: str, k: int = 20, min_sim: float = 0.6) -> list:
        q_emb = model.encode(query).tolist()
        with self.conn.cursor() as cur:
            cur.execute("""
                SELECT content, metadata,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM documents
                WHERE 1 - (embedding <=> %s::vector) > %s
                ORDER BY embedding <=> %s::vector
                LIMIT %s;
            """, (q_emb, q_emb, min_sim, q_emb, k))
            return [
                {"content": r[0], "metadata": r[1], "similarity": float(r[2])}
                for r in cur.fetchall()
            ]
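
The JOIN advantage mentioned at the start of this section might look like the following. This is a sketch under assumptions: `document_acl` (with `document_id` and `user_id` columns) is a hypothetical access-control table, and `build_acl_search` is an illustrative helper that only builds the SQL and parameters. Executing it would still go through a psycopg2 cursor.

```python
def build_acl_search(query_embedding: list[float], user_id: int, k: int = 20):
    """Return (sql, params) for a similarity search restricted to
    documents the given user is allowed to read."""
    sql = """
        SELECT d.content, d.metadata,
               1 - (d.embedding <=> %s::vector) AS similarity
        FROM documents d
        JOIN document_acl a ON a.document_id = d.id  -- hypothetical ACL table
        WHERE a.user_id = %s
        ORDER BY d.embedding <=> %s::vector
        LIMIT %s;
    """
    # Parameter order must match the order of the %s placeholders above
    params = (query_embedding, user_id, query_embedding, k)
    return sql, params


sql, params = build_acl_search([0.1, 0.2, 0.3], user_id=7, k=5)
print(len(params))
```

Filtering in the same statement means the access check and the similarity ranking cannot drift apart, which is harder to guarantee when the vector store and the permissions live in separate systems.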

Long-Context Generation with Gemma 4

Once the chunks are retrieved, pass them to Gemma 4. The 256K context lets you send all of the top results in a single prompt:

import requests
 
def generate_answer(
    query: str,
    chunks: list,
    model_id: str = "gemma4:27b",  # confirm the tag against `ollama list`
    long_context: bool = True,
    num_ctx: int = 131072  # Ollama's default context is far smaller than 256K
) -> str:
    # Accept chunks keyed by "text" (ChromaDB helper) or "content" (PgVectorStore)
    def body(c: dict) -> str:
        return c.get("text") or c.get("content", "")
 
    if long_context and len(chunks) > 5:
        # Long-Context RAG: all chunks in one prompt
        context = "\n\n---\n\n".join(
            "[Source: {}]\n{}".format(
                (c.get("metadata") or {}).get("source", "unknown"), body(c)
            )
            for c in chunks
        )
        prompt = (
            f"Using the documents below, answer the question accurately.\n\n"
            f"# Documents\n{context}\n\n"
            f"# Question\n{query}\n\n"
            f"# Answer (cite sources used)"
        )
    else:
        # Standard RAG: top 3 chunks
        context = "\n\n".join(body(c) for c in chunks[:3])
        prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
 
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model_id,
            "prompt": prompt,
            "stream": False,
            # Without num_ctx, Ollama truncates long prompts to its default window
            "options": {"temperature": 0.1, "top_p": 0.9, "num_ctx": num_ctx}
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]
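
Before sending a long-context prompt, it helps to cap the retrieved chunks at a token budget so the prompt cannot overflow the window. The sketch below assumes chunks are dicts with a `text` key, already sorted best-first. The 4-characters-per-token estimate is a rough heuristic for English text, not a property of Gemma 4's tokenizer, and both function names are illustrative.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token
    return max(1, len(text) // 4)


def pack_chunks(chunks: list[dict], budget_tokens: int) -> list[dict]:
    """Keep chunks in ranked order until the estimated budget is used up."""
    packed, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk["text"])
        if used + cost > budget_tokens:
            break  # stop at the first chunk that would exceed the budget
        packed.append(chunk)
        used += cost
    return packed


ranked = [{"text": "a" * 400}, {"text": "b" * 400}, {"text": "c" * 400}]
print(len(pack_chunks(ranked, budget_tokens=250)))
```

Stopping at the first chunk that does not fit preserves the ranking order. Skipping it and trying smaller chunks further down would fill the budget more tightly at the cost of including lower-ranked material.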

Hybrid Search with Reciprocal Rank Fusion

Semantic search misses exact keyword matches; BM25 misses semantic similarity. Combining both with RRF recovers most of the gaps:

from rank_bm25 import BM25Okapi
import re
 
class HybridRetriever:
    def __init__(self, vector_store, corpus: list):
        # vector_store must expose search(query, k=...) returning results best-first
        self.vs = vector_store
        self.corpus = corpus
        tokenized = [re.findall(r"\w+", d.lower()) for d in corpus]
        self.bm25 = BM25Okapi(tokenized)
 
    def search(self, query: str, k: int = 20, alpha: float = 0.6, rrf_k: int = 60) -> list:
        # Semantic ranks (1 = best); accept "text" or "content" result keys
        sem_rank = {}
        for i, r in enumerate(self.vs.search(query, k=k*2), start=1):
            sem_rank.setdefault(r.get("text") or r.get("content"), i)
 
        # BM25 ranks (1 = best), ignoring documents with no matching terms
        tokens = re.findall(r"\w+", query.lower())
        scores = self.bm25.get_scores(tokens)
        top = [i for i in scores.argsort()[::-1][:k*2] if scores[i] > 0]
        bm25_rank = {self.corpus[i]: pos for pos, i in enumerate(top, start=1)}
 
        # Weighted RRF: each list contributes weight / (rrf_k + rank)
        fused = {}
        for d, rank in sem_rank.items():
            fused[d] = fused.get(d, 0.0) + alpha / (rrf_k + rank)
        for d, rank in bm25_rank.items():
            fused[d] = fused.get(d, 0.0) + (1 - alpha) / (rrf_k + rank)
        return [
            {"text": d, "score": s}
            for d, s in sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]
        ]
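
For reference, plain Reciprocal Rank Fusion uses only rank positions: each document scores the sum of 1 / (k + rank) over every list it appears in, with k = 60 as the customary constant. The sketch below is a dependency-free illustration with made-up document IDs.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (doc_id, score)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


semantic = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search order
keyword = ["doc_c", "doc_a", "doc_d"]    # hypothetical BM25 order
for doc_id, score in reciprocal_rank_fusion([semantic, keyword]):
    print(doc_id, round(score, 6))
```

Here `doc_a` comes out first because it ranks near the top of both lists, followed by `doc_c`, then the two documents that appear in only one list. Because only ranks matter, the two retrievers' raw scores never need to be normalised onto a common scale.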

Production Optimization: Batched Embeddings and Caching

For ingesting large document collections, embedding in batches reduces wall-clock time, and caching vectors in Redis avoids re-embedding text that has not changed:

import redis
import hashlib
import json
from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("intfloat/multilingual-e5-large")
cache = redis.Redis(host="localhost", port=6379, db=0)
 
def embed_batch_cached(texts: list, ttl: int = 86400) -> list:
    results = [None] * len(texts)
    uncached_idx = []
 
    for i, text in enumerate(texts):
        # Include the model name so vectors from a different model are never reused
        key = f"emb:multilingual-e5-large:{hashlib.md5(text.encode()).hexdigest()}"
        hit = cache.get(key)
        if hit:
            results[i] = json.loads(hit)
        else:
            uncached_idx.append((i, text, key))
 
    if uncached_idx:
        batch_texts = [t for _, t, _ in uncached_idx]
        # batch_size=64 balances GPU memory and throughput
        embeddings = model.encode(batch_texts, batch_size=64).tolist()
        for (i, _, key), emb in zip(uncached_idx, embeddings):
            results[i] = emb
            cache.setex(key, ttl, json.dumps(emb))
 
    return results
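
The architecture also lists a query cache in Layer 5. One way to build its keys is sketched below, assuming answers are cached per normalised query, model, and retrieval depth. The `rag:` prefix and the function name are hypothetical. The resulting string could be used with the same `cache.get` / `cache.setex` calls as above.

```python
import hashlib
import json


def query_cache_key(query: str, model_id: str, k: int) -> str:
    """Build a stable cache key from the normalised query and the settings
    that affect the answer."""
    normalized = " ".join(query.lower().split())  # collapse case and whitespace
    payload = json.dumps({"q": normalized, "model": model_id, "k": k}, sort_keys=True)
    return "rag:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = query_cache_key("When  was Gemma 4 released?", "gemma4:27b", 20)
b = query_cache_key("when was gemma 4 released?", "gemma4:27b", 20)
print(a == b)
```

Including the model and `k` in the key keeps an answer generated under one configuration from being served under another.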

Where This Stack Performs Best

Gemma 4 + ChromaDB/pgvector is strongest in three categories of use case:

Internal knowledge bases with privacy constraints: Documents that can't go to external APIs (legal, financial, medical) run entirely on-premises. Apache 2.0 means no per-seat or per-query licensing costs on top of compute.

Multilingual support desks: Gemma 4's 140+ language support with a multilingual embedding model handles mixed-language tickets (Japanese, English, German, Arabic) from a single model instance.

Large codebase search: With 256K context, an entire mid-sized repository fits in a single prompt. Dependency relationships that span files are visible to the model directly — no chunking-induced context loss.

Start with ollama pull gemma4:27b and ChromaDB for local validation. Migrate to pgvector when the document volume exceeds a few million or you need SQL JOINs with relational data.
