Gemini Lab
API / SDK · 2026-06-13 · Advanced

Rebuilding a Three-Layer RAG Cache After Migrating to Gemini 3.5 Flash

When Gemini 2.0 Flash was retired, I rebuilt my RAG caching stack around 3.5 Flash. Here are the working implementations for response, semantic, and embedding caches, measured hit rates from production, and how self-managed caching divides the work with the API's Context Caching.

Tags: gemini-3-5-flash, rag, cost-optimization, caching, gemini-embedding-2


When Gemini 2.0 Flash was retired on June 1, my RAG pipeline — an indie developer project running on a solo budget — moved to 3.5 Flash. I would have loved for the migration to end with swapping a model ID and running the test suite. In practice, that was where the real work started: once pricing and token-handling assumptions change, a cache design that used to be optimal quietly stops being so.

So while I was in there, I peeled off every cache layer and rebuilt the stack from scratch. The punchline: I ended up with the same three-layer structure as before. What changed was the priority of each layer and how the work is divided with the API's own Context Caching. This article walks through the rebuilt design with working code and production numbers.

Cut your caches along the cost boundaries

A single RAG request spends money in exactly three places.

  1. Query embedding — the API call that vectorizes the question
  2. Vector search — the database query (billed per search on managed services)
  3. Answer generation — sending context plus question to the model

Items 2 and 3 dominate the invoice. That is why I find it cleanest to organize caches by which cost point each one eliminates:

  • L1 (response cache): stores answers to identical questions; a hit skips all three steps
  • L2 (semantic cache): reuses retrieval results for semantically similar past queries; skips step 2
  • L3 (embedding cache): never recomputes the embedding of an identical string; skips step 1

The lower the layer number, the bigger the saving on a hit: L1 skips all three steps, L2 skips the vector search, and L3 skips only the embedding call. That hierarchy survived the migration intact. What shifted was my judgment about how much to keep self-managed; more on that below.
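To put rough numbers on that hierarchy, here is a minimal sketch of the expected spend per request. It is not taken from my production code. It assumes that an L1 hit costs nothing, that the L2 and L3 hit rates are measured only on requests that missed L1, and that those two rates are independent. The function name and the unit costs are placeholders to replace with your own invoice figures.

```python
def expected_cost_per_request(
    h1: float, h2: float, h3: float,
    c_embed: float, c_search: float, c_gen: float,
) -> float:
    """Expected spend per request under a simple three-layer model.

    h1 is the L1 hit rate over all requests. h2 and h3 are the L2 and L3
    hit rates among requests that missed L1 (assumed independent).
    """
    for h in (h1, h2, h3):
        if not 0.0 <= h <= 1.0:
            raise ValueError("hit rates must be between 0 and 1")
    # On an L1 miss: pay for the embedding unless L3 hits, pay for the
    # vector search unless L2 hits, and always pay for generation.
    miss_cost = (1 - h3) * c_embed + (1 - h2) * c_search + c_gen
    return (1 - h1) * miss_cost


# Placeholder unit costs (arbitrary units), not real prices.
baseline = expected_cost_per_request(0, 0, 0, c_embed=1, c_search=10, c_gen=100)
cached = expected_cost_per_request(0.3, 0.2, 0.5, c_embed=1, c_search=10, c_gen=100)
print(f"saving: {1 - cached / baseline:.1%}")
```

Because generation is paid on every L1 miss, the L1 hit rate scales the whole bill, while L2 and L3 only trim their own line items.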

L1: the response cache is still the first thing to install

User questions repeat far more than you expect. In FAQ-shaped workloads, exact matches alone catch around a third of traffic. In my production data for the past week, L1 hit 34% — one in three generation calls evaporates before it ever reaches the API.

import hashlib
import json
import redis
 
from google import genai
 
client = genai.Client()  # reads GEMINI_API_KEY from the environment
r = redis.Redis(decode_responses=True)
RESP_TTL = 60 * 60 * 24 * 7  # 7 days
 
def response_cache_key(tenant: str, query: str, filters: dict) -> str:
    """Build a deterministic key from tenant, normalized query, and filters."""
    payload = json.dumps(
        {"t": tenant, "q": " ".join(query.lower().split()), "f": filters},
        sort_keys=True,
    )
    return "rag:l1:" + hashlib.sha256(payload.encode()).hexdigest()
 
def answer_with_l1(tenant: str, query: str, filters: dict) -> dict:
    key = response_cache_key(tenant, query, filters)
    if (hit := r.get(key)) is not None:
        return json.loads(hit)
    result = run_rag_pipeline(tenant, query, filters)  # includes L2/L3
    r.setex(key, RESP_TTL, json.dumps(result))
    return result

Three deliberate choices here. First, normalization via lower().split() collapses case and whitespace variants into one key. Second, the tenant ID goes into the key, always — I once came within a code review of shipping a version without it, which would have served one tenant's answers to another. Permission boundaries belong in the cache key itself. Third, the TTL: keep serving yesterday's answer after a document update and user trust erodes quietly.
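A quick way to convince yourself of the first two choices is to call the key function directly. The snippet below repeats response_cache_key from the listing above so it runs on its own. The tenant names and the query are made up for illustration.

```python
import hashlib
import json


def response_cache_key(tenant: str, query: str, filters: dict) -> str:
    """Same key builder as in the L1 listing, repeated so this runs standalone."""
    payload = json.dumps(
        {"t": tenant, "q": " ".join(query.lower().split()), "f": filters},
        sort_keys=True,
    )
    return "rag:l1:" + hashlib.sha256(payload.encode()).hexdigest()


# Hypothetical tenants and query, for illustration only.
a = response_cache_key("acme", "How do I reset my password?", {"lang": "en"})
b = response_cache_key("acme", "  how do I   RESET my password? ", {"lang": "en"})
c = response_cache_key("globex", "How do I reset my password?", {"lang": "en"})

print(a == b)  # True: case and whitespace variants share one key
print(a == c)  # False: the same question from another tenant gets its own key
```

Note what this normalization does not do: punctuation and wording variants still produce different keys, so "reset my password" and "reset my password?" miss each other at L1.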

For invalidation I don't rely on TTL alone. Each cached answer carries the IDs of the documents it cites, and a document-update event purges every answer referencing it. At weekly update cadence, that mechanism alone has kept stale-answer incidents at zero.
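The shape of that purge is easier to see in code. What follows is a minimal in-memory sketch, not my production implementation: the class and method names are hypothetical, and plain dicts stand in for Redis. In Redis, the reverse index could plausibly be one set per document ID, filled with SADD on write and read with SMEMBERS before deleting the answer keys.

```python
from collections import defaultdict
from typing import Optional


class DocIndexedCache:
    """In-memory stand-in for the L1 store plus a document -> keys reverse index."""

    def __init__(self) -> None:
        self._answers = {}                    # cache key -> cached answer
        self._keys_by_doc = defaultdict(set)  # document ID -> cache keys citing it

    def put(self, key: str, answer: dict, doc_ids: list) -> None:
        """Store an answer and register it under every document it cites."""
        self._answers[key] = answer
        for doc_id in doc_ids:
            self._keys_by_doc[doc_id].add(key)

    def get(self, key: str) -> Optional[dict]:
        return self._answers.get(key)

    def purge_document(self, doc_id: str) -> int:
        """Drop every cached answer citing doc_id; return how many were removed."""
        removed = 0
        for key in self._keys_by_doc.pop(doc_id, set()):
            # A key may already be gone if another cited document was purged first.
            if self._answers.pop(key, None) is not None:
                removed += 1
        return removed


cache = DocIndexedCache()
cache.put("k1", {"text": "answer one"}, ["doc-1", "doc-2"])
cache.put("k2", {"text": "answer two"}, ["doc-2"])
cache.put("k3", {"text": "answer three"}, ["doc-3"])
print(cache.purge_document("doc-2"))  # 2: both answers citing doc-2 are dropped
print(cache.get("k3"))                # the answer citing only doc-3 survives
```

The TTL stays in place as a backstop for anything the update event misses. The reverse index only makes sure a known document change never waits for expiry.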
