GEMINI LAB
API / SDK · 2026-05-05 · Advanced

Cutting Gemini API Costs by 80%: Context Caching and Implicit Caching

A hands-on guide to reducing Gemini API costs by 80% using Context Caching and Implicit Caching. Includes decision frameworks, working code examples, and a troubleshooting checklist for when caching stops working in production.

Nothing sharpens your focus on token costs quite like a billing alert at 2 AM.

When the Gemini API bill for a service I was running crossed $200 in a single month, my first instinct was to blame token-heavy user inputs. After half a day of digging, the actual culprit turned out to be something far more avoidable: I was sending the same 15,000-token system prompt on every single request. Every time a user asked a one-line question, my backend was shipping a small novel of context to Google's servers.

This guide is the documentation I wish I'd had before that invoice arrived. It covers the two caching mechanisms the Gemini API provides, Context Caching and Implicit Caching, with production-ready implementations, real cost math, and a troubleshooting checklist for the failure modes that aren't obvious until you hit them in production.

The Three Types of Caching in the Gemini Ecosystem

Before diving into implementation, it helps to be clear on what "caching" means in this context, because the term is overloaded.

Context Caching is an explicit, user-managed feature. You create a cache object containing a large, static prompt prefix (system instructions, reference documents, few-shot examples), and subsequent requests reference that cache by ID. Gemini charges reduced rates for cached token inputs, plus a small storage fee. This is the most reliable way to reduce costs for workloads with a large, stable context.

Implicit Caching is automatic. Gemini detects when consecutive requests share a common prefix and caches that prefix server-side without any configuration on your part. It's opportunistic — it may or may not trigger depending on traffic patterns and Google's infrastructure — but when it does kick in, it's free cost reduction with zero code changes.
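
Because implicit caching depends on requests sharing a common prefix, the main lever you control is prompt ordering: stable material first, per-request text last. The snippet below is a minimal sketch of that idea using plain strings. `build_prompt` and `shared_prefix_len` are hypothetical helper names, not part of any Gemini SDK, and a character-level prefix is only a rough proxy for whatever the server actually matches on.

```python
import os

STATIC_CONTEXT = "You are a support assistant for Acme Corp.\n[long reference material]\n"

def build_prompt(static_context: str, user_query: str) -> str:
    """Put the stable block first so consecutive requests share a prefix."""
    return f"{static_context}\nUser question: {user_query}"

def shared_prefix_len(a: str, b: str) -> int:
    """Length in characters of the common prefix of two prompts."""
    return len(os.path.commonprefix([a, b]))

q1 = "What is your return policy?"
q2 = "How long does shipping take?"

# Static content first: the whole context is shared between requests.
good = shared_prefix_len(build_prompt(STATIC_CONTEXT, q1),
                         build_prompt(STATIC_CONTEXT, q2))

# Query first: the prompts diverge at the first character.
bad = shared_prefix_len(f"{q1}\n{STATIC_CONTEXT}", f"{q2}\n{STATIC_CONTEXT}")

print(good >= len(STATIC_CONTEXT), bad)  # True 0
```

The same reasoning applies to timestamps, request IDs, or user names placed near the top of a prompt: anything that varies per request shortens the prefix consecutive requests have in common.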

Application-layer KV Caching (Redis, Cloudflare KV, etc.) means caching full API responses for identical or near-identical queries. This is separate from anything Gemini does natively and is generally the highest-leverage approach for workloads where the same question gets asked repeatedly.
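
For orientation only, since this guide does not cover it further, here is a rough sketch of what application-layer response caching looks like. It uses an in-memory dict as a stand-in for Redis or Cloudflare KV. `ResponseCache`, `answer`, and the `call_model` callback are hypothetical names for illustration, and the normalization rule (lowercase, collapsed whitespace) is an assumption you would tune for your own traffic.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple

class ResponseCache:
    """In-memory stand-in for a KV store: caches full responses by query."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different queries share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return response

    def put(self, query: str, response: str) -> None:
        self._store[self._key(query)] = (self._clock() + self._ttl, response)

def answer(query: str, cache: ResponseCache,
           call_model: Callable[[str], str]) -> str:
    """Return a cached response if present; otherwise call the model and store it."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    response = call_model(query)
    cache.put(query, response)
    return response
```

In a real service `call_model` would wrap the Gemini request, and a hit skips the API call entirely, which is why this layer can save more than either native mechanism when questions repeat.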

This guide focuses on the first two. Understanding both — and knowing when to use each — is where most of the cost reduction opportunity lies.

Context Caching: Reliable, Explicit, and Worth the Setup Time

When Context Caching Pays Off

Context Caching is the right tool when:

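  • You have a system prompt or reference content exceeding ~5,000 tokens, and large enough to clear the model's minimum cacheable size (the 32,768-token figure dates from earlier Gemini models, so check the current documentation for the Gemini 2.5 Pro minimum)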
  • That content stays stable across many requests
  • You're running more than ~50 requests per day against the same context

If your system prompt is a paragraph and changes frequently, Context Caching isn't the right approach. The math only works when the same large block of tokens gets reused across many calls.

What the Cost Math Actually Looks Like

For Gemini 2.5 Pro (approximate 2026 pricing):

Standard input:    $1.25 per 1M tokens
Cached input:      $0.3125 per 1M tokens (~25% of standard)
Cache storage:     $1.00 per 1M tokens per hour

A concrete example: 20,000-token system prompt, 50,000 requests per month.

Without caching:
  System prompt: 20,000 tokens × 50,000 requests × $0.00000125 = $1,250/month
  User queries (~200 tokens): 200 × 50,000 × $0.00000125 = $12.50/month
  Total: $1,262.50/month

With Context Caching (1-hour TTL, renewed so the cache stays alive all month):
  Storage: 20,000 tokens × 720 hours × $0.000001 = $14.40/month
  Cached input: 20,000 × 50,000 × $0.0000003125 = $312.50/month
  Non-cached input (user queries ~200 tokens): 200 × 50,000 × $0.00000125 = $12.50/month
  Total: ~$339/month → 73% reduction

The more requests you make against the same cached content, the better the economics get.
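
To rerun this arithmetic with your own numbers, a small calculator along these lines may help. It is a sketch built on the approximate prices quoted above, not on live pricing. The function names are illustrative, output tokens and any cache-creation charge are ignored, and the baseline counts the ~200 user-query tokens per request as well as the system prompt.

```python
# Prices per token, from the approximate figures quoted above (assumptions).
STANDARD_INPUT = 1.25 / 1_000_000
CACHED_INPUT = 0.3125 / 1_000_000
STORAGE_PER_HOUR = 1.00 / 1_000_000

def monthly_cost_uncached(context_tokens: int, query_tokens: int, requests: int) -> float:
    """Every request pays the standard rate for context plus query."""
    return (context_tokens + query_tokens) * requests * STANDARD_INPUT

def monthly_cost_cached(context_tokens: int, query_tokens: int, requests: int,
                        hours_stored: int = 720) -> float:
    """Context is billed at the cached rate plus storage; queries stay at the standard rate."""
    storage = context_tokens * hours_stored * STORAGE_PER_HOUR
    cached = context_tokens * requests * CACHED_INPUT
    fresh = query_tokens * requests * STANDARD_INPUT
    return storage + cached + fresh

base = monthly_cost_uncached(20_000, 200, 50_000)
with_cache = monthly_cost_cached(20_000, 200, 50_000)
print(f"uncached ${base:,.2f}  cached ${with_cache:,.2f}  saving {1 - with_cache / base:.0%}")
# uncached $1,262.50  cached $339.40  saving 73%
```

Since cached input is priced at 25% of standard in this model, the saving on the cached portion can approach but never exceed 75%. Storage and uncached query tokens pull the overall figure slightly lower.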

A Production-Ready Implementation

import datetime

# Note: this uses the legacy google-generativeai SDK.
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_GEMINI_API_KEY")

# Your large, static context: system instructions, reference docs, etc.
SYSTEM_CONTEXT = """
You are a customer support AI for Acme Corp.
Use the following product documentation and FAQ to answer questions accurately.

[Product Manual: approximately 20,000 tokens of content]
Section 1: Getting Started
...
Section 12: Advanced Troubleshooting
...
"""

def get_or_create_cache(
    content: str,
    display_name: str = "prod-support-cache",
    ttl_seconds: int = 3600
) -> caching.CachedContent:
    """
    Reuse an existing cache if one exists; create a new one otherwise.
    This prevents unnecessary re-creation on restarts.
    """
    # list() only returns caches that have not expired yet.
    for cached in caching.CachedContent.list():
        if cached.display_name == display_name:
            print(f"Reusing existing cache: {cached.name}")
            return cached

    print(f"Creating new cache with TTL={ttl_seconds}s...")
    cache = caching.CachedContent.create(
        model="models/gemini-2.5-pro",
        system_instruction=content,
        # This SDK takes the TTL as a timedelta rather than a "3600s" string.
        ttl=datetime.timedelta(seconds=ttl_seconds),
        display_name=display_name
    )
    print(f"Cache created: {cache.name}, expires: {cache.expire_time}")
    return cache
 
def chat_with_context_cache(cache: caching.CachedContent, user_message: str) -> dict:
    """
    Send a request using an existing cache.
    Returns the response text and usage stats for cost tracking.
    """
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    response = model.generate_content(user_message)
    
    usage = response.usage_metadata
    return {
        "text": response.text,
        "cached_tokens": usage.cached_content_token_count or 0,
        "total_input_tokens": usage.prompt_token_count or 0,
        "output_tokens": usage.candidates_token_count or 0,
    }
 
# Usage
cache = get_or_create_cache(SYSTEM_CONTEXT)
 
test_queries = [
    "What is your return policy?",
    "How long does shipping take?",
    "Can I change my order after placing it?",
]
 
for query in test_queries:
    result = chat_with_context_cache(cache, query)
    cache_ratio = result["cached_tokens"] / max(result["total_input_tokens"], 1) * 100
    print(f"Q: {query}")
    print(f"   Cached: {result['cached_tokens']} tokens ({cache_ratio:.0f}% of input)")
    print(f"   Total input: {result['total_input_tokens']} tokens\n")

The get_or_create_cache() pattern is important for production — without it, every deployment restart creates a new cache and pays the storage cost for a duplicate. Always check for an existing cache by display name before creating.
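
One gap in matching on display name alone is that an edited SYSTEM_CONTEXT would still pick up the old cache as long as it has not expired. A possible refinement, sketched here as an assumption rather than part of the original implementation, is to fold a short hash of the model and content into the display name. `versioned_display_name` is a hypothetical helper.

```python
import hashlib

def versioned_display_name(prefix: str, content: str, model: str) -> str:
    """Derive a display name that changes whenever the model or content changes."""
    digest = hashlib.sha256(f"{model}\n{content}".encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}"

name_v1 = versioned_display_name("prod-support-cache", "manual v1", "models/gemini-2.5-pro")
name_v2 = versioned_display_name("prod-support-cache", "manual v2", "models/gemini-2.5-pro")
print(name_v1 != name_v2)  # True
```

Passing the result as `display_name` to `get_or_create_cache()` means a changed prompt creates a fresh cache, while an unchanged one is still reused across restarts. The superseded cache keeps accruing storage until its TTL lapses, so shorter TTLs limit that overlap.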

TTL Design: Shorter Isn't Always Cheaper

The intuition that "shorter TTL = less storage cost" is correct in isolation, but misleading in practice. A cache that expires and gets re-created every 30 minutes during high-traffic periods can cost more than one held for 2 hours: each re-creation reprocesses the full context, and any request that arrives between expiry and re-creation misses the cache and pays the standard input rate.

A pragmatic approach:

def choose_ttl(daily_request_estimate: int) -> int:
    """
    Simple TTL heuristic based on expected request volume.
    Tune this for your actual traffic patterns.
    """
    if daily_request_estimate > 500:
        return 7200   # 2 hours — high volume, keep cache warm
    elif daily_request_estimate > 100:
        return 3600   # 1 hour — moderate volume
    else:
        return 1800   # 30 minutes — low volume, minimize storage cost
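
These thresholds can be sanity-checked against a break-even rate. Keeping a cache stored for an hour costs the storage price once, while each request served from it saves the difference between the standard and cached input prices. Using the approximate per-1M-token prices quoted earlier, the break-even works out as below. This is a simplified sketch that ignores any cost of creating the cache, and the function names are illustrative.

```python
def break_even_requests_per_hour(standard: float = 1.25,
                                 cached: float = 0.3125,
                                 storage_per_hour: float = 1.00) -> float:
    """Requests per hour above which storing the cache beats not caching.

    All prices are per 1M tokens; the context size cancels out of the ratio.
    """
    return storage_per_hour / (standard - cached)

def cache_is_worth_keeping(requests_per_hour: float) -> bool:
    """True when the per-request saving outweighs hourly storage."""
    return requests_per_hour > break_even_requests_per_hour()

print(round(break_even_requests_per_hour(), 2))  # 1.07
print(cache_is_worth_keeping(100 / 24))          # True
```

Under these assumptions even 100 requests per day, spread evenly at about 4 per hour, clears the break-even. Traffic is rarely even, though, so quiet overnight hours are where a shorter TTL earns its keep.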
