GEMINI LAB
API / SDK · 2026-06-28 · Advanced

When Gemini × Qdrant Hybrid Search Was Quietly Losing Recall — Field Notes on Instrumenting RRF Weights and Sparse-Vector Drift

Run Gemini embeddings with Qdrant hybrid search in production and your dashboards can stay green while recall quietly slips. These field notes show how to catch the slip by measuring its usual causes (RRF weights, sparse-vector drift, missing payload indexes) and how to protect recall with a quality budget.

Tags: Gemini API, Qdrant, Hybrid Search, RAG, RRF, Recall, Production, Instrumentation


Search is still fast — the answers just got thinner

One morning I noticed that my support RAG was answering noticeably worse than it had six months earlier. The latency graph was a flat line, Qdrant's health check was green, and the error rate was zero. What had degraded wasn't response time — it was the quality of the documents being retrieved. The right candidates simply weren't surfacing near the top anymore.

This kind of decay never trips an alert. No 500s, no 429s — the system just keeps handing thin context to Gemini. Generation flows smoothly, so a human only feels "this is weak" when they actually read the output. The genuinely scary thing about vector search is that it keeps working even when it's broken. As an indie developer I've run several small support RAGs, and nothing has bitten me later quite like this quiet decay.

These are field notes on running Gemini embeddings with Qdrant hybrid search in production: how I caught the slow recall regression with measurement, and how I tuned RRF weights and sparse-vector design with numbers rather than intuition. Not a general intro — just the parts I couldn't have noticed without measuring.


Put measurement first — make recall visible

Before any discussion of precision, I build a small eval set. You don't need a pristine dataset. Pull 50–100 queries from real query logs and attach to each one or more document IDs that count as correct if retrieved. With just this in place you can say, in numbers rather than guesses, whether a config change was an improvement or a regression.

# eval_set.py — keep the eval set as JSONL (one query per line)
# {"query": "...", "relevant_ids": ["doc_12", "doc_88"]}
import json
 
def load_eval_set(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
 
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """How much of the ground truth made the top k (0.0–1.0)."""
    if not relevant_ids:
        return 0.0
    top = set(retrieved_ids[:k])
    hit = sum(1 for r in relevant_ids if r in top)
    return hit / len(relevant_ids)
 
def mrr(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Where the first correct hit landed (reciprocal of its rank)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
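
To turn the two per-query metrics into one number per configuration, they can be averaged over the whole eval set. The sketch below is one way to do that, not the author's harness: `evaluate` and `search_fn` are hypothetical names, and `search_fn` stands in for whatever function runs your Qdrant query and returns ranked document IDs. The two metric functions are repeated in compact form only so the block runs on its own.

```python
from statistics import mean
from typing import Callable


def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    # Same logic as eval_set.py: share of the ground truth found in the top k.
    if not relevant_ids:
        return 0.0
    top = set(retrieved_ids[:k])
    return sum(1 for r in relevant_ids if r in top) / len(relevant_ids)


def mrr(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    # Same logic as eval_set.py: reciprocal rank of the first correct hit.
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(
    eval_set: list[dict],
    search_fn: Callable[[str], list[str]],  # hypothetical: query -> ranked doc IDs
    k: int = 10,
) -> dict:
    """Average recall@k and MRR over every query in the eval set."""
    if not eval_set:
        return {"recall_at_k": 0.0, "mrr": 0.0, "n": 0}
    recalls, reciprocal_ranks = [], []
    for item in eval_set:
        retrieved = search_fn(item["query"])
        recalls.append(recall_at_k(retrieved, item["relevant_ids"], k))
        reciprocal_ranks.append(mrr(retrieved, item["relevant_ids"]))
    return {
        "recall_at_k": mean(recalls),
        "mrr": mean(reciprocal_ranks),
        "n": len(eval_set),
    }
```

Storing the returned dict alongside the config that produced it (for example, one JSON line per run) gives a history to compare against the next time a setting changes.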

I run both recall@10 and MRR every time I touch a setting. Recall measures how much of the ground truth you miss; MRR measures how high the first correct hit lands. Rely on one alone and you'll misjudge. In practice, changing RRF weights often raised recall while lowering MRR, a tug of war you only see if you watch both.
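
To make "RRF weights" concrete, here is a minimal client-side sketch of weighted Reciprocal Rank Fusion in plain Python. It is an illustration under assumptions, not the setup used in these notes: `weighted_rrf` is a hypothetical helper, the per-source weights and the constant `k = 60` are illustrative defaults, and the fusion runs in application code rather than inside Qdrant. Each document scores the sum of `weight / (k + rank)` over every ranked list it appears in.

```python
from collections import defaultdict


def weighted_rrf(
    rankings: dict[str, list[str]],  # e.g. {"dense": [...], "sparse": [...]}
    weights: dict[str, float],       # per-source weight; missing sources default to 1.0
    k: int = 60,                     # smoothing constant (assumed default)
) -> list[str]:
    """Fuse several ranked ID lists into one list, best score first."""
    scores: dict[str, float] = defaultdict(float)
    for source, ranked_ids in rankings.items():
        w = weights.get(source, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_d", "doc_a"]

equal = weighted_rrf({"dense": dense, "sparse": sparse}, {"dense": 1.0, "sparse": 1.0})
sparse_heavy = weighted_rrf({"dense": dense, "sparse": sparse}, {"dense": 1.0, "sparse": 2.0})
print(equal)
print(sparse_heavy)
```

Shifting weight toward the sparse list reorders the top of the fused ranking even though the candidate set is unchanged, which is one way recall@10 and MRR can move in opposite directions for the same change.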


WHAT YOU'LL LEARN
The pattern where recall drops while every dashboard stays green, and the measurement code that surfaces it
How to tune RRF k and dense/sparse weights by measuring against a small eval set instead of guessing
The real causes of silent recall loss: sparse-vector drift, missing payload indexes, and mid-flight embedding swaps