GEMINI LAB
API / SDK · 2026-06-28 · Advanced

Mixing Text and Images in One File Search Skewed My Results Toward Images — Rebalancing by Modality After Retrieval

When you put text and images in a single File Search store with gemini-embedding-2, results can quietly skew toward one modality. Here is how to measure that skew and even it out after retrieval, using per-modality normalization and quota-based merging — with working code.

As an indie developer, I wanted a single search across the Dolice Labs help articles, so one morning I dropped the text articles, the app screenshots, and a pile of wallpaper assets into one File Search store. Because gemini-embedding-2 can embed text and images into the same space, this was supposed to give me one index I could query with words or with pictures.

Then I asked, "where do I configure billing," and the top of the list was nothing but screenshots and wallpapers. Of the eight hits, only two were the text articles I actually wanted.

The embeddings were working. Images were retrievable by meaning. And yet the answer I needed got pushed under a heap of images. This was less a precision problem than a quirk specific to mixed stores: the modality composition of the retrieved set skews. Today I want to write down how I measured that skew and evened it out after retrieval, with the implementation.

Why a mixed store leans toward one modality

The cause is that text and images have different similarity-score distributions.

Against the same query vector, the cosine similarity of image chunks can come out systematically higher — and more tightly packed — than text chunks. This is not about one being better than the other. It is a scale mismatch that arises because each modality clusters differently inside the embedding space.
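
One way to check for this scale mismatch in your own store is to summarize the scores per modality for a single query. The following is a minimal sketch, assuming you have already pulled out (modality, score) pairs for each hit. The function name score_stats and the sample scores are illustrative assumptions, not measurements from my store.

```python
from statistics import mean, pstdev

def score_stats(hits):
    """Return {modality: (count, mean, std)} for (modality, score) pairs."""
    by_mod = {}
    for modality, score in hits:
        by_mod.setdefault(modality, []).append(score)
    return {
        m: (len(s), round(mean(s), 3), round(pstdev(s), 3))
        for m, s in by_mod.items()
    }

# Illustrative scores only (hypothetical, not from a real store):
# images sit slightly higher and closer together than text.
sample = [
    ("image", 0.62), ("image", 0.61), ("image", 0.60), ("image", 0.59),
    ("text", 0.60), ("text", 0.52), ("text", 0.45), ("text", 0.38),
]
stats = score_stats(sample)
for modality, (n, mu, sigma) in stats.items():
    print(f"{modality}: n={n} mean={mu} std={sigma}")
```

If one modality shows a higher mean and a smaller spread across many queries, that is the pattern described above, and a single shared cutoff will favor it.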

If you naively rank everything on one score and take the top eight, the modality with the slightly higher average takes most of the slots. In my store the image side scored a touch higher, so images crowded out text. In a different store, text could just as easily crowd out images.

In other words, the problem is slicing two distributions mixed onto one number line. Level the playing field before you allocate — that becomes the spine of the design.
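
As a rough illustration of "level the field, then allocate," here is a minimal sketch that z-scores each hit within its own modality and then fills a per-modality quota. The z-score is one possible normalization chosen for this example, and the names normalize_per_modality and quota_merge, the dict-shaped hits, and the sample scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def normalize_per_modality(hits):
    """Add a "z" key: the score standardized within the hit's own modality.

    hits: list of dicts with "id" (assumed unique), "modality", "score".
    """
    groups = {}
    for h in hits:
        groups.setdefault(h["modality"], []).append(h["score"])
    stats = {m: (mean(s), pstdev(s)) for m, s in groups.items()}
    out = []
    for h in hits:
        mu, sigma = stats[h["modality"]]
        z = (h["score"] - mu) / sigma if sigma > 0 else 0.0
        out.append({**h, "z": z})
    return out

def quota_merge(hits, quota, top_n=8):
    """Fill each modality's quota by z-score, then top up with leftovers."""
    ranked = sorted(hits, key=lambda h: h["z"], reverse=True)
    picked, used = [], {}
    for h in ranked:
        m = h["modality"]
        if used.get(m, 0) < quota.get(m, 0):
            picked.append(h)
            used[m] = used.get(m, 0) + 1
    # If a modality could not fill its quota, use the best remaining hits.
    picked_ids = {h["id"] for h in picked}
    for h in ranked:
        if len(picked) >= top_n:
            break
        if h["id"] not in picked_ids:
            picked.append(h)
            picked_ids.add(h["id"])
    picked.sort(key=lambda h: h["z"], reverse=True)
    return picked[:top_n]

# Hypothetical scores: every image outscores every text chunk.
raw = [{"id": f"i{k}", "modality": "image", "score": s}
       for k, s in enumerate([0.62, 0.61, 0.60, 0.59, 0.58, 0.57])]
raw += [{"id": f"t{k}", "modality": "text", "score": s}
        for k, s in enumerate([0.56, 0.50, 0.44, 0.38])]

naive = sorted(raw, key=lambda h: h["score"], reverse=True)[:4]
merged = quota_merge(normalize_per_modality(raw),
                     quota={"text": 3, "image": 1}, top_n=4)
print([h["id"] for h in naive])
print([h["id"] for h in merged])
```

With these sample scores the naive top four is all images, while the quota-based merge returns three text chunks and one image. Z-scores computed over a handful of candidates are noisy, so in practice you may want a larger candidate pool per modality before normalizing.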

First, measure the skew — compute the modality mix

Before fixing anything, turn the skew into a number. It is a tiny measurement: decide whether each hit is text or image, then look at the composition of the top N.

File Search grounding metadata includes a media_id and page numbers for visual citations. I simply treat a chunk with a media_id (or an image reference) as image modality, and a chunk with a text body as text modality. Adjust the field names to match your SDK version.

from dataclasses import dataclass

@dataclass
class Hit:
    id: str
    score: float          # the similarity score the store returned
    modality: str         # "text" or "image"
    text: str = ""        # the text chunk body, if any

def detect_modality(chunk) -> str:
    """Infer the modality from a grounding chunk.
    Adjust the SDK field names to your environment."""
    media_id = getattr(chunk, "media_id", None)
    if media_id:
        return "image"
    # Fallback if image references live in a different field
    if getattr(chunk, "image_uri", None) or getattr(chunk, "inline_image", None):
        return "image"
    return "text"

def modality_mix(hits, top_n=8):
    """Return {modality: (count, share)} for the first top_n hits."""
    top = hits[:top_n]
    counts = {"text": 0, "image": 0}
    for h in top:
        counts[h.modality] = counts.get(h.modality, 0) + 1
    total = max(1, len(top))  # avoid dividing by zero on an empty result
    return {k: (v, round(v / total, 2)) for k, v in counts.items()}

# Example: measure the raw top 8 right after retrieval
# print(modality_mix(raw_hits, top_n=8))
# => {'text': (2, 0.25), 'image': (6, 0.75)}  ← a text question, yet 3/4 images

The gap between the "composition you expected" and the "composition you actually got" is what you are fixing. If a question looking for text help is 75% images, it is worth leveling the field.
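
To put a number on that gap, a small helper can subtract the share you expected from the share you observed. This is a minimal sketch. The function name mix_gap and the 75% text target are assumptions chosen for illustration, so set the target to whatever composition fits your own queries.

```python
def mix_gap(actual_counts, expected_share):
    """Observed share minus expected share, per modality.

    actual_counts:  e.g. {"text": 2, "image": 6}
    expected_share: e.g. {"text": 0.75, "image": 0.25} (hypothetical target)
    Positive values mean the modality is over-represented.
    """
    total = sum(actual_counts.values())
    if total == 0:
        return {m: 0.0 for m in expected_share}
    gap = {}
    for m in sorted(set(actual_counts) | set(expected_share)):
        share = actual_counts.get(m, 0) / total
        gap[m] = round(share - expected_share.get(m, 0.0), 2)
    return gap

# Using the 2 text / 6 image mix from the measurement above
gap = mix_gap({"text": 2, "image": 6}, {"text": 0.75, "image": 0.25})
print(gap)  # gap["text"] is -0.5, gap["image"] is 0.5
```

Logging this per query makes it easier to see whether the skew is consistent or limited to certain kinds of questions.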

WHAT YOU'LL LEARN
  • Reframe why a mixed text+image File Search leans toward one modality as a difference in score distributions, and learn to measure the skew in your own store with numbers
  • Get working code for per-modality score normalization and quota-based merging so the composition of your retrieved set stays the way you intend
  • Add a lightweight query-intent check that shifts the quota, so both image-leaning and text-leaning questions hold up without collapsing
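
As a rough idea of what a query-intent check that shifts the quota could look like, here is a minimal keyword-based sketch. It is not the article's implementation: the cue list, the function name quota_for_query, and the 75/25 split are all hypothetical choices for illustration.

```python
# Hypothetical cue words that suggest the user wants images back
IMAGE_CUES = ("screenshot", "wallpaper", "image", "picture", "photo")

def quota_for_query(query, top_n=8, image_leaning=0.75, text_leaning=0.25):
    """Choose a per-modality quota from a crude keyword check on the query."""
    q = query.lower()
    wants_images = any(cue in q for cue in IMAGE_CUES)
    image_share = image_leaning if wants_images else text_leaning
    n_image = round(top_n * image_share)
    return {"image": n_image, "text": top_n - n_image}

print(quota_for_query("where do I configure billing"))
print(quota_for_query("show me a dark wallpaper"))
```

With the defaults, the billing question gets six text slots and two image slots, and the wallpaper question gets the reverse. A keyword list is brittle, so treat it as a starting point that you tune against your own query logs.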