GEMINI LAB
API / SDK · 2026-06-15 · Intermediate

Put Help Docs and Screenshots in One File Search Store and Return Answers That Cite the Image Too

Your text help docs and your screenshots live in separate stores, so a single question can never return both the steps and the matching screen. With gemini-embedding-2 going multimodal in File Search, here is how I merged them and returned the cited screenshot alongside the answer.


When you run an app as an indie developer, your support knowledge tends to split into two piles: help articles written as prose, and screenshots of the actual settings screens. Doing support for my wallpaper apps, every time someone asked "where do I restore my purchase?" I would send the written steps, then go dig a screenshot out of a separate folder and paste it in. Two steps, every time.

The real problem is that when text and screenshots sit in separate retrieval systems, a single user question can retrieve prose or an image, but never both. What I actually wanted was to return "here are the steps (text) and here is the screen (image)" as one unit. For a while I ran OCR on the images to force everything into text, but OCR never captured the visual cues that make screenshots useful: which icon to tap and where it sits on the screen.

That limitation went away once File Search started supporting multimodal embeddings through gemini-embedding-2. You can now put text documents and image documents into the same store and search them in the same vector space. This is a walk-through, in the order I actually did it, of merging help docs and screenshots into one File Search store and returning answers that cite the source image too.

Why "text and images in separate stores" gives you half an answer

The technical reason is simple: vectors produced by different embedding models can't be compared. If you index text with a text embedding model and images with an image embedding model, you get two separate vector spaces, and a nearest-neighbor lookup for the query vector can't cross between them. You are left querying text search and image search separately, then awkwardly merging the rankings afterward.

That merge was the painful part. Text scores and image scores live on different scales, so no matter how I tuned the thresholds, an asymmetry remained: "the prose is spot on but the attached screenshot is off," or "the image is right but the caption is stale." In my case the support copy was usually fine; it was the hit rate of the attached screenshots that stayed stubbornly low.
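
To make the scale problem concrete, here is a minimal sketch of a two-retriever merge. The scores are made-up illustrative numbers, and `merge_by_raw_score` and `merge_by_rank` are hypothetical helpers, not part of any SDK. Reciprocal rank fusion is shown only as one common workaround; it is not necessarily what a given pipeline uses.

```python
from typing import Dict, List, Tuple

# Made-up scores for illustration: the text retriever emits values near 0.8,
# while the image retriever emits values near 0.3 for an equally good match.
text_hits: List[Tuple[str, float]] = [
    ("docs/restore-purchase.md", 0.82),
    ("docs/change-theme.md", 0.79),
]
image_hits: List[Tuple[str, float]] = [
    ("shots/settings-restore.png", 0.31),
    ("shots/home.png", 0.12),
]


def merge_by_raw_score(*result_lists: List[Tuple[str, float]]) -> List[str]:
    """Naive merge: sort every hit by raw score, ignoring which retriever made it."""
    merged = [hit for hits in result_lists for hit in hits]
    return [doc for doc, _ in sorted(merged, key=lambda h: h[1], reverse=True)]


def merge_by_rank(*result_lists: List[Tuple[str, float]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    fused: Dict[str, float] = {}
    for hits in result_lists:
        ranked = sorted(hits, key=lambda h: h[1], reverse=True)
        for rank, (doc, _) in enumerate(ranked, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)


raw_order = merge_by_raw_score(text_hits, image_hits)
rrf_order = merge_by_rank(text_hits, image_hits)
print(raw_order)
print(rrf_order)
```

With raw scores, the unrelated theme article outranks the correct screenshot simply because the text retriever emits larger numbers. Rank fusion avoids that, but it throws away how confident each retriever was, so a weak top image hit is treated the same as a strong one.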

Multimodal embeddings map text and images into the same space. The embedding of "restore my purchase" lands near the embedding of a screenshot showing the restore button, so a single query surfaces both near the top. Because the score scale is unified too, you get to collapse all that downstream threshold logic into one path — which was the biggest practical win for me.
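
As a toy illustration of that single retrieval path, the sketch below ranks text and image documents together by cosine similarity. The three-dimensional vectors are hand-written stand-ins, not real gemini-embedding-2 output, and `Doc` and `search` are hypothetical names; File Search does the equivalent work server-side.

```python
import math
from typing import List, NamedTuple, Sequence


class Doc(NamedTuple):
    name: str
    modality: str
    vector: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hand-written toy vectors: the first axis loosely means "restore purchase".
index: List[Doc] = [
    Doc("docs/restore-purchase.md", "text", (0.9, 0.1, 0.0)),
    Doc("shots/settings-restore.png", "image", (0.8, 0.2, 0.1)),
    Doc("docs/change-theme.md", "text", (0.1, 0.9, 0.1)),
    Doc("shots/home.png", "image", (0.0, 0.2, 0.9)),
]


def search(query_vector: Sequence[float], docs: List[Doc], top_k: int = 2) -> List[Doc]:
    """One ranking over every document, regardless of modality."""
    ranked = sorted(docs, key=lambda d: cosine(query_vector, d.vector), reverse=True)
    return ranked[:top_k]


query = (1.0, 0.0, 0.0)  # stand-in for the embedding of "restore my purchase"
top = search(query, index)
for doc in top:
    print(doc.modality, doc.name)
```

Because every document is scored by the same function in the same space, one threshold and one sort cover both modalities.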

Building a mixed store with gemini-embedding-2

First create the store and pin its embedding model to the multimodal gemini-embedding-2. The key is to fix the model at creation time: if you swap the model later without re-indexing, your existing vectors and new vectors end up in different spaces, and retrieval quietly degrades.

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# (1) Create a store backed by a multimodal embedding model
store = client.file_search_stores.create(
    config={
        "display_name": "app-support-kb",
        # Pin one model that embeds text and images into the same space
        "embedding_model": "gemini-embedding-2",
    }
)
print(store.name)  # -> fileSearchStores/app-support-kb-xxxxxxxx
```
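
One way to guard against an accidental model swap is to record the pinned model next to the store name and check it before any code path that creates or reconfigures the store. The sketch below is purely local bookkeeping under stated assumptions: `StoreManifest`, `save_manifest`, `load_manifest`, `assert_same_model`, and the JSON file layout are all hypothetical and not part of the google-genai SDK.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class StoreManifest:
    """Local record of which embedding model a store was created with."""
    store_name: str
    embedding_model: str


def save_manifest(manifest: StoreManifest, path: Path) -> None:
    path.write_text(json.dumps(asdict(manifest)))


def load_manifest(path: Path) -> StoreManifest:
    return StoreManifest(**json.loads(path.read_text()))


def assert_same_model(manifest: StoreManifest, requested_model: str) -> None:
    """Fail loudly instead of letting two vector spaces mix silently."""
    if manifest.embedding_model != requested_model:
        raise ValueError(
            f"{manifest.store_name} was built with {manifest.embedding_model!r}; "
            f"refusing to use {requested_model!r} without a full re-index"
        )


# Usage: write the manifest once at store creation, check it on later runs.
manifest_path = Path(tempfile.mkdtemp()) / "app-support-kb.json"
save_manifest(
    StoreManifest("fileSearchStores/app-support-kb-xxxxxxxx", "gemini-embedding-2"),
    manifest_path,
)
loaded = load_manifest(manifest_path)
assert_same_model(loaded, "gemini-embedding-2")  # same model, so no error
```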

Next, load documents. Both the text help article and the screenshot image are simply uploaded to the same store. What matters here is custom_metadata: because I rely on it later to tell "which modality" and "which screen," I always attach it at upload time.

```python
# (2) Upload a text help article
client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=store.name,
    file="docs/restore-purchase.md",
    config={
        "custom_metadata": [
            {"key": "modality", "string_value": "text"},
            {"key": "screen", "string_value": "settings"},
            {"key": "locale", "string_value": "en"},
        ]
    },
)

# (3) Upload a screenshot into the same store
client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=store.name,
    file="shots/settings-restore.png",
    config={
        "custom_metadata": [
            {"key": "modality", "string_value": "image"},
            {"key": "screen", "string_value": "settings"},
            {"key": "locale", "string_value": "en"},
        ]
    },
)
```
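
Because the same three keys repeat on every upload, it can help to derive them in one place. The sketch below builds the `custom_metadata` list in the shape used above and infers `modality` from the file extension. `build_metadata` and `IMAGE_SUFFIXES` are hypothetical conveniences, not SDK features, and the extensions listed are an assumption rather than the official supported-format list.

```python
from pathlib import Path
from typing import Dict, List

# Assumed image extensions for this sketch only; check the File Search
# documentation for the formats that are actually supported.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def build_metadata(path: str, screen: str, locale: str) -> List[Dict[str, str]]:
    """Return custom_metadata entries with modality inferred from the suffix."""
    modality = "image" if Path(path).suffix.lower() in IMAGE_SUFFIXES else "text"
    fields = {"modality": modality, "screen": screen, "locale": locale}
    return [{"key": key, "string_value": value} for key, value in fields.items()]


text_meta = build_metadata("docs/restore-purchase.md", "settings", "en")
image_meta = build_metadata("shots/settings-restore.png", "settings", "en")
print(text_meta)
print(image_meta)
```

The returned list can be passed as `config={"custom_metadata": ...}` in the upload calls shown earlier, which keeps key names and spelling identical across text and image documents.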

Keeping a business-level key like screen consistent lets you later check whether the prose and the screenshot point at the same screen. I skipped this on my first pass and was treated to the restore-steps article being paired with a screenshot of the home screen. Think of the metadata less as something for search and more as something for verifying the consistency of the answer.
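
That consistency check can be sketched as a small post-retrieval filter. The `Citation` shape is hypothetical: it assumes each retrieved chunk has already been mapped back to the metadata attached at upload time, and how you obtain that mapping depends on what the response exposes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Citation:
    """Hypothetical flattened view of one retrieved source and its metadata."""
    source: str
    modality: str
    screen: str


def pair_text_with_image(
    citations: List[Citation],
) -> Tuple[Optional[Citation], Optional[Citation]]:
    """Pick the top text citation, then the first image on the same screen."""
    text = next((c for c in citations if c.modality == "text"), None)
    if text is None:
        return None, None
    image = next(
        (c for c in citations if c.modality == "image" and c.screen == text.screen),
        None,
    )
    return text, image


# Citations in ranked order; the home-screen shot ranks above the right one.
retrieved = [
    Citation("docs/restore-purchase.md", "text", "settings"),
    Citation("shots/home.png", "image", "home"),
    Citation("shots/settings-restore.png", "image", "settings"),
]
text, image = pair_text_with_image(retrieved)
print(text.source, image.source)
```

If no image shares the text's `screen` value, the function returns `None` for the image, which is a reasonable signal to send the steps without a screenshot rather than attach a mismatched one.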


What You'll Learn

  • How to mix text and image documents in one File Search store and search both modalities with a single query
  • How to read the grounding metadata to tell whether a citation is text or an image, and return the matching screenshot with the answer
  • The production gotchas I hit around image resizing, supported formats, and query-time image tokens