API / SDK · 2026-06-15 · Intermediate

Put Help Docs and Screenshots in One File Search Store and Return Answers That Cite the Image Too

Your text help docs and your screenshots live in separate stores, so a single question can never return both the steps and the matching screen. With gemini-embedding-2 going multimodal in File Search, here is how I merged them and returned the cited screenshot alongside the answer.


When you run an app as an indie developer, your support knowledge tends to split into two piles: help articles written as prose, and screenshots of the actual settings screens. Doing support for my wallpaper apps, every time someone asked "where do I restore my purchase?" I would send the written steps, then go dig a screenshot out of a separate folder and paste it in. Two steps, every time.

The real problem is that when text and screenshots sit in separate retrieval systems, a single user question can retrieve either prose or an image, never both. What I actually wanted was to return "here are the steps (text) and here is the screen (image)" as one unit. For a while I ran OCR on the images to force both onto the same playing field as text, but OCR never captured the visual cues that make screenshots useful: which icon to tap and where it sits on the screen.

That changed once gemini-embedding-2 started supporting multimodal embeddings in File Search. You can put text documents and image documents into the same store and search them in the same vector space. This is a walk-through, in the order I actually did it, of merging help docs and screenshots into one File Search store and returning answers that cite the source image too.

Why "text and images in separate stores" gives you half an answer

The technical reason is simple: vectors produced by different embedding models can't be compared. If you index text with a text embedding and images with an image embedding, you end up with two separate vector spaces, and "take the nearest neighbors of the query vector" can't cross between them. So you end up querying text search and image search separately, then awkwardly merging the rankings afterward.

That merge was the painful part. Text scores and image scores live on different scales, so no matter how I tuned the thresholds, an asymmetry remained: "the prose is spot on but the attached screenshot is off," or "the image is right but the caption is stale." In my case the support copy felt fine, while only the hit rate of the attached screenshots stayed stubbornly low.

Multimodal embeddings map text and images into the same space. The embedding of "restore my purchase" lands near the embedding of a screenshot showing the restore button, so a single query surfaces both near the top. Because the score scale is unified too, you get to collapse all that downstream threshold logic into one path — which was the biggest practical win for me.
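To make the shared-space idea concrete, here is a toy sketch. The 3-dimensional vectors and the extra document names are invented for illustration and are not real gemini-embedding-2 output; the point is only that one cosine ranking covers text and image documents alike once they live in the same space.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: text and image documents share one space
DOCS = {
    "restore-purchase.md": [0.9, 0.1, 0.0],
    "settings-restore.png": [0.8, 0.2, 0.1],
    "home-screen.png": [0.1, 0.9, 0.2],
    "pricing.md": [0.0, 0.2, 0.9],
}

def rank(query_vec, docs=DOCS, top_k=2):
    """Return the top_k (name, score) pairs, best first, regardless of modality."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Stand-in for the embedding of "restore my purchase"
query = [1.0, 0.1, 0.0]
top = rank(query)
print([name for name, _ in top])
```

With these made-up numbers, the help article and the matching screenshot rank first and second from a single query, and both scores sit on the same scale, so one threshold can apply to both.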

Building a mixed store with gemini-embedding-2

First create the store and pin the embedding model to the multimodal gemini-embedding-2. The key is to fix the embedding model at creation time. If you swap only the model later, your existing vectors and new vectors end up in different spaces, and retrieval quietly degrades.

# pip install google-genai
from google import genai
from google.genai import types
 
client = genai.Client()  # reads GEMINI_API_KEY from the environment
 
# (1) Create a store backed by a multimodal embedding model
store = client.file_search_stores.create(
    config={
        "display_name": "app-support-kb",
        # Pin one model that embeds text and images into the same space
        "embedding_model": "gemini-embedding-2",
    }
)
print(store.name)  # -> fileSearchStores/app-support-kb-xxxxxxxx

Next, load documents. Both the text help article and the screenshot image are simply uploaded to the same store. What matters here is custom_metadata: because I rely on it later to tell "which modality" and "which screen," I always attach it at upload time.

# (2) Upload a text help article
# The call returns a long-running operation; indexing may still be in progress
text_op = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=store.name,
    file="docs/restore-purchase.md",
    config={
        "custom_metadata": [
            {"key": "modality", "string_value": "text"},
            {"key": "screen", "string_value": "settings"},
            {"key": "locale", "string_value": "en"},
        ]
    },
)

# (3) Upload a screenshot into the same store
# Same metadata keys as the article, so the two can be matched later
image_op = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=store.name,
    file="shots/settings-restore.png",
    config={
        "custom_metadata": [
            {"key": "modality", "string_value": "image"},
            {"key": "screen", "string_value": "settings"},
            {"key": "locale", "string_value": "en"},
        ]
    },
)
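One thing the upload calls do not show: they return a long-running operation, and a query sent before that operation finishes may not see the new document yet. Below is a minimal polling sketch that follows the `operation.done` / `client.operations.get(...)` pattern from the google-genai SDK. The helper name `wait_for_import` and the default poll and timeout values are hypothetical.

```python
import time

def wait_for_import(client, operation, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll a long-running upload operation until it reports done.

    `client` is a genai.Client and `operation` is the value returned by
    upload_to_file_search_store. Defaults are illustrative, not recommended values.
    """
    deadline = time.monotonic() + timeout_seconds
    while not operation.done:
        if time.monotonic() >= deadline:
            raise TimeoutError("File Search import did not finish in time")
        time.sleep(poll_seconds)
        # Refresh the operation status from the API
        operation = client.operations.get(operation)
    return operation
```

Calling this once per upload, with the value each upload call returned, before sending the first question keeps "the document is not indexed yet" from looking like a retrieval miss.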

Keeping a business-level key like screen consistent lets you later check whether the prose and the screenshot point at the same screen. I skipped this on my first pass and was treated to the restore-steps article being paired with a screenshot of the home screen. Think of the metadata less as something for search and more as something for verifying the consistency of the answer.
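Since the matching later depends on these keys being spelled the same way on every document, it can help to build the metadata in one place. This is a sketch under that assumption; the helper `build_metadata` and the allowed-value set are hypothetical, and the output shape simply mirrors the `custom_metadata` lists used in the uploads above.

```python
ALLOWED_MODALITIES = {"text", "image"}

def build_metadata(modality: str, screen: str, locale: str = "en") -> list[dict]:
    """Return custom_metadata entries with consistent keys for one document."""
    if modality not in ALLOWED_MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    if not screen:
        # An empty screen tag would defeat the later prose/screenshot matching
        raise ValueError("screen must be a non-empty string")
    return [
        {"key": "modality", "string_value": modality},
        {"key": "screen", "string_value": screen},
        {"key": "locale", "string_value": locale},
    ]

# Same screen tag for the article and its screenshot
text_meta = build_metadata("text", "settings")
image_meta = build_metadata("image", "settings")
```

A typo such as `"img"` or a missing screen tag then fails at ingest time instead of surfacing later as a mismatched screenshot.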


Make it cite images too: read the modality from grounding metadata

On the generation side, you just hand the model the file_search tool. The important thing is the grounding_metadata in the response, which records "which documents were used as grounding." Since text and images are mixed, you split the citations by the modality you attached earlier.

QUESTION = "I want to restore my purchase. Where do I do that?"
 
resp = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=QUESTION,
    config=types.GenerateContentConfig(
        tools=[types.Tool(
            file_search=types.FileSearch(
                file_search_store_names=[store.name],
            )
        )],
    ),
)
 
print(resp.text)  # the prose answer
 
# Sort citations by modality
text_refs, image_refs = [], []
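gm = resp.candidates[0].grounding_metadata  # may be None if nothing was retrieved
for chunk in ((gm.grounding_chunks if gm else None) or []):
    ctx = chunk.retrieved_context
    if ctx is None:  # skip chunks that did not come from File Search
        continue
    meta = {m.key: m.string_value for m in (ctx.custom_metadata or [])}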
    if meta.get("modality") == "image":
        image_refs.append(ctx.title)   # e.g. settings-restore.png
    else:
        text_refs.append(ctx.title)    # e.g. restore-purchase.md
 
print("text sources:", text_refs)
print("image sources:", image_refs)

Now you have the prose answer and the filenames of the screenshots that grounded it at the same time. The answer UI just renders the image URLs that correspond to image_refs, and you get a reply where the explanation and the screen match. In my support flow this was the turning point: compared with the days of sending prose only, follow-up questions like "I can't find it on the screen" dropped noticeably.
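How the filenames become image URLs depends on where the screenshots are hosted. As a minimal sketch, assuming they are served from a single base URL under their original filenames (the `ASSET_BASE_URL` value and the helper name are hypothetical):

```python
from urllib.parse import quote

# Hypothetical location where the screenshots are served from
ASSET_BASE_URL = "https://example.com/support-assets"

def screenshot_urls(image_refs, base_url=ASSET_BASE_URL):
    """Map cited screenshot filenames to URLs the answer UI can render."""
    base = base_url.rstrip("/")
    # quote() escapes spaces and other characters that are unsafe in a URL path
    return [f"{base}/{quote(name)}" for name in image_refs if name]

urls = screenshot_urls(["settings-restore.png", "home screen.png"])
print(urls)
```

If the hosted filenames differ from the uploaded ones, a lookup table keyed by filename would replace the string formatting here.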

What actually bit me: image preprocessing and tokens

Getting to a clean run involved a few traps. Here are the ones worth clearing before you ship.

First, don't upload giant screenshots as-is. Modern phones are high resolution, and the screenshots I captured for App Store submission are tall and heavy, around 1290x2796. I started by uploading them at full resolution, but that bought no retrieval quality: results held up fine even at roughly thumbnail size, and the only thing that grew was upload and indexing time. In my experience, resizing the long edge to around 1024px before upload strikes the best balance of accuracy and cost.

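from pathlib import Path

from PIL import Image

def shrink(path: str, max_side: int = 1024) -> str:
    # JPEG has no alpha channel; any transparent pixels are flattened by convert("RGB")
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = min(1.0, max_side / max(w, h))  # never upscale small images
    if scale < 1.0:
        img = img.resize((round(w * scale), round(h * scale)))
    # Build the name from the stem so non-.png inputs never overwrite the source file
    src = Path(path)
    out = src.with_name(f"{src.stem}-small.jpg")
    img.save(out, "JPEG", quality=85)  # lighter than PNG, fine for UI display
    return str(out)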

Second, formats. PNG, JPEG, and WebP go through cleanly, but HEIC — increasingly common from iOS screenshots — was sometimes rejected as-is. Normalizing device-dependent formats to JPEG at the entrance of your ingest pipeline avoids surprises. Making the shrink above double as that normalizing entry point is the pragmatic move.
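One way to structure that entry point is to decide per file, by extension, what has to happen before the resize. This is a sketch only: the route names and the `.heif` entry are assumptions, and it deliberately stops short of decoding, because Pillow does not read HEIC out of the box and needs a plugin (pillow-heif is one option) for that step.

```python
from pathlib import Path

# Formats the article reports as going through cleanly
DIRECT = {".png", ".jpg", ".jpeg", ".webp"}
# Device-dependent formats that need a decoder plugin first (.heif is an assumption)
NEEDS_DECODER = {".heic", ".heif"}

def ingest_route(path: str) -> str:
    """Classify a file as 'shrink', 'decode_then_shrink', or 'skip'."""
    suffix = Path(path).suffix.lower()  # case-insensitive: IMG_0001.HEIC counts too
    if suffix in DIRECT:
        return "shrink"
    if suffix in NEEDS_DECODER:
        return "decode_then_shrink"
    return "skip"

for name in ["shots/settings-restore.png", "IMG_0001.HEIC", "notes.txt"]:
    print(name, "->", ingest_route(name))
```

Routing on the lowercased suffix also catches uppercase extensions, which a plain string replace on ".png" would miss.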

Third, image tokens at query time. In multimodal retrieval, the fetched images ride along in the model context, so pulling too many burns tokens for nothing. I capped retrieval to the top results and limited image citations to at most two per answer. Narrowing from the start beats "pull everything then throw most away," and latency drops cleanly as a result.
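The per-answer cap on image citations can be sketched as a small helper. It also removes repeats, on the assumption that the same file can appear in more than one grounding chunk; the name `cap_unique` is hypothetical.

```python
def cap_unique(refs, limit):
    """Keep the first `limit` distinct references, preserving retrieval order."""
    if limit <= 0:
        return []
    seen, kept = set(), []
    for ref in refs:
        if ref in seen:
            continue  # same file cited by another chunk
        seen.add(ref)
        kept.append(ref)
        if len(kept) == limit:
            break
    return kept

capped = cap_unique(
    ["settings-restore.png", "settings-restore.png", "settings-top.png", "home.png"],
    limit=2,
)
print(capped)
```

Preserving order matters here: the earliest citations are kept, so the cap trims from the tail of the list and never from the head.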

Returning the matching screenshot with the answer

Let's roll all of this into one function, the kind you would call behind a support bot or a contact form. The trick for stable operation is to keep only the screenshots whose screen metadata matches the screen the prose refers to, and silently drop mismatched images.

def answer_with_screenshot(question: str, max_images: int = 2) -> dict:
    resp = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=question,
        config=types.GenerateContentConfig(
            tools=[types.Tool(
                file_search=types.FileSearch(
                    file_search_store_names=[store.name],
                )
            )],
        ),
    )
    gm = resp.candidates[0].grounding_metadata  # may be None if nothing was retrieved
    chunks = (gm.grounding_chunks if gm else None) or []

    # Flatten each File Search chunk into (title, metadata dict) once
    cited = []
    for chunk in chunks:
        ctx = chunk.retrieved_context
        if ctx is None:
            continue
        meta = {m.key: m.string_value for m in (ctx.custom_metadata or [])}
        cited.append((ctx.title, meta))

    # Screens the prose sources cover; documents without a screen tag are ignored
    text_screens = {
        meta["screen"] for _, meta in cited
        if meta.get("modality") == "text" and meta.get("screen")
    }
    # keep only screenshots whose screen matches what the prose covered
    images = [
        title for title, meta in cited
        if meta.get("modality") == "image" and meta.get("screen") in text_screens
    ]
 
    return {"answer": resp.text, "screenshots": images[:max_images]}

Once I added this "only attach screenshots for screens the prose touched" matching condition, stray images basically stopped slipping in. The design idea is to not leave precision entirely to the model, but to tighten things one last notch with business metadata.
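The matching condition can be exercised without calling the API by feeding it stand-in chunks. In the sketch below, `fake_chunk` and `matched_screenshots` are hypothetical helpers; the stand-ins copy only the attributes the article's code reads (`retrieved_context`, `title`, `custom_metadata`, `key`, `string_value`).

```python
from types import SimpleNamespace

def fake_chunk(title, **meta):
    """Build a stand-in grounding chunk with the attributes the matcher reads."""
    custom = [SimpleNamespace(key=k, string_value=v) for k, v in meta.items()]
    ctx = SimpleNamespace(title=title, custom_metadata=custom)
    return SimpleNamespace(retrieved_context=ctx)

def matched_screenshots(chunks, max_images=2):
    """Return image titles whose screen tag is also covered by a text source."""
    cited = []
    for chunk in chunks:
        ctx = chunk.retrieved_context
        meta = {m.key: m.string_value for m in (ctx.custom_metadata or [])}
        cited.append((ctx.title, meta))
    text_screens = {m.get("screen") for _, m in cited if m.get("modality") == "text"}
    images = [
        title for title, m in cited
        if m.get("modality") == "image" and m.get("screen") in text_screens
    ]
    return images[:max_images]

chunks = [
    fake_chunk("restore-purchase.md", modality="text", screen="settings"),
    fake_chunk("home.png", modality="image", screen="home"),
    fake_chunk("settings-restore.png", modality="image", screen="settings"),
]
result = matched_screenshots(chunks)
print(result)
```

Here the home-screen image is dropped because no text source covers the "home" screen, which is the stray-image case the condition is meant to prevent.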

How much it helped, and when to actually use it

For my app support, I gathered roughly 40 help articles and around 80 screenshots into a single store. The most noticeable change was how self-sufficient the first reply became. Previously a reply meant sending prose and then chasing it with a screenshot; with the mixed store, the first reply already carries the screen, so more inquiries needed no extra back-and-forth. Multilingual support works too: because the locale metadata separates content by language, an English inquiry gets English help plus a screenshot of the same screen.
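One way to enforce the language split at query time is a metadata filter on the File Search tool. The sketch below only builds the filter string. It assumes the `metadata_filter` option on `types.FileSearch` and AIP-160 style quoting of string values; both should be checked against the current File Search documentation before relying on them. The helper name is hypothetical.

```python
def locale_filter(locale: str) -> str:
    """Build a filter expression that restricts retrieval to one locale."""
    # Accept only simple tags such as "en" or "pt-BR" so nothing needs escaping
    if not locale.replace("-", "").isalnum():
        raise ValueError(f"unexpected locale tag: {locale!r}")
    return f'locale="{locale}"'

expr = locale_filter("en")
print(expr)
# Assumed usage (not executed here):
#   types.FileSearch(file_search_store_names=[store.name], metadata_filter=expr)
```

Filtering at retrieval time keeps other-language documents out of the candidate set entirely, so there is nothing to discard afterward.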

That said, not everything should go multimodal. For a pure-text FAQ where images barely matter, a text-only store indexes lighter and faster. My rule of thumb is "does a meaningful share of answers want a screen attached?" For support and how-to guides, where visual cues pay off, I recommend the mixed store; for prose-dominant corpora like terms of service or pricing tables, I leave them split rather than forcing images in.

One more operational habit: when you update the app UI, swap the screenshots too. Leaving stale screens around produces that confusing mismatch where the prose is current but the attached image is a version behind. I folded "update the screenshots in the store" into the routine I run whenever I ship a new build to the App Store or Google Play.
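To find which screenshots actually changed between builds, a content-hash manifest is one option. This is a sketch under assumptions: the manifest is a plain dict of filename to SHA-256 digest that you persist yourself, screenshots are `.png` files in one folder, and the function names are hypothetical. Re-uploading the changed files is left to the store calls shown earlier.

```python
import hashlib
from pathlib import Path

def file_digest(path) -> str:
    """SHA-256 of the file contents, used as a change fingerprint."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def changed_screenshots(folder, manifest: dict) -> list[str]:
    """List .png files that are new or differ from the stored digest."""
    changed = []
    for path in sorted(Path(folder).glob("*.png")):
        if manifest.get(path.name) != file_digest(path):
            changed.append(path.name)
    return changed

def updated_manifest(folder) -> dict:
    """Snapshot the current digests, to save after a successful re-upload."""
    return {p.name: file_digest(p) for p in sorted(Path(folder).glob("*.png"))}
```

Running `changed_screenshots` as part of the release routine turns "remember to swap the screenshots" into a concrete list of files to replace in the store.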

Your next step

Start by putting just three of your help articles and three matching screenshots into the same store, and send one question to answer_with_screenshot. Whether the prose and the screen come back aligned hinges heavily on how you set the screen metadata. Getting a feel for the matching condition on a small set, then expanding to the full corpus, looks like the long way around but has been the reliable one for me.
