Letting File Search's Multimodal Mode Find Wallpapers I Couldn't: A Field Report

"That sunset-over-the-ocean wallpaper with the purple grading — which folder was it in?" Last week, while assembling a featured collection for one of my wallpaper apps, I spent over ten minutes hunting for that single image. The category looked right. I walked the tags. Still nothing. The reason turned out to be mundane: the image had been filed under "sky," not "ocean."

As an indie developer, I maintain several thousand wallpaper assets across my apps. For organizing them, I rely on automatic classification into 30 categories with Gemini Vision — I wrote about that setup in my earlier field report on auto-classifying wallpapers with Gemini Vision, and the accuracy has held up well. But classification means putting each image into exactly one box. A nuance that spans boxes — "sunset" and "ocean" and "purple" — slips right through the structure.

Then came the news that File Search now supports native image embedding and retrieval via gemini-embedding-2. If a natural-language query can search the images themselves, maybe this cross-box treasure hunt finally goes away. Here is what I learned from a 300-image trial.

Classification wasn't broken — so why did I want search?

To be honest, category-based browsing works fine for end users. The pain was entirely on the operations side.

Curating featured pages, picking source material for store screenshots, swapping seasonal campaigns — for these tasks I search by impression, not by category. "A quiet, cool-toned morning." "A city night view with an open line of sight." A vocabulary like that doesn't compress into 30 categories, so I would end up scrolling thumbnail grids by eye. Five to fifteen minutes per hunt, several times a month.

Adding more tags was never a real option. The finer a tag taxonomy gets, the less consistently it gets applied, and the maintenance cost quietly devours whatever the search gains. That's not theory — it's what managing thousands of images has taught me. What I wanted was search without designing a taxonomy at all.

Getting 300 images into a store and searchable

I used the Python SDK (google-genai). There are only three moves: create a store, import images, query.

First, creating the store and uploading. One rule I now treat as non-negotiable: always attach an asset ID in custom_metadata so results can be joined back to your own asset database.

import pathlib
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Create a trial File Search store
store = client.file_search_stores.create(
    config={"display_name": "wallpaper-assets-trial"}
)

# Import the 300 images from the June 2026 intake batch
for path in sorted(pathlib.Path("./assets/2026-06").glob("*.jpg")):
    # The file stem doubles as the asset ID, so it goes into both
    # display_name and custom_metadata
    client.file_search_stores.upload_to_file_search_store(
        file_search_store_name=store.name,
        file=str(path),
        config={
            "display_name": path.stem,
            "custom_metadata": [
                {"key": "asset_id", "string_value": path.stem},
            ],
        },
    )

Why this matters: search results come back as retrieval chunks, and without an ID in the metadata you have no reliable way to map a hit back to the file in your own records. I imported my first few dozen images without IDs and had to redo them.

Querying is just a regular generate_content call with File Search passed as a tool.

from google.genai import types

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Find wallpapers of a sunset over the ocean with a purple-leaning tone",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                file_search=types.FileSearch(
                    file_search_store_names=[store.name]
                )
            )
        ]
    ),
)

# For an internal tool, the grounding chunks matter more than the prose
meta = response.candidates[0].grounding_metadata
# Grounding data can be missing when nothing was retrieved, so guard against None
chunks = (meta.grounding_chunks or []) if meta else []
for chunk in chunks:
    # title is expected to match the upload display_name (the asset ID here)
    print(chunk.retrieved_context.title)

When you're building an operations tool, what you actually want is the list of matching images, not the model's narrative answer. Treating grounding_metadata as the primary output and the response text as garnish made the whole implementation much more straightforward.
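One way to treat the chunks as the primary output is a small helper that returns IDs in ranked order. This is only a sketch, not the tool from the trial: `asset_ids_from_response` is a hypothetical name, and it assumes each chunk's `retrieved_context.title` echoes the `display_name` set at upload (the asset ID in this setup).

```python
def asset_ids_from_response(response):
    """Collect chunk titles in ranked order, skipping duplicates.

    Assumption: each title equals the display_name given at upload,
    which in this setup is the asset ID.
    """
    ids = []
    for candidate in getattr(response, "candidates", None) or []:
        # grounding_metadata may be None when nothing was retrieved
        meta = getattr(candidate, "grounding_metadata", None)
        for chunk in getattr(meta, "grounding_chunks", None) or []:
            context = getattr(chunk, "retrieved_context", None)
            title = getattr(context, "title", None)
            # One image can yield several chunks; keep the first occurrence
            if title and title not in ids:
                ids.append(title)
    return ids
```

Calling `asset_ids_from_response(response)` on the result of the query above would give a plain list of IDs to look up in the asset database, and an empty list when no grounding data came back.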

Three walls I hit

First, indexing lag. An image is not searchable the moment you upload it. In my 300-image trial, full availability took a few minutes — seven at the longest. That's fine for a batch-ingest-tonight, use-tomorrow workflow, but it breaks the manual habit of "upload, then immediately check." I ended up appending a wait step to my ingest script that fires a representative query and confirms the new batch is retrievable before exiting.
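The wait step can be sketched as a plain polling loop. This is an illustration under assumptions rather than the actual ingest script: `wait_until_searchable` is a hypothetical name, `search` stands for any callable that runs the representative query and returns the asset IDs it found, and the timeout and interval defaults are placeholders.

```python
import time


def wait_until_searchable(search, expected_id, timeout_s=600.0,
                          interval_s=15.0, sleep=time.sleep,
                          clock=time.monotonic):
    """Poll `search()` until `expected_id` shows up or the timeout passes.

    `search` is a hypothetical callable returning a list of asset IDs.
    Returns True when the ID was retrieved, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if expected_id in search():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting. In an ingest script, `search` would wrap the `generate_content` call shown earlier, and `expected_id` would be an asset from the batch just uploaded.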

Second, mixing semantic and exact conditions. Embedding search is great at "purple-ish sunset over the ocean," but hard constraints like "aspect ratio 19.5:9 or taller" or "ingested before 2024" are simply not its job. Rather than force it, I settled on a two-stage design: File Search proposes semantic candidates, and my own asset database filters by resolution and intake date. Metadata filters exist on the File Search side too, but for numeric range filtering I trust my own database more.
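The second stage of that design might look like the sketch below. Everything in it is illustrative: the `Asset` record, its field names, and `filter_candidates` are hypothetical stand-ins for whatever the asset database exposes, and the first stage is assumed to have already produced a ranked list of asset IDs.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Asset:
    # Hypothetical record shape; real schemas will differ
    asset_id: str
    width: int
    height: int
    ingested: date


def filter_candidates(candidate_ids, assets_by_id,
                      min_aspect=19.5 / 9, ingested_before=None):
    """Apply exact constraints to semantic candidates, keeping their order."""
    kept = []
    for asset_id in candidate_ids:
        asset = assets_by_id.get(asset_id)
        if asset is None:
            continue  # hit with no matching record in the asset database
        if asset.height / asset.width < min_aspect:
            continue  # not tall enough (portrait ratio as height / width)
        if ingested_before is not None and asset.ingested >= ingested_before:
            continue  # ingested on or after the cutoff date
        kept.append(asset_id)
    return kept
```

Keeping the semantic ranking and only dropping non-matching entries means the exact filter never reorders what the search considered most relevant.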

Third, the cost model — which I actually came to like. File Search doesn't charge for query-time embeddings; you pay for embedding work at indexing time. In other words, you pay once at ingest, and searching afterward stays cheap. For 300 images the indexing cost was pocket change, but before committing thousands of assets I had to ask: do dormant images that nobody will ever search for belong in the store? My answer is no — active assets go in, retired ones get deleted from the store.
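The "active in, retired out" policy reduces to a set difference between the asset database and the store. The sketch below only computes the plan; `plan_store_sync` is a hypothetical helper, and how the indexed IDs are listed or how documents are deleted from the store is deliberately left out.

```python
def plan_store_sync(active_ids, indexed_ids):
    """Return (to_add, to_remove) so the store mirrors the active set.

    active_ids: asset IDs currently marked active in your own database.
    indexed_ids: asset IDs currently present in the File Search store.
    """
    active, indexed = set(active_ids), set(indexed_ids)
    to_add = sorted(active - indexed)      # active but not yet indexed
    to_remove = sorted(indexed - active)   # indexed but since retired
    return to_add, to_remove
```

Since the cost is paid at indexing time, the `to_add` list is also the part of the plan that determines what a sync run will cost.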

I'm keeping classification — the two solve different problems

Before the trial, part of me suspected semantic search might retire the category pipeline altogether. After running 20 test queries — 17 of which surfaced the intended image near the top — my conclusion is to keep both. They occupy different roles.
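Seventeen hits out of 20 is a hit rate of 17 / 20 = 0.85. For repeating that check as the store grows, a hit-rate-at-k helper along these lines could work. It is a sketch: `hit_rate_at_k` is a hypothetical name, and the cutoff `k` is an assumption, since "near the top" was judged by eye in the trial.

```python
def hit_rate_at_k(ranked_results, expected, k=5):
    """Fraction of queries whose intended image appears in the top k.

    ranked_results: query -> list of asset IDs in ranked order.
    expected: query -> the single asset ID the query was meant to find.
    """
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits = sum(
        1
        for query, wanted in expected.items()
        if wanted in ranked_results.get(query, [])[:k]
    )
    return hits / len(expected)
```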

Where category classification wins: end-user browsing. The stability of the "boxes" is itself the value; in a screen people open daily, predictability builds trust.
Where multimodal search wins: operations-side hunting. One-off queries whose vocabulary changes every time. Reducing taxonomy maintenance to zero is the real payoff.
Where they meet: recurring search vocabulary ("sunset," "cool tones") becomes evidence for the next category redesign, or for the vocabulary of an in-app search feature.

And that purple sunset from the opening? File Search returned it as the top hit on my very first query. Ten minutes of squinting at thumbnails collapsed into seconds, and I felt a quiet surge of warmth watching it happen.

Next steps

I'm not importing the full archive yet. The plan is to pipe only new intake batches into the store, log how many minutes of real work the search replaces each month, and decide from the numbers. Semantic search feels convenient in a way that's easy to overtrust — without measurements it risks becoming a standing cost with an unverified benefit.
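The monthly log can stay very simple. A minimal sketch, assuming each entry is a (date, estimated minutes saved) pair recorded by hand after a search; `minutes_saved_by_month` is a hypothetical name and the entry format is an assumption.

```python
from collections import defaultdict
from datetime import date


def minutes_saved_by_month(entries):
    """Sum estimated minutes saved per calendar month.

    entries: iterable of (date, minutes) pairs, one per search session.
    Returns a dict keyed by "YYYY-MM".
    """
    totals = defaultdict(int)
    for day, minutes in entries:
        totals[day.strftime("%Y-%m")] += minutes
    return dict(totals)
```

Setting these monthly totals against the indexing spend for the same period gives the numbers the decision is supposed to rest on.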

If you maintain image assets that tags never quite capture, I'd suggest starting with a small store of a few hundred images. Once you account for indexing lag and metadata design, the whole trial fits in half a day. I hope this record is useful to anyone wrestling with the same problem.