Pre-Screening Wallpaper App Submissions with Gemini Vision: A Two-Week Field Memo

One morning a "possible policy violation" email arrived from Google Play, and the rest of my day quietly fell apart. The cause was a single image buried in a wallpaper batch that contained a small, trademark-like logo in one corner. My own eyes had walked right past it.

I have been building wallpaper apps as an indie developer since 2014, and the catalog has grown past 50 million cumulative downloads. Yet my pre-submission check had always been manual. Reviewing dozens of images one by one wears down your attention exactly when you can least afford to lose it. So I spent two weeks testing a simple question: if I insert Gemini's image understanding as a first-pass filter before submission, how much of that burden actually goes away? Here is what I found.

Why manual review stopped being enough

A wallpaper update often ships several dozen images at once, sometimes more than a hundred. The time you can spend per image is tiny, and the things that trip review are never the same twice: copyrighted material, overly revealing imagery, violent motifs, misleading text. You have to hold several criteria in your head at once, and human attention reliably drops during repetitive work.

My grandfather, a temple carpenter, reportedly inspected every piece of timber before joining it. The act of checking with his own hands was, I was told, a kind of devotion. I do not want to treat checking as a chore either, but relying on my eyes alone meant missing the one thing that mattered at the worst moment. That is exactly why I wanted a first-pass filter to back up human attention.

What I asked Gemini to look at

I built a pipeline that hands every image in the submission folder to Gemini and returns a structured "risk: high / review / none" judgment for each predefined criterion. I chose a fast, low-cost Flash model. Because the per-image cost is small, running it across a hundred images is painless.

import os
from enum import Enum

import google.generativeai as genai
from pydantic import BaseModel

# Assumes the API key is available in the GOOGLE_API_KEY environment variable.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

class Risk(str, Enum):
    none = "none"
    review = "review"
    high = "high"

class Verdict(BaseModel):
    # One rating per criterion, plus a short free-text reason.
    trademark_or_logo: Risk
    explicit_content: Risk
    violence: Risk
    misleading_text: Risk
    note: str

model = genai.GenerativeModel("gemini-2.5-flash")

def screen(image_path: str) -> Verdict:
    """Upload one image and return Gemini's per-criterion verdict."""
    img = genai.upload_file(image_path)
    prompt = (
        "Inspect this image from the perspective of a mobile app store reviewer. "
        "Rate four criteria (trademark/logo presence, exposure, violent motifs, "
        "and misleading text) as none / review / high, with a brief reason in note."
    )
    res = model.generate_content(
        [prompt, img],
        # Constrain the reply to JSON matching Verdict so it parses directly.
        generation_config={"response_mime_type": "application/json",
                           "response_schema": Verdict},
    )
    return Verdict.model_validate_json(res.text)
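
To run the function above across a whole submission folder, a loop along these lines may be enough. This is a minimal sketch rather than my exact pipeline: screen_folder, its screen_fn parameter, and the list of file extensions are names assumed for illustration. The screening function is passed in so the loop can be tried with a stub before spending any API calls.

```python
from pathlib import Path
from typing import Callable, Dict, TypeVar

V = TypeVar("V")

# Assumed set of wallpaper formats; adjust to what the folder really holds.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}

def screen_folder(folder: str, screen_fn: Callable[[str], V]) -> Dict[str, V]:
    """Apply screen_fn to every image file in folder, keyed by file name."""
    results: Dict[str, V] = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip subfolders and non-image files such as notes or metadata.
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = screen_fn(str(path))
    return results
```

Passing the screen function from above as screen_fn would give one verdict per file name, which is the shape the downstream aggregation needs.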

I fixed the output to a JSON schema so I could aggregate it directly downstream. The rule became three-tiered: any single high sends the image to my own eyes, a review gets judged after I read the reason in note, and images rated none on every criterion pass through.
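
The three-tier rule can be written down in a few lines. The following is a sketch under assumptions: the function name triage and the route labels "manual", "read_note", and "pass" are hypothetical, chosen only to make the rule concrete.

```python
from enum import Enum
from typing import Mapping

class Risk(str, Enum):
    none = "none"
    review = "review"
    high = "high"

def triage(ratings: Mapping[str, Risk]) -> str:
    """Route one image from its per-criterion ratings."""
    levels = set(ratings.values())
    if Risk.high in levels:
        return "manual"      # any single high: I look at it myself
    if Risk.review in levels:
        return "read_note"   # decide after reading the reason
    return "pass"            # none on every criterion: let it through
```

With a pydantic v2 model, the ratings mapping could plausibly be obtained with verdict.model_dump(exclude={"note"}), so that the free-text reason does not take part in the routing.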

Where it pushed back

The first few days surprised me with false positives. The "misleading text" criterion was especially sensitive, repeatedly mistaking decorative English lettering on a wallpaper for a brand name. Once I made the criterion concrete by adding "mark high only when it is identifiable as a real existing brand," the noise visibly dropped. Changing behavior by adding a single line to the prompt is part of the quiet joy of working with an API.
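
One way to keep such one-line adjustments manageable is to hold them apart from the base instruction. This is only a sketch: the rule text is the one quoted above, while BASE_PROMPT, RULES, and build_prompt are names assumed for illustration.

```python
from typing import Sequence

BASE_PROMPT = (
    "Inspect this image from the perspective of a mobile app store reviewer. "
    "Rate four criteria (trademark/logo presence, exposure, violent motifs, "
    "and misleading text) as none / review / high, with a brief reason in note."
)

# Tightening rules, one sentence each, appended after the base instruction.
RULES: Sequence[str] = (
    "Mark high only when it is identifiable as a real existing brand.",
)

def build_prompt(base: str = BASE_PROMPT, rules: Sequence[str] = RULES) -> str:
    """Join the base instruction and the tightening rules, one per line."""
    return "\n".join([base, *rules])
```

Keeping each rule as a separate entry makes it easy to add or drop one and rerun the same batch to see whether the noise changes.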

What it struggled with was the impression created by a composition as a whole. There was one image where every element was fine individually, yet the combination could be misread by a reviewer. Gemini returned none there, and in the end it was my own unease that led me to swap it out. It is good at decomposing elements, but reading the overall mood is still a human's job.

Effect and limits after two weeks

As a rough impression rather than a measured figure, the volume sent to manual review dropped by more than half. Because I can pass none-only batches with confidence, I can redirect my concentration toward review and high. As a test I re-ran an old batch, and it caught the logo-in-the-corner image I had missed back then, flagging it high.
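
To turn that impression into an actual figure, the share of images still needing a human look can be computed per batch. A minimal sketch, assuming each image has already been given a route label and that the hypothetical label "pass" means it was cleared without review:

```python
from collections import Counter
from typing import Iterable

def manual_share(routes: Iterable[str]) -> float:
    """Fraction of images that still need a human look (anything not 'pass')."""
    counts = Counter(routes)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty batch needs no review
    return 1 - counts["pass"] / total
```

Comparing this value against 1.0, the share under fully manual review, gives the reduction for a given batch.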

Still, I would not call this "automating review." Gemini catches only what can be decomposed into stated criteria. Store policies keep changing in their details, and judgments involving cultural nuance still rest with me. These apps carry my revenue base, including AdMob, so I have decided not to let go of the final call. The working relationship that fit best was this: Gemini is not a replacement decision-maker but an assistant that steers my attention to where it is needed.

What I am trying next

Next I am building a loop that accumulates the reason text behind each review verdict and compares it against my own swap decisions. Once I can see which criteria Gemini and I disagree on, I should be able to tune the prompt closer to my own app's standards. For fellow indie developers shipping large volumes of imagery to the stores, I would say image understanding as a first-pass filter has already reached a genuinely practical stage.
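
The comparison loop is not built yet, but its core might look something like this. It is a sketch under assumptions: the Record fields, the disagreements function, and the two disagreement labels are all hypothetical, and the ratings are kept as plain strings to stay independent of any SDK.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass
class Record:
    image: str
    ratings: Dict[str, str]  # criterion -> "none" / "review" / "high"
    note: str                # Gemini's reason text
    swapped: bool            # my own decision: did I replace the image?

def disagreements(records: Iterable[Record]) -> Counter:
    """Count, per criterion, the cases where Gemini and I disagreed."""
    counts: Counter = Counter()
    for r in records:
        flagged = [c for c, level in r.ratings.items() if level != "none"]
        if flagged and not r.swapped:
            # Gemini raised something I kept anyway: a likely false positive.
            for c in flagged:
                counts[(c, "flagged_but_kept")] += 1
        elif not flagged and r.swapped:
            # I swapped an image Gemini cleared: a miss tied to no criterion.
            counts[("(none)", "cleared_but_swapped")] += 1
    return counts
```

Sorting the result with most_common() would show which criterion produces the most disagreement, which is where a prompt adjustment is likely to pay off first.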

Thank you for reading to the end.