API / SDK · 2026-06-14 · Advanced

Controlling Image Tokens with the Gemini API media_resolution Setting — Tuning Batch Image Classification by Measurement

media_resolution, introduced in the Gemini 3 line, switches how many tokens an image input consumes across three levels. Through real batch-classification measurements, this guide shows how to balance cost and accuracy by assigning the right tier per task.

I run a wallpaper app as a solo developer, and a daily job auto-categorizes newly added images. One day, while looking at the billing breakdown, I noticed that what was piling up wasn't output or reasoning tokens but input tokens, far more than I expected given how little text I was sending. The reason was simple: I had been sending every image at the highest resolution without thinking about how many tokens a single image costs.

media_resolution, introduced in the Gemini 3 line, is exactly the parameter for controlling that per-image token cost. Most cost-optimization writeups focus on caching or model routing, but for multimodal-heavy workloads the input image resolution tier is often the biggest lever available. In this article, using real measurements from my wallpaper classification pipeline, I walk through how to assign tiers per task so that cost comes down without giving up accuracy.

What media_resolution Is — The Setting That Decides Image Token Cost

media_resolution controls how many tokens Gemini internally converts an input image or video frame into. Conceptually the value has three levels — low, medium, high. The lower the level, the fewer tokens per image; the higher, the more fine detail the model can read.

The key idea is that this is not a "reduce image quality" setting but a "choose the granularity of the representation handed to the model" setting. Even a coarse tier conveys big-picture features just fine: overall composition, dominant colors, the rough type of subject. But reading small embedded text, or telling apart subtly different patterns, requires a higher tier. In other words, the right tier is determined by what information in the image your task actually needs.

One caveat: the exact token count per tier varies with the model version and the image aspect ratio. Rather than trusting published ballpark figures, it's more reliable to measure on your own workload. The next section builds that harness.

Measure First — Read Per-Tier Tokens via usage_metadata

Before optimizing, capture the current state as numbers. Gemini API responses include usage_metadata, whose prompt_token_count is the input-side token consumption. Send the same image and the same prompt while changing only the tier, and you can compare the tier's effect in isolation.

from google import genai
from google.genai import types
 
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
MODEL = "gemini-3.5-flash"  # pin the version in production
 
TIERS = {
    "low": types.MediaResolution.MEDIA_RESOLUTION_LOW,
    "medium": types.MediaResolution.MEDIA_RESOLUTION_MEDIUM,
    "high": types.MediaResolution.MEDIA_RESOLUTION_HIGH,
}
 
def measure_tokens(image_bytes: bytes, prompt: str) -> dict[str, int]:
    """Measure input tokens per tier with the same image and prompt."""
    result = {}
    img_part = types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg")
    for name, tier in TIERS.items():
        resp = client.models.generate_content(
            model=MODEL,
            contents=[img_part, prompt],
            config=types.GenerateContentConfig(
                media_resolution=tier,
                temperature=0,
            ),
        )
        result[name] = resp.usage_metadata.prompt_token_count
    return result
 
with open("sample_wallpaper.jpg", "rb") as f:
    print(measure_tokens(f.read(), "Answer this image's category in one word."))

Run this across a handful of representative images and the gap between tiers becomes obvious at a glance. In my pipeline, input tokens per image differed by a factor of several between the low and high tiers (the absolute values shift with model and image size, so always measure in your own environment). Once you're processing thousands of images a day in batch, that multiplier shows up directly on the bill.
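To compare tiers across several images rather than one, the per-image results can be averaged. The following is a minimal sketch, assuming each entry in the list is a dict shaped like the return value of measure_tokens above. The helper name summarize_tiers is hypothetical, and the numbers are placeholders for illustration, not my measurements.

```python
from statistics import mean


def summarize_tiers(measurements: list[dict[str, int]]) -> dict[str, dict[str, float]]:
    """Average prompt tokens per tier and express each tier relative to "low"."""
    tiers = list(measurements[0].keys())
    averages = {t: mean(m[t] for m in measurements) for t in tiers}
    baseline = averages["low"]
    return {
        t: {"mean_tokens": averages[t], "ratio_to_low": averages[t] / baseline}
        for t in tiers
    }


# Placeholder values only; replace with the dicts returned by measure_tokens().
sample = [
    {"low": 100, "medium": 200, "high": 400},
    {"low": 120, "medium": 240, "high": 480},
]
summary = summarize_tiers(sample)
for tier, stats in summary.items():
    print(f"{tier:>6}: {stats['mean_tokens']:.0f} tokens, x{stats['ratio_to_low']:.2f} vs low")
```

The ratio column is the part worth watching: it is the multiplier that carries over to the bill once the job runs at batch scale.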

Why change only the tier when measuring? Because if you also change the prompt or the output schema at the same time, you can't isolate which factor moved the number. Move one variable at a time. It sounds tedious, but this is the single principle that saved me the most time in cost investigations.

Where Accuracy Drops — Measure Sensitivity per Task

Fewer tokens are meaningless if the classification accuracy you care about collapses. So prepare a small labeled validation set and measure accuracy while varying the tier. The crucial point: sensitivity to resolution differs completely depending on the nature of the task.

import json
 
def classify(image_bytes: bytes, tier) -> str:
    img_part = types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg")
    resp = client.models.generate_content(
        model=MODEL,
        contents=[img_part, "Answer with one of: nature, abstract, city, animal, minimal"],
        config=types.GenerateContentConfig(
            media_resolution=tier,
            temperature=0,
            response_mime_type="application/json",
            response_schema={
                "type": "object",
                "properties": {"category": {"type": "string"}},
                "required": ["category"],
            },
        ),
    )
    return json.loads(resp.text)["category"]
 
def accuracy_by_tier(labeled: list[tuple[bytes, str]]) -> dict[str, float]:
    scores = {}
    for name, tier in TIERS.items():
        correct = sum(classify(img, tier) == gold for img, gold in labeled)
        scores[name] = correct / len(labeled)
    return scores

In my validation, "broad category classification" — decided by composition and color — scored nearly the same at the low tier as at the high tier. By contrast, detecting a small watermark burned into an image, or distinguishing the fine differences between very similar geometric patterns, lost visible accuracy as I lowered the tier.

The guidance that follows is simple: tasks whose answers are determined by the big picture can use a low tier; tasks that require reading fine detail need a high tier. And because that boundary shifts with your category taxonomy and image tendencies, don't decide it by assumption — always confirm it against a validation set.
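One way to turn that guidance into a repeatable rule is to pick the cheapest tier whose accuracy stays within a tolerance of the high tier. This is a sketch under assumptions: the function name lowest_viable_tier and the max_drop tolerance are hypothetical, the input is a dict like the one accuracy_by_tier returns, and the scores below are placeholders rather than my results.

```python
TIER_ORDER = ["low", "medium", "high"]  # cheapest first


def lowest_viable_tier(scores: dict[str, float], max_drop: float = 0.02) -> str:
    """Return the cheapest tier whose accuracy is within max_drop of the high tier."""
    reference = scores["high"]
    for tier in TIER_ORDER:
        if reference - scores[tier] <= max_drop:
            return tier
    return "high"


# Placeholder accuracy tables, one per task (illustrative only).
broad_category = {"low": 0.95, "medium": 0.96, "high": 0.96}
watermark = {"low": 0.60, "medium": 0.80, "high": 0.93}
pattern_dedup = {"low": 0.70, "medium": 0.90, "high": 0.91}

print(lowest_viable_tier(broad_category))  # big-picture task
print(lowest_viable_tier(watermark))       # fine-detail task
print(lowest_viable_tier(pattern_dedup))   # in-between task
```

The tolerance is a product decision, not a technical one: how much accuracy a task can lose depends on what a wrong answer costs downstream.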

Before / After — From Flat HIGH to Per-Task Assignment

My original code processed every image uniformly at the highest resolution. It's easy to implement, but it meant paying top-tier tokens even for tasks that the big picture already answers.

# Before: process every task uniformly at HIGH (easy, but input tokens are excessive)
def classify_all(image_bytes: bytes, prompt: str) -> str:
    img_part = types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg")
    resp = client.models.generate_content(
        model=MODEL,
        contents=[img_part, prompt],
        config=types.GenerateContentConfig(
            media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
        ),
    )
    return resp.text

Once validation tells you "how low each task can go," assign a tier per task type.

# After: assign the minimum viable tier per task type
TASK_TIER = {
    "category": types.MediaResolution.MEDIA_RESOLUTION_LOW,    # decided by big picture
    "color_tags": types.MediaResolution.MEDIA_RESOLUTION_LOW,  # dominant color extraction
    "watermark": types.MediaResolution.MEDIA_RESOLUTION_HIGH,  # needs fine detail
    "pattern_dedup": types.MediaResolution.MEDIA_RESOLUTION_MEDIUM,
}
 
def classify_for(task: str, image_bytes: bytes, prompt: str) -> str:
    tier = TASK_TIER.get(task, types.MediaResolution.MEDIA_RESOLUTION_MEDIUM)
    img_part = types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg")
    resp = client.models.generate_content(
        model=MODEL,
        contents=[img_part, prompt],
        config=types.GenerateContentConfig(media_resolution=tier),
    )
    return resp.text

With this switch, the "broad category classification" that makes up most of my pipeline moved to the low tier, and I cut that stage's input tokens substantially while keeping the validation-set accuracy. The highest tier stays only on the small set of tasks that genuinely need detail. The real nature of cost reduction wasn't "trim everything uniformly" but "pay for a high tier only where it's worth paying."
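The size of the saving can be estimated before switching by combining the measured tokens per tier with the daily call volume per task. The sketch below assumes placeholder numbers throughout: the token counts, the call volumes, and the helper name daily_input_tokens are all hypothetical, and the tier names are plain strings that mirror the TASK_TIER mapping above.

```python
def daily_input_tokens(
    calls_per_day: dict[str, int],
    task_tier: dict[str, str],
    tokens_per_tier: dict[str, int],
) -> int:
    """Total input tokens per day given a tier assignment per task."""
    return sum(
        calls * tokens_per_tier[task_tier[task]]
        for task, calls in calls_per_day.items()
    )


# Placeholder inputs; substitute your own measurements and volumes.
tokens_per_tier = {"low": 100, "medium": 200, "high": 400}
calls_per_day = {"category": 1000, "color_tags": 1000, "watermark": 100, "pattern_dedup": 200}

all_high = {task: "high" for task in calls_per_day}
assigned = {
    "category": "low",
    "color_tags": "low",
    "watermark": "high",
    "pattern_dedup": "medium",
}

flat_high = daily_input_tokens(calls_per_day, all_high, tokens_per_tier)
per_task = daily_input_tokens(calls_per_day, assigned, tokens_per_tier)
print(f"flat HIGH: {flat_high}, per-task: {per_task}, saved: {1 - per_task / flat_high:.0%}")
```

Because the high-volume tasks are the ones that move to the low tier, most of the reduction comes from them, while the rare detail tasks barely affect the total.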

Handling Mixed Needs — Splitting Requests as the Practical Answer

For a single image, you may have several simultaneous demands: "the broad category can be coarse, but I want to look closely for a watermark." The naive move is to do it all in one request, but when media_resolution is set in the request config, as in this article's code, it applies to every image in that request. The resolution therefore gets pulled up to the strictest task, and the whole request runs at a high tier.

The practical answer I settled on is to split requests by the granularity of the demand. First do the cheap broad classification at a low tier, then send only the images flagged as "likely to have a watermark" through a high-tier inspection. Most images are finalized in the cheap first stage, so the number of high-tier calls drops dramatically overall.

def two_stage(image_bytes: bytes) -> dict:
    # Stage 1: low tier for broad classification and "does this need detail?"
    # The prompt asks for an explicit marker so the check below has something to match.
    coarse = classify_for(
        "category",
        image_bytes,
        "Classify the category. If text or a watermark may be present, "
        "also append the exact token watermark_suspected.",
    )
    # resp.text can be None when the model returns no text, so guard the lookup.
    needs_detail = "watermark_suspected" in (coarse or "")
    result = {"coarse": coarse}
    # Stage 2: high tier only on the images that need it
    if needs_detail:
        result["detail"] = classify_for("watermark", image_bytes, "Describe whether a watermark is present and where.")
    return result

This "cheap and wide, then expensive and narrow" two-stage shape also pairs well with carving heavy reasoning out into a later verification or grading step, and it became my default form for multimodal cost design.
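Whether the split pays off depends on how many images reach the second stage. As a rough model, every image costs one low-tier call, and only the flagged fraction adds a high-tier call. The sketch below uses hypothetical function names and placeholder token counts, and it ignores the small extra text tokens in the two prompts.

```python
def two_stage_tokens_per_image(low_tokens: float, high_tokens: float, flagged_rate: float) -> float:
    """Expected input tokens per image: always stage 1, stage 2 only when flagged."""
    return low_tokens + flagged_rate * high_tokens


def break_even_flag_rate(low_tokens: float, high_tokens: float) -> float:
    """Flag rate below which two stages cost less than one flat high-tier call."""
    return 1 - low_tokens / high_tokens


# Placeholder numbers for illustration only.
low, high = 100, 400
expected = two_stage_tokens_per_image(low, high, flagged_rate=0.1)
print(f"two-stage: {expected:.0f} tokens/image vs flat high: {high}")
print(f"break-even flag rate: {break_even_flag_rate(low, high):.0%}")
```

If the first stage flags more images than the break-even rate, a single high-tier request is the cheaper design, so it is worth measuring the flag rate on real data before committing to the split.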

Pitfalls I Hit in Operation, and How I Handled Them

A few things tripped me up when actually switching over. First, changing the tier also changes how granular the model's output is, so always run the parsing side through validation. The moment I dropped to a low tier, category names occasionally varied in wording; pinning the structure with response_schema stabilized it.
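Even with a schema in place, it helps to validate the parsed value before it reaches the database. The following is a minimal sketch of that parsing-side check, assuming the five categories from the classification prompt above. The function name parse_category and the choice to return None for unusable replies are my own illustration, not part of the SDK.

```python
import json
from typing import Optional

ALLOWED = {"nature", "abstract", "city", "animal", "minimal"}


def parse_category(raw_text: str) -> Optional[str]:
    """Return a validated category, or None if the reply is unusable."""
    try:
        value = json.loads(raw_text)["category"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if not isinstance(value, str):
        return None
    # Absorb casing and whitespace drift, then reject anything off the list.
    normalized = value.strip().lower()
    return normalized if normalized in ALLOWED else None


print(parse_category('{"category": " Nature "}'))
print(parse_category('{"category": "landscape"}'))
```

Counting how often this returns None per tier is a cheap signal in its own right: a jump after lowering the tier suggests the output format, not just the accuracy, has shifted.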

Second, pin the model version. When a default or alias is being moved to a newer model, the same tier can shift in both token conversion and output tendencies. Specify the version explicitly, and before you upgrade, re-measure against the validation set.
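The re-measurement step can be reduced to a comparison between the pinned version's numbers and the candidate's. This is a sketch under assumptions: the function name migration_report, the thresholds, and all figures are hypothetical placeholders, and each entry holds the mean input tokens and validation accuracy for one tier.

```python
def migration_report(
    baseline: dict[str, dict[str, float]],
    candidate: dict[str, dict[str, float]],
    max_token_increase: float = 0.10,
    max_accuracy_drop: float = 0.01,
) -> list[str]:
    """List the tiers where the candidate model costs more or scores lower."""
    problems = []
    for tier, old in baseline.items():
        new = candidate[tier]
        token_change = new["tokens"] / old["tokens"] - 1
        accuracy_drop = old["accuracy"] - new["accuracy"]
        if token_change > max_token_increase:
            problems.append(f"{tier}: tokens up {token_change:.0%}")
        if accuracy_drop > max_accuracy_drop:
            problems.append(f"{tier}: accuracy down {accuracy_drop:.2f}")
    return problems


# Placeholder metrics for the pinned and the candidate model versions.
baseline = {"low": {"tokens": 100, "accuracy": 0.95}, "high": {"tokens": 400, "accuracy": 0.93}}
candidate = {"low": {"tokens": 130, "accuracy": 0.95}, "high": {"tokens": 400, "accuracy": 0.90}}
problems = migration_report(baseline, candidate)
print(problems or "safe to migrate")
```

An empty list means the tier assignments carry over as they are; anything else means the affected tasks should be re-validated before the version bump.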

Third, measure on data close to production. Even if a low tier is perfect on clean sample images, real user submissions vary in composition and brightness. Deliberately mix "hard-to-judge real data" into your validation set to prevent accuracy from dropping in production.
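Mixing hard cases in deliberately is easier to keep honest when the ratio is fixed in code. Below is a minimal sketch, assuming two pre-labeled pools of (image_id, label) pairs. The function name build_validation_set, the hard_fraction default, and the sample data are hypothetical.

```python
import random


def build_validation_set(
    clean: list[tuple[str, str]],
    hard: list[tuple[str, str]],
    size: int,
    hard_fraction: float = 0.3,
    seed: int = 0,
) -> list[tuple[str, str]]:
    """Sample a validation set with a fixed share of hard-to-judge examples."""
    rng = random.Random(seed)  # fixed seed keeps the set stable between runs
    n_hard = min(len(hard), round(size * hard_fraction))
    n_clean = min(len(clean), size - n_hard)
    picked = rng.sample(hard, n_hard) + rng.sample(clean, n_clean)
    rng.shuffle(picked)
    return picked


# Placeholder pools; in practice the hard pool comes from real user submissions.
clean_pool = [(f"clean_{i}", "nature") for i in range(20)]
hard_pool = [(f"hard_{i}", "abstract") for i in range(10)]
validation = build_validation_set(clean_pool, hard_pool, size=10)
print(sum(name.startswith("hard_") for name, _ in validation), "hard examples of", len(validation))
```

Keeping the seed fixed matters here: if the validation set changes between runs, a tier comparison can move for reasons that have nothing to do with the tier.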

Which Tier, When — My Assignment Guidance

To summarize, here are the criteria I use in operation. Tasks answered by the big picture (composition, dominant color, rough subject) start from the low tier. Tasks needing to tell apart near-identical items, or to read fine text and watermarks, go to the high tier. In between, tasks like pattern-duplicate detection that "need some detail but not the maximum" sit at the medium tier. And every boundary is fixed by validation-set accuracy, not by assumption.

As a next step, run this article's measurement harness on a few dozen of your own representative images and put the per-tier "tokens" and "accuracy" into a single table. With that table, you can see in numbers how far each task can be lowered. The cost-versus-quality tradeoff is a domain you can decide by measurement rather than intuition.
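A plain-text table is enough for that comparison. The sketch below assumes two dicts keyed by tier name, one of mean input tokens and one of validation accuracy. The function name format_tier_table and the values are placeholders for illustration.

```python
def format_tier_table(tokens: dict[str, float], accuracy: dict[str, float]) -> str:
    """Render per-tier tokens and accuracy as an aligned plain-text table."""
    header = f"{'tier':<8}{'tokens':>10}{'accuracy':>10}"
    rows = [header, "-" * len(header)]
    for tier in tokens:
        rows.append(f"{tier:<8}{tokens[tier]:>10.0f}{accuracy[tier]:>10.1%}")
    return "\n".join(rows)


# Placeholder values; fill in from your own harness runs.
tokens = {"low": 100, "medium": 200, "high": 400}
accuracy = {"low": 0.95, "medium": 0.96, "high": 0.96}
table = format_tier_table(tokens, accuracy)
print(table)
```

Producing one such table per task makes the assignment decision visible: the right tier is the first row where accuracy stops improving meaningfully.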

If you run image-heavy workloads too, I hope this gives you a reason to revisit those hard-to-see input tokens.
