API / SDK | 2026-06-23 | Advanced

Generating a Thumbnail From a Video With Nano Banana 2 (gemini-3.1-flash-image)

A hands-on guide to passing a whole video as context to the GA model gemini-3.1-flash-image (Nano Banana 2) and generating a single thumbnail. Covers how it differs from frame extraction, the preview-to-GA migration, and measured cost and time per image.

Tags: gemini, gemini-api, nano-banana, image-generation, multimodal

Cutting thumbnails out of videos by hand was quietly eating my afternoons. As an indie developer running several app-intro clips and short explainers, I spent roughly ten minutes per clip hunting for a "good-looking frame," trimming it, and preparing a base for text, all before any real work began. Ten clips, and the afternoon was gone.

On June 22, gemini-3.1-flash-image (nicknamed Nano Banana 2) reached GA, and with it the ability to pass a video file itself as multimodal context and generate a thumbnail, poster, or infographic. Instead of picking one frame and handing it over, you let the model read the whole video as context and ask it to "make a still that represents this video." I tried it on a few of my own clips half-expecting noise, and the single image it returned captured the subject of the video better than I expected. Here's the implementation, plus the things that tripped me up putting it into a real workflow.

From "pick a frame" to "pass the video as context"

Until now, making a thumbnail from a video was a two-step affair: I (or ffmpeg) picked a representative frame, then handed that still to a vision model. The weak link is frame selection. The moments a human finds compelling — the peak of motion, the instant text appears — don't reliably surface from brightness or sharpness scores alone.

Video input on gemini-3.1-flash-image collapses that step. Pass the video as context, and the model generates a "symbolic" still informed by the flow over time. Note that it does not return an existing frame — it draws a new image. So the output is a generated image that represents the video's theme, not a real frame from inside it. If you need a faithful copy of live-action footage, ffmpeg frame extraction is the right tool. But for social thumbnails or posters where you just want "a single image that conveys the vibe," generating is dramatically faster in practice.
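For the live-action side of that split, here is a minimal sketch of pulling one real frame with ffmpeg. The helper name frame_extract_command, the timestamp, and the file names are hypothetical; the flags (-ss, -i, -frames:v) are standard ffmpeg options, and the command is only built here, not executed.

```python
import shlex


def frame_extract_command(video_path: str, at_seconds: float, out_path: str) -> list[str]:
    """Build an ffmpeg argv that copies one real frame out of the video."""
    if at_seconds < 0:
        raise ValueError("at_seconds must be non-negative")
    return [
        "ffmpeg",
        "-y",                         # overwrite the output file if it exists
        "-ss", f"{at_seconds:.3f}",   # seek to the timestamp before decoding
        "-i", video_path,
        "-frames:v", "1",             # write exactly one video frame
        out_path,
    ]


cmd = frame_extract_command("intro_clip.mp4", 12.5, "frame_0012.png")
print(shlex.join(cmd))
# To actually run it (requires ffmpeg on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Which timestamp to pass is still the hard part, which is exactly the frame-selection problem described above.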

This split echoes the line I drew in my earlier write-up on generating wallpaper color variants with Gemini 3.1 Flash image output: don't alter live-action, treat generation as generation.

A minimal setup: video in, one image out

The fastest way to get a feel for it is to run it. Upload the video with the Files API, then include its reference as context in an image-generation request.

Upload the video

# pip install google-genai
import time

from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

# Upload the video via the Files API (assume a clip of a few dozen seconds)
video = client.files.upload(file="intro_clip.mp4")

# Right after upload the file may still be PROCESSING; poll until it leaves that state
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = client.files.get(name=video.name)

# Any state other than ACTIVE (for example FAILED) means the file cannot be used
if video.state.name != "ACTIVE":
    raise RuntimeError(f"upload failed: {video.state.name}")

print("uploaded:", video.name)  # files/xxxxxxxx

If you skip ahead before the file is ACTIVE, the generation call fails with a "file not usable yet" error. I hit this first, so don't drop the wait loop.
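The loop above waits indefinitely if a file never leaves PROCESSING. One way to bound it is sketched below. The helper wait_until_active and its parameters are hypothetical, not part of the SDK; the state lookup is passed in as a callable so the logic can be exercised without a network call.

```python
import time
from typing import Callable


def wait_until_active(
    get_state: Callable[[], str],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until it stops returning "PROCESSING" or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state != "PROCESSING":
            return state  # caller still has to check that this is "ACTIVE"
        if clock() >= deadline:
            raise TimeoutError(f"file still PROCESSING after {timeout_s} seconds")
        sleep(interval_s)


# Offline demo with a scripted sequence of states and a no-op sleep.
scripted = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
final_state = wait_until_active(lambda: next(scripted), sleep=lambda s: None)
print(final_state)
```

With the client from the upload snippet, the lookup would presumably be `lambda: client.files.get(name=video.name).state.name`, followed by the same ACTIVE check as before.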

Generate an image with the video as context

from google.genai import types
 
resp = client.models.generate_content(
    model="gemini-3.1-flash-image",  # GA model; do NOT append a -preview suffix
    contents=[
        video,  # pass the video itself as context
        (
            "Generate one thumbnail image that represents this video. "
            "Use a 16:9 aspect ratio, place the main subject in the center, "
            "and leave headroom at the top for a short text overlay. "
            "Convey the mood rather than copying live-action footage."
        ),
    ],
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE"],
    ),
)

Save the generated image

saved = 0
for part in resp.candidates[0].content.parts:
    if getattr(part, "inline_data", None) and part.inline_data.data:
        with open(f"thumb_{saved}.png", "wb") as f:
            f.write(part.inline_data.data)
        saved += 1
 
print(f"saved {saved} image(s)")  # expected: saved 1 image(s)

Three things matter. First, use the GA model name gemini-3.1-flash-image with no -preview. Second, include IMAGE in response_modalities — when you get text back and no image, this setting is almost always the cause. Third, spell out "headroom," "aspect ratio," and "don't copy live-action" in the prompt. If you'll add text later, asking for top headroom saves you work downstream.
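When text comes back instead of an image, it helps to look at both kinds of parts rather than silently saving nothing. The sketch below separates them and picks a file extension from the part's MIME type instead of assuming PNG. The helper split_parts and the extension table are hypothetical; the demo uses stand-in objects shaped like the inline_data and text attributes used in the save loop above.

```python
from types import SimpleNamespace

# Assumed mapping; extend it if other image types show up in practice.
_EXT_BY_MIME = {"image/png": ".png", "image/jpeg": ".jpg", "image/webp": ".webp"}


def split_parts(parts):
    """Return (images, texts): images as (extension, bytes) pairs, texts as strings."""
    images, texts = [], []
    for part in parts or []:
        blob = getattr(part, "inline_data", None)
        if blob is not None and getattr(blob, "data", None):
            ext = _EXT_BY_MIME.get(getattr(blob, "mime_type", None), ".bin")
            images.append((ext, blob.data))
        elif getattr(part, "text", None):
            texts.append(part.text)
    return images, texts


# Stand-in parts for an offline demo; real parts come from resp.candidates[0].content.parts.
fake_parts = [
    SimpleNamespace(text="Here is your thumbnail.", inline_data=None),
    SimpleNamespace(text=None, inline_data=SimpleNamespace(mime_type="image/jpeg", data=b"\xff\xd8")),
]
images, texts = split_parts(fake_parts)
print(len(images), "image(s),", len(texts), "text part(s)")
```

If images comes back empty, printing texts usually shows what the model said instead, which makes the cause easier to spot.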

What I watched for moving from preview to GA

The part that took the most care this time wasn't the new code — it was the migration. gemini-3.1-flash-image-preview and gemini-3-pro-image-preview are scheduled to shut down on June 25. Leave a preview reference in place and your pipeline goes quiet on that date.

Here's what I checked:

Drop -preview from the model name. I swept hardcoded strings with grep -rn "image-preview" . and consolidated them into a single constant.
Confirm the response shape against real output. Anywhere you parse on schema assumptions can differ between preview and GA.
Embed the shutdown date in CI as a deadline. Dated deprecations cause incidents on the exact day when overlooked.
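The three checks above can be sketched as a small script suitable for CI. Everything here is illustrative: the names IMAGE_MODEL, find_preview_references, and days_until_shutdown are hypothetical, the regular expression is one possible stand-in for the grep sweep, and the year 2026 is assumed from this article's publication date.

```python
import datetime
import pathlib
import re
import tempfile

IMAGE_MODEL = "gemini-3.1-flash-image"          # the single constant to import everywhere
PREVIEW_SHUTDOWN = datetime.date(2026, 6, 25)   # year is an assumption
PREVIEW_PATTERN = re.compile(r"gemini-[\w.\-]*image-preview")


def find_preview_references(root, suffixes=(".py",)):
    """Return (path, line number, line) for every leftover preview model name."""
    hits = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), 1):
            if PREVIEW_PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


def days_until_shutdown(today=None):
    """Days left before the preview models stop responding (negative once past)."""
    today = today or datetime.date.today()
    return (PREVIEW_SHUTDOWN - today).days


# Offline demo on a throwaway directory with one stale and one migrated file.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "old.py").write_text('MODEL = "gemini-3.1-flash-image-preview"\n', encoding="utf-8")
    (root / "new.py").write_text(f'MODEL = "{IMAGE_MODEL}"\n', encoding="utf-8")
    hits = find_preview_references(root)

print([(pathlib.Path(p).name, n) for p, n, _ in hits])
print(days_until_shutdown(datetime.date(2026, 6, 23)))
```

In CI, a non-empty hits list could fail the build, so a stale preview name is caught before the shutdown date rather than on it.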

I wrote up this "shutdown-date-driven migration" approach in more detail in a migration checklist for the June 25 image preview model shutdown. With a hard deadline, separating what changed from what stayed the same and recording it as you go ends up faster than rushing to fix everything at once — that's been my experience.

Cost and time, roughly measured

The real question is "how much per video?" Here are approximate figures from running about 20 clips of a few dozen seconds each. Pricing changes, so confirm exact numbers on the official pricing page.

Method	Work per clip	Approx. cost per image	Approx. time per clip
Manual (ffmpeg + hand-pick + base)	Human selects and trims a frame	$0 (but your time)	~8-10 min
Frame extract + vision	Auto-pick a frame, describe it	Image input tokens	~20-40 sec
Video to image (this article)	Pass the video, generate one	roughly $0.03-$0.07	~15-30 sec
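To turn the per-image range in the table into a monthly figure, a back-of-the-envelope sketch is enough. The function name and the retry_rate parameter are hypothetical; the default bounds are the rough $0.03-$0.07 range measured above, not official pricing.

```python
def estimate_monthly_cost(clips_per_month, low=0.03, high=0.07, retry_rate=0.0):
    """Return (low, high) monthly spend in dollars for one thumbnail per clip.

    retry_rate is the assumed fraction of clips that need one regeneration.
    """
    if clips_per_month < 0 or retry_rate < 0:
        raise ValueError("inputs must be non-negative")
    images = clips_per_month * (1 + retry_rate)
    return round(images * low, 2), round(images * high, 2)


print(estimate_monthly_cost(20))   # (0.6, 1.4)
print(estimate_monthly_cost(100, retry_rate=0.25))
```

Swap in current numbers from the official pricing page before relying on the result.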

Video consumes more input tokens than an image, so passing long videos as-is drives cost up. I trimmed long footage to the key 30-60 seconds before sending, which kept each image under about 7 cents. At $0.03-$0.07 per image, 20 clips comes to roughly $0.60-$1.40 before any regeneration retries, and an afternoon-long chore shrank to roughly ten minutes. The biggest win was simply that the manual time disappeared: per the table, hands-on time drops from 8-10 minutes to 15-30 seconds per clip, roughly a 16-40x reduction.
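Trimming before upload can be done with ffmpeg as well. The sketch below only builds the command; the helper name trim_command and the file names are hypothetical. It uses stream copy (-c copy), which is fast but cuts on keyframes, so the start point may be off by a little. Re-encoding would be more precise at the cost of time.

```python
def trim_command(src: str, start_s: float, duration_s: float, dst: str) -> list[str]:
    """Build an ffmpeg argv that keeps duration_s seconds starting at start_s."""
    if start_s < 0 or duration_s <= 0:
        raise ValueError("start_s must be >= 0 and duration_s must be > 0")
    return [
        "ffmpeg",
        "-y",                       # overwrite the output file if it exists
        "-ss", f"{start_s:.3f}",    # where the kept segment begins
        "-i", src,
        "-t", f"{duration_s:.3f}",  # how long the kept segment is
        "-c", "copy",               # copy streams without re-encoding
        dst,
    ]


trim = trim_command("long_demo.mp4", 45, 30, "long_demo_trimmed.mp4")
print(" ".join(trim))
```

The trimmed file is then what goes to the Files API upload in place of the original.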

Pitfalls I hit in real use

The minimal setup runs, but production surfaced a few caveats.

Aspect ratio drifts when left to the prompt. Even with "16:9," it occasionally returns something closer to square. Assume downstream cropping, specify "prioritize 16:9" firmly, and route unexpected sizes to a regeneration retry — that was the stable structure.
Long videos make you wait on upload and processing. The Files API needs to reach ACTIVE, and longer clips wait longer. Trimming to the essentials, as above, helps both cost and latency.
The generated image is not live-action. For anything that needs a person or product to look exactly right, don't ship the generated image as official material. Treat it strictly as a mood-conveying thumbnail.
Empty image? Suspect the modality config. It was almost always a missing response_modalities=["IMAGE"], or a lingering -preview so the model wasn't found and only an error string came back.
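For the aspect-ratio caveat, the check itself can be done without an imaging library when the output is a PNG, because width and height sit at fixed offsets in the file header. This sketch assumes PNG output (the save step above writes .png, but the returned format is worth confirming). The helper names and the 3% tolerance are hypothetical choices.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_roughly_16_9(width: int, height: int, tolerance: float = 0.03) -> bool:
    """True when width/height is within the relative tolerance of 16:9."""
    if width <= 0 or height <= 0:
        return False
    target = 16 / 9
    return abs(width / height - target) / target <= tolerance


# Offline demo: a hand-built 24-byte PNG header claiming 1280x720.
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 1280, 720)
print(png_size(header), is_roughly_16_9(*png_size(header)))
```

An image that fails this check is a candidate for the regeneration retry, and one that passes may still need a small crop to land on exact 16:9.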

Routing the unexpected into regeneration is, I think, the baseline pattern for putting any generative step into production. Don't aim for perfect on the first try; judge the output, and if it fails, rebuild it.
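That judge-then-rebuild pattern fits in a few lines. The sketch below is generic: generate and accept are hypothetical callables you supply (for example, one wrapping the generation request and one checking the aspect ratio), and max_attempts=2 mirrors a "regenerate once" policy.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_with_retry(
    generate: Callable[[], T],
    accept: Callable[[T], bool],
    max_attempts: int = 2,
) -> Optional[T]:
    """Call generate() up to max_attempts times; return the first accepted result."""
    for _ in range(max_attempts):
        candidate = generate()
        if accept(candidate):
            return candidate
    return None  # nothing passed; the caller decides whether to fall back or alert


# Offline demo: the first "image" is square, the second is 16:9.
outputs = iter([(1024, 1024), (1280, 720)])
result = generate_with_retry(lambda: next(outputs), lambda wh: wh[0] * 9 == wh[1] * 16)
print(result)
```

Returning None instead of raising keeps the decision about a fallback, such as a manually picked frame, with the caller.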

Where it pays to slot this in

For my own split, I settled on generation for thumbnails and posters that don't need faithful reproduction, and traditional frame extraction for assets that need live-action accuracy. Social video thumbnails are the former, so Nano Banana 2's video-to-image fits best there. Conversely, images that need pixel accuracy — like app screenshots — are safer kept away from generation.

To run it routinely, keep one short pipeline ready for when a video finishes: "upload, generate one, regenerate once if not 16:9, save." Thumbnails stop being a thing you wait on. The more channels you run in parallel as an indie developer, the more automating this prep work earns its keep.

Start with a single 30-second clip and run the minimal setup above as-is. Checking with your own eyes whether the returned image captures the video's theme is the quickest way to judge whether it suits your use case.

Thanks for reading. I hope it helps anyone else losing time to the prep work around video.
