Cutting thumbnails out of videos by hand was quietly eating my afternoons. As an indie developer I maintain several app-intro clips and short explainers, and each one meant hunting for a "good-looking frame," trimming it, and preparing a base for text: roughly ten minutes per clip before any real work began. Ten clips, and the afternoon was gone.
On June 22, gemini-3.1-flash-image (nicknamed Nano Banana 2) reached GA, and with it came the ability to pass a video file itself as multimodal context and generate a thumbnail, poster, or infographic. Instead of picking one frame and handing it over, you let the model read the whole video as context and ask it to "make a still that represents this video." I tried it on a few of my own clips half-expecting noise, and the single image it returned captured the subject of the video better than I expected. Here's the implementation, plus the things that tripped me up putting it into a real workflow.
From "pick a frame" to "pass the video as context"
Until now, making a thumbnail from a video was a two-step affair: I (or ffmpeg) picked a representative frame, then handed that still to a vision model. The weak link is frame selection. The moments a human finds compelling — the peak of motion, the instant text appears — don't reliably surface from brightness or sharpness scores alone.
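To make the weak link concrete, here is a minimal sketch of a score-based frame picker. It assumes the frames are already decoded into 2D lists of grayscale values (0 to 255), and the names `sharpness` and `pick_sharpest` are made up for illustration. The score is a crude edge-energy measure, not what any particular tool uses.

```python
def sharpness(gray):
    """Mean absolute difference between neighbouring pixels (crude edge energy)."""
    height, width = len(gray), len(gray[0])
    total = 0
    count = 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:  # horizontal neighbour
                total += abs(gray[y][x] - gray[y][x + 1])
                count += 1
            if y + 1 < height:  # vertical neighbour
                total += abs(gray[y][x] - gray[y + 1][x])
                count += 1
    return total / count if count else 0.0


def pick_sharpest(frames):
    """Return the index of the frame with the highest sharpness score."""
    return max(range(len(frames)), key=lambda i: sharpness(frames[i]))
```

A picker like this will happily choose a crisp but dull frame over a slightly blurred one at the peak of the action, which is exactly why score-only selection falls short.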
Video input on gemini-3.1-flash-image collapses that step. Pass the video as context, and the model generates a "symbolic" still informed by how the video unfolds over time. Note that it does not return an existing frame; it draws a new image that represents the video's theme. If you need a faithful copy of live-action footage, ffmpeg frame extraction is the right tool. But for social thumbnails or posters where you just want "a single image that conveys the vibe," generating is much faster in practice.
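For the faithful-copy case, a frame grab with ffmpeg can be sketched as below. This assumes ffmpeg is installed and on your PATH; the helper names, the timestamp, and the file names are placeholders of mine, not part of any library.

```python
import subprocess


def build_frame_cmd(video_path, timestamp, out_path):
    """Build an ffmpeg command that writes one frame at `timestamp` to `out_path`."""
    return [
        "ffmpeg",
        "-y",               # overwrite the output file if it already exists
        "-ss", timestamp,   # seek before decoding (placed ahead of -i)
        "-i", video_path,
        "-frames:v", "1",   # write exactly one video frame
        "-q:v", "2",        # high JPEG quality (lower number = better)
        out_path,
    ]


def extract_frame(video_path, timestamp, out_path):
    """Run ffmpeg and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_frame_cmd(video_path, timestamp, out_path), check=True)
```

Calling `extract_frame("intro_clip.mp4", "00:00:05", "frame.jpg")` would save the frame at the five-second mark. You still have to choose the timestamp yourself, which is the step the video-as-context approach removes.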
This split echoes the line I drew in my earlier write-up on generating wallpaper color variants with Gemini 3.2 Flash image output: don't alter live-action, treat generation as generation.
A minimal setup: video in, one image out
The fastest way to get a feel for it is to run it. Upload the video with the Files API, then include its reference as context in an image-generation request.
Upload the video
# pip install google-genai
import time

from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

# Upload the video via the Files API (assume a clip of a few dozen seconds)
video = client.files.upload(file="intro_clip.mp4")

# Right after upload the file may be PROCESSING; wait for ACTIVE
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = client.files.get(name=video.name)

if video.state.name != "ACTIVE":
    raise RuntimeError(f"upload failed: {video.state.name}")

print("uploaded:", video.name)  # files/xxxxxxxx

If you skip ahead before the file is ACTIVE, the generation call fails with a "file not usable yet" error. I hit this first, so don't drop the wait loop.
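One caveat with a bare wait loop is that it never gives up if the file stays in PROCESSING. A minimal sketch of a bounded version follows; `wait_until_active`, its parameters, and the 120-second default are my own choices, and the state strings are the ones used above. The state lookup is passed in as a callable so the helper itself makes no API calls.

```python
import time


def wait_until_active(fetch_state, timeout_s=120.0, interval_s=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state() until it returns "ACTIVE".

    Raises RuntimeError on any state other than PROCESSING or ACTIVE,
    and TimeoutError if the file is still PROCESSING after timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state == "ACTIVE":
            return
        if state != "PROCESSING":
            raise RuntimeError(f"upload failed: {state}")
        if clock() >= deadline:
            raise TimeoutError(f"file still PROCESSING after {timeout_s}s")
        sleep(interval_s)
```

With the client from the upload step, this would be called as `wait_until_active(lambda: client.files.get(name=video.name).state.name)`.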
Generate an image with the video as context
from google.genai import types
resp = client.models.generate_content(
    model="gemini-3.1-flash-image",  # GA model; do NOT append a -preview suffix
    contents=[
        video,  # pass the video itself as context
        (
            "Generate one thumbnail image that represents this video. "
            "Use a 16:9 aspect ratio, place the main subject in the center, "
            "and leave headroom at the top for a short text overlay. "
            "Convey the mood rather than copying live-action footage."
        ),
    ],
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE"],
    ),
)
Save the generated image
saved = 0
for part in resp.candidates[0].content.parts:
    if getattr(part, "inline_data", None) and part.inline_data.data:
        with open(f"thumb_{saved}.png", "wb") as f:
            f.write(part.inline_data.data)
        saved += 1
print(f"saved {saved} image(s)")  # expected: saved 1 image(s)

Three things matter. First, use the GA model name gemini-3.1-flash-image with no -preview. Second, include IMAGE in response_modalities; when you get text back and no image, this setting is almost always the cause. Third, spell out "headroom," "aspect ratio," and "don't copy live-action" in the prompt. If you'll add text later, asking for top headroom saves you work downstream.
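The save step can also be wrapped so that a text-only response fails loudly instead of silently saving nothing. This is a minimal sketch: `save_image_parts` and the extension table are hypothetical helpers of mine, and it assumes each image part carries `inline_data.data` and `inline_data.mime_type`, falling back to .png when the MIME type is missing or unrecognized.

```python
from pathlib import Path

# Assumed mapping from MIME type to file extension; extend as needed.
EXTENSIONS = {"image/png": ".png", "image/jpeg": ".jpg", "image/webp": ".webp"}


def save_image_parts(parts, out_dir=".", stem="thumb"):
    """Write every image part to disk and return the list of saved paths."""
    paths = []
    texts = []
    for part in parts or []:
        inline = getattr(part, "inline_data", None)
        if inline is not None and inline.data:
            mime = getattr(inline, "mime_type", None) or ""
            path = Path(out_dir) / f"{stem}_{len(paths)}{EXTENSIONS.get(mime, '.png')}"
            path.write_bytes(inline.data)
            paths.append(path)
        elif getattr(part, "text", None):
            texts.append(part.text)  # keep any text for the error message
    if not paths:
        raise ValueError(
            "no image part in response; check response_modalities. "
            f"Model text: {' '.join(texts)[:200]!r}"
        )
    return paths
```

It would be called as `save_image_parts(resp.candidates[0].content.parts)`, and the error message points straight at the second of the three checks above.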