Gemini Lab
API / SDK · 2026-06-23 · Advanced

Generating a Thumbnail From a Video With Nano Banana 2 (gemini-3.1-flash-image)

A hands-on guide to passing a whole video as context to the GA model gemini-3.1-flash-image (Nano Banana 2) and generating a single thumbnail. Covers how it differs from frame extraction, the preview-to-GA migration, and measured cost and time per image.


Cutting thumbnails out of videos by hand was quietly eating my afternoons. As an indie developer I maintain several app-intro clips and short explainers, and each one meant hunting for a "good-looking frame," trimming it, and preparing a base for text: roughly ten minutes per clip before any real work began. Ten clips, and the afternoon was gone.

On June 22, gemini-3.1-flash-image (nicknamed Nano Banana 2) reached GA, and with it came the ability to pass a video file itself as multimodal context and generate a thumbnail, poster, or infographic. Instead of picking one frame and handing it over, you let the model read the whole video as context and ask it to "make a still that represents this video." I tried it on a few of my own clips half-expecting noise, and the single image it returned captured each video's subject better than I expected. Here's the implementation, plus the things that tripped me up putting it into a real workflow.

From "pick a frame" to "pass the video as context"

Until now, making a thumbnail from a video was a two-step affair: I (or ffmpeg) picked a representative frame, then handed that still to a vision model. The weak link is frame selection. The moments a human finds compelling — the peak of motion, the instant text appears — don't reliably surface from brightness or sharpness scores alone.

Video input on gemini-3.1-flash-image collapses that step. Pass the video as context, and the model generates a "symbolic" still informed by the flow over time. Note that it does not return an existing frame; it draws a new image that represents the video's theme. If you need a faithful copy of live-action footage, ffmpeg frame extraction is the right tool. But for social thumbnails or posters where you just want "a single image that conveys the vibe," generating is dramatically faster in practice.
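For the faithful-copy case, here is a minimal sketch of the ffmpeg route for comparison. The helper name ffmpeg_frame_cmd, the timestamp, and the file names are placeholders of mine, not part of any pipeline described here. The flags themselves (-ss to seek, -frames:v 1 to write a single frame) are standard ffmpeg options.

```python
import shlex


def ffmpeg_frame_cmd(video_path: str, timestamp_s: float, out_path: str) -> list[str]:
    """Build an ffmpeg argv that writes one frame at (or just after) timestamp_s."""
    return [
        "ffmpeg",
        "-y",                         # overwrite out_path if it already exists
        "-ss", f"{timestamp_s:.3f}",  # seek before opening the input
        "-i", video_path,
        "-frames:v", "1",             # write exactly one video frame
        out_path,
    ]


cmd = ffmpeg_frame_cmd("intro_clip.mp4", 12.5, "frame.png")
print(shlex.join(cmd))

# To actually run it (requires ffmpeg on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

You still have to choose the timestamp yourself, which is exactly the step that video-as-context generation removes.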

This split echoes the line I drew in my earlier write-up on generating wallpaper color variants with Gemini 3.2 Flash image output: don't alter live-action, treat generation as generation.

A minimal setup: video in, one image out

The fastest way to get a feel for it is to run it. Upload the video with the Files API, then include its reference as context in an image-generation request.

Upload the video

# pip install google-genai
import time

from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

# Upload the video via the Files API (assume a clip of a few dozen seconds)
video = client.files.upload(file="intro_clip.mp4")

# Right after upload the file may be PROCESSING; wait for ACTIVE
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = client.files.get(name=video.name)

if video.state.name != "ACTIVE":
    raise RuntimeError(f"upload failed: {video.state.name}")

print("uploaded:", video.name)  # files/xxxxxxxx

If you skip ahead before the file is ACTIVE, the generation call fails with a "file not usable yet" error. I hit this first, so don't drop the wait loop.
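In a batch job, a loop that waits indefinitely can hang on a file that never leaves PROCESSING. One way to guard against that is to wrap the wait in a helper with a deadline. This is a sketch, not part of the SDK: wait_until_active and its timeout_s and poll_s defaults are my own assumptions, and it relies only on the client.files.get(name=...) call and the state names already used in the upload step.

```python
import time


def wait_until_active(client, file, timeout_s: float = 120.0, poll_s: float = 2.0):
    """Poll client.files.get until the file leaves PROCESSING or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while file.state.name == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{file.name} still PROCESSING after {timeout_s}s")
        time.sleep(poll_s)
        file = client.files.get(name=file.name)  # refresh the file's state
    if file.state.name != "ACTIVE":
        # Any terminal state other than ACTIVE means the file is not usable
        raise RuntimeError(f"upload failed: {file.state.name}")
    return file
```

Usage would look like video = wait_until_active(client, client.files.upload(file="intro_clip.mp4")). The 120-second default is a guess for short clips; tune it to the video lengths you actually upload.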

Generate an image with the video as context

from google.genai import types
 
resp = client.models.generate_content(
    model="gemini-3.1-flash-image",  # GA model; do NOT append a -preview suffix
    contents=[
        video,  # pass the video itself as context
        (
            "Generate one thumbnail image that represents this video. "
            "Use a 16:9 aspect ratio, place the main subject in the center, "
            "and leave headroom at the top for a short text overlay. "
            "Convey the mood rather than copying live-action footage."
        ),
    ],
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE"],
    ),
)

Save the generated image

saved = 0
for part in resp.candidates[0].content.parts:
    if getattr(part, "inline_data", None) and part.inline_data.data:
        with open(f"thumb_{saved}.png", "wb") as f:
            f.write(part.inline_data.data)
        saved += 1
 
print(f"saved {saved} image(s)")  # expected: saved 1 image(s)

Three things matter. First, use the GA model name gemini-3.1-flash-image with no -preview. Second, include IMAGE in response_modalities — when you get text back and no image, this setting is almost always the cause. Third, spell out "headroom," "aspect ratio," and "don't copy live-action" in the prompt. If you'll add text later, asking for top headroom saves you work downstream.
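To make the "text back and no image" case easier to spot, the save loop can be wrapped in a small helper that fails loudly and shows whatever text the model returned. This is a sketch under assumptions: extract_images is a hypothetical name, and it reads only the candidates, content.parts, inline_data.data, and text fields, defensively, since any of them may be missing on an empty or blocked response.

```python
def extract_images(resp) -> list[bytes]:
    """Return the bytes of every image part; raise with the model's text if none exist."""
    images, texts = [], []
    for cand in getattr(resp, "candidates", None) or []:
        content = getattr(cand, "content", None)
        for part in getattr(content, "parts", None) or []:
            inline = getattr(part, "inline_data", None)
            if inline is not None and inline.data:
                images.append(inline.data)   # raw image bytes
            elif getattr(part, "text", None):
                texts.append(part.text)      # keep text for the error message
    if not images:
        detail = " ".join(texts) or "no parts returned"
        raise ValueError(f"no image in response: {detail}")
    return images
```

With this in place, saving becomes a loop over extract_images(resp), and a missing IMAGE entry in response_modalities surfaces as a ValueError carrying the model's text reply instead of a silent "saved 0 image(s)".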
