You screen-record a few minutes of app interaction, then later go hunting: "At what minute and second did I change that setting?" As an indie developer, I've burned a lot of time on that kind of scrubbing every time I record a bug repro. Video is information-dense, which also makes it the slowest medium to search by eye.
Gemini API video understanding turns this into a search that answers with a time. Hand it the whole video and ask it to "list where this action happens, in minutes and seconds," and it returns the timestamps of the matching scenes. Naively throwing a long video at it, though, balloons the token count, slows the response, and makes cost unpredictable. Below, we build a runnable implementation that pulls just the scenes you need, with timestamps, plus a design that curbs token consumption through FPS and resolution settings.
First, hand the video to the Files API
Videos over 20 MB can't be sent inline, so upload them to the Files API and reference the uploaded file. Right after upload, the file's state is still PROCESSING, so wait until it becomes ACTIVE before using it. Skip the wait and the request fails with a failed-precondition-style error.
import time

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment


def upload_and_wait(path: str, timeout: float = 300.0):
    """Upload a video and block until the Files API reports it ACTIVE."""
    f = client.files.upload(file=path)
    start = time.time()
    while f.state.name == "PROCESSING":
        if time.time() - start > timeout:
            raise TimeoutError(f"processing timed out: {f.name}")
        time.sleep(3)
        f = client.files.get(name=f.name)  # re-fetch; the state is not updated in place
    if f.state.name != "ACTIVE":
        raise RuntimeError(f"upload not active: {f.state.name}")
    return f


video = upload_and_wait("app_demo.mp4")
print("ready:", video.name, video.state.name)

If you forget to re-fetch the state with client.files.get, the loop keeps seeing the PROCESSING value you grabbed first and never exits until the timeout fires. Uploaded files auto-delete after 48 hours by default, so deleting them explicitly with client.files.delete(name=...) when you're done prevents both storage buildup and mix-ups.
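One way to make the explicit cleanup hard to forget is to tie upload, wait, and delete together in a context manager. The following is a minimal sketch rather than part of the walkthrough itself: the name temporary_upload and its poll_seconds parameter are hypothetical, and the client is passed in as an argument so the helper relies only on the client.files.upload, client.files.get, and client.files.delete calls already used above.

```python
import time
from contextlib import contextmanager


@contextmanager
def temporary_upload(client, path, timeout=300.0, poll_seconds=3.0):
    """Upload `path`, wait for ACTIVE, yield the file, then always delete it.

    Hypothetical helper; `client` is assumed to be a google.genai Client.
    """
    f = client.files.upload(file=path)
    try:
        deadline = time.monotonic() + timeout
        while f.state.name == "PROCESSING":
            if time.monotonic() > deadline:
                raise TimeoutError(f"processing timed out: {f.name}")
            time.sleep(poll_seconds)
            f = client.files.get(name=f.name)  # re-fetch to see the new state
        if f.state.name != "ACTIVE":
            raise RuntimeError(f"upload not active: {f.state.name}")
        yield f
    finally:
        # Runs on success, on timeout, and on errors raised inside the with-body.
        client.files.delete(name=f.name)
```

Used as `with temporary_upload(client, "app_demo.mp4") as video:`, the file is removed as soon as the block exits, so a crashed run does not leave stale uploads behind.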
Make it answer with timestamps
Now the main event. In the prompt, explicitly ask for answers "with the time in MM:SS format," and instruct structured output so you can feed the result straight into processing. Gemini understands the video's timeline, so it can answer time-dependent questions like "the first moment X appears on screen."
from google.genai import types

prompt = """From this screen recording, list the time each event first occurred in MM:SS.
- the moment the settings screen opened
- the moment the save button was pressed
- the moment the error dialog appeared
For each, return time(MM:SS) and what (a description under 40 chars).
Set time to null for events that don't occur."""

# Shape of the reply: one object per event; time may be null
schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "event": {"type": "string"},
            "time": {"type": "string", "nullable": True},
            "what": {"type": "string"},
        },
        "required": ["event", "what"],
    },
}

resp = client.models.generate_content(
    model="gemini-flash-latest",
    contents=[video, prompt],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=schema,
    ),
)

# resp.parsed is the JSON reply decoded into Python lists and dicts
for row in resp.parsed:
    print(f'{row.get("time") or "--:--"} {row["event"]}: {row["what"]}')

The key is fixing the time format to MM:SS in the prompt and further constraining the array shape with response_schema. With that, you can pass the returned time straight to ffmpeg -ss as a seek start, or cut a thumbnail of the matching frame, and wire the result into downstream automation. The fast, cheap gemini-flash-latest (the GA 3.5 Flash line) is plenty for this in practice. Preprocessing like video understanding is exactly the kind of work I route to the Flash line by default, since that is where speed and cost pay off.
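To illustrate the hand-off to ffmpeg, here is a small sketch that validates the MM:SS strings coming back from the model and builds a frame-grab command for each hit. The helper names (mmss_to_seconds, frame_grab_command), the sample rows, and the output file names are all hypothetical. The ffmpeg options used (-ss, -i, -frames:v, -y) are standard ones, and the command is only built and printed here, not executed.

```python
import re

# Accept "M:SS" through "MMM:SS"; seconds must be 00-59.
_MMSS = re.compile(r"^(\d{1,3}):([0-5]\d)$")


def mmss_to_seconds(value):
    """Return seconds for an 'MM:SS' string, or None if missing or malformed."""
    if not value:
        return None
    m = _MMSS.match(value.strip())
    if m is None:
        return None
    return int(m.group(1)) * 60 + int(m.group(2))


def frame_grab_command(video_path, seconds, out_path):
    """Build an ffmpeg argument list that saves one frame at `seconds`."""
    return [
        "ffmpeg", "-y",          # -y: overwrite the output file if it exists
        "-ss", str(seconds),     # placed before -i, so ffmpeg seeks in the input
        "-i", video_path,
        "-frames:v", "1",        # write a single video frame
        out_path,
    ]


# Sample rows in the shape requested by the schema (made-up values).
rows = [
    {"event": "settings opened", "time": "01:23", "what": "Settings screen shown"},
    {"event": "error dialog", "time": None, "what": "Did not occur"},
]

for i, row in enumerate(rows):
    seconds = mmss_to_seconds(row.get("time"))
    if seconds is None:
        continue  # skip events the model reported as absent or mis-formatted
    print(" ".join(frame_grab_command("app_demo.mp4", seconds, f"event_{i}.png")))
```

Rejecting anything that does not match MM:SS before it reaches ffmpeg keeps a stray free-text answer from turning into a broken command. To actually run a command, pass the list to subprocess.run(cmd, check=True).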
Curb tokens with FPS and resolution
The longer the clip, the more the sampling settings matter. Gemini ingests video at roughly one frame per second by default, and lowering fps in video_metadata reduces the number of frames it takes in, which cuts both token consumption and processing time. For slide-heavy recordings or slow interactions, accuracy barely drops even at a lower FPS.
from google.genai import types

# low FPS + low resolution to scan a long clip cheaply
part = types.Part(
    file_data=types.FileData(file_uri=video.uri, mime_type="video/mp4"),
    video_metadata=types.VideoMetadata(
        fps=0.5,             # one frame per 2 seconds; good for low-motion clips
        start_offset="30s",  # skip the first 30s, start at the meat
        end_offset="300s",   # stop at the 5-minute mark; clipping the range makes it lighter
    ),
)

resp = client.models.generate_content(
    model="gemini-flash-latest",
    contents=[part, "List 5 key actions in this span in order, with MM:SS."],
    config=types.GenerateContentConfig(
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_LOW,
    ),
)

print(resp.text)
print("tokens:", resp.usage_metadata.total_token_count)

My split is simple. The first pass, mapping out roughly where things are in a long video, runs cheap and wide at fps=0.5 and MEDIA_RESOLUTION_LOW. Then, only for the spans I zeroed in on, I clip to about 30 seconds with start_offset/end_offset and raise FPS and resolution for a close read. That keeps token usage to a fraction of what reading the whole video at high resolution would cost, while preserving accuracy where it counts. Logging usage_metadata.total_token_count on every call lets you compare, from measurements, how much each setting helped.
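The second pass needs a clip window around each coarse hit. A minimal sketch of that step follows, assuming the first-pass timestamps have already been converted to seconds. The names clip_windows and as_offset, the 30-second default width, and the sample hits are hypothetical; the resulting "Ns" strings are meant to be passed as start_offset and end_offset.

```python
def clip_windows(hit_seconds, video_length, width=30):
    """Return merged (start, end) second pairs, about `width` seconds per hit."""
    half = width / 2
    windows = []
    for t in sorted(hit_seconds):
        start = max(0, t - half)
        end = min(video_length, t + half)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it rather than re-read frames.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


def as_offset(seconds):
    """Format seconds as an offset string such as '68s' (fractions are dropped)."""
    return f"{int(seconds)}s"


# Made-up first-pass hits at 01:23, 01:30, and 04:10 in a 5-minute recording.
for start, end in clip_windows([83, 90, 250], video_length=300):
    print(as_offset(start), as_offset(end))
```

Merging overlapping windows matters when two events sit close together: one slightly longer close read is cheaper than two calls that cover the same frames twice.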
One caveat: raising fps improves detection of brief moments, but token usage grows in proportion to the number of frames sampled. Go high-FPS only when you must reliably catch something that vanishes within a frame or two, like a notification banner, and only for that span. That's the cost-effective way.
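A rough back-of-the-envelope estimate makes the linear growth concrete. This is only a sketch: estimate_video_tokens is a hypothetical helper, and the default per-frame and per-second-of-audio figures are assumptions that vary by model and media resolution. Treat them as placeholders, and rely on the measured usage_metadata.total_token_count for real numbers.

```python
def estimate_video_tokens(duration_s, fps=1.0, tokens_per_frame=258, audio_tokens_per_s=32):
    """Rough estimate: sampled frames plus audio.

    tokens_per_frame and audio_tokens_per_s are ASSUMED placeholder values;
    check them against current documentation and real measurements.
    """
    frames = duration_s * fps
    return round(frames * tokens_per_frame + duration_s * audio_tokens_per_s)


# Compare sampling rates for the same 30-second span.
for fps in (0.5, 1.0, 5.0):
    print(f"fps={fps}: ~{estimate_video_tokens(30, fps=fps)} tokens")
```

Under these assumptions the frame term dominates, so a tenfold increase in fps costs close to ten times the tokens for the same span. That is why the high-FPS read is worth confining to short clips.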
Shift to reading in two passes
Counterintuitively, the fast path in video understanding is not to close-read from the start. Build a map of the whole video at low FPS and low resolution, then re-read only the short spans you zeroed in on at high resolution. Have it return structured timestamps, and that map becomes direct input to downstream automation.
As a next step, take one screen recording you already have and read it first at fps=0.5 with "list the key events with MM:SS." Pass the returned times to ffmpeg -ss and you can pull just the matching frames right away. The time you used to lose to scrubbing goes straight back into development. I hope this gives a first foothold to anyone who, like me, has been buried in pulling information out of video.