Letting Gemini Listen to a Long Track and Build Its Chapters — Timestamped Structured Extraction
How I replaced hours of hand-chaptering long healing-audio tracks with Gemini's audio understanding: uploading long files via the Files API, pinning JSON output with response_schema, and the validation code that catches audio-specific quirks like timestamp drift and phantom silence.
In a healing-sound app I run as an indie developer, marking "chapters" on long tracks — 40 to 80 minutes each — had always been manual work. The seam where the waves recede and a piano enters, the stretch of near-silent resonance, the spot where narration begins: I'd listen, take notes, and copy the playback seconds out by hand. Fifteen minutes per track, and a week with a batch of new releases ate half a day.
This is a record of trying to hand that "a human picks positions by ear" step straight to Gemini's audio understanding. It is not about transcription. The goal was to play the audio and get back structured data where playback position (a timestamp) is tied to content — things like "00:00–04:30 ambient intro," "near-silent resonance from 12:10."
The short version: it became usable. But audio carries a few habits you must not trust blindly, and it only made it into production once I wrapped it in validation. Here's the whole path.
Why audio understanding rather than a transcription tool
My first thought was to pair a dedicated transcription API with silence detection. I dropped it for a simple reason: what I want is not "the words" but "the seams between scenes." Healing tracks are mostly stretches with no speech at all, so transcription comes up empty. Gemini's audio understanding, on the other hand, takes the sound itself as context and returns non-verbal descriptions like "ambient-dominant" or "a repeating piano motif." That was the turning point.
On top of that, because I can lock the output with response_schema, the downstream app (chapter-jump UI, silence trimming) gets JSON it can eat safely. It ended up far shorter than a two-stage transcription-plus-heuristics approach.
Prerequisites and a feel for the cost
I use the newer google-genai SDK. Audio is billed at roughly 32 tokens per second, so an 80-minute track is about 150k input tokens before anything else. That is not negligible, and re-submitting many times quietly adds up. I run exploration and chaptering on gemini-flash-latest (an alias that points to 3.5 Flash as of June 2026) and pin a dated model in production. Aliases swap underneath you, so for steps that need stable output, pinning is the safe choice.
Here are tokens and rough latencies I measured on my own tracks. Model pricing shifts, so read this for the "it scales with length" feel rather than absolute cost.
| Track length | Input tokens (measured) | One chaptering pass | Notes |
| --- | --- | --- | --- |
| 8 min | ~15,000 | 6–9 s | fits even as an inline send |
| 42 min | ~80,000 | 18–26 s | Files API recommended |
| 78 min | ~150,000 | 30–48 s | Files API required; re-sends hurt |
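The token column follows almost directly from the per-second rate. Here is a minimal sketch of that arithmetic, assuming the roughly 32 tokens per second quoted above. The function name and constant are illustrative, not part of the SDK, and the rate should be checked against current documentation.

```python
TOKENS_PER_SECOND = 32  # approximate audio rate quoted in this article (an assumption)


def estimate_audio_tokens(minutes: float, tokens_per_second: int = TOKENS_PER_SECOND) -> int:
    """Rough input-token estimate for an audio track of the given length."""
    return round(minutes * 60 * tokens_per_second)


for m in (8, 42, 78):
    print(f"{m:>2} min -> ~{estimate_audio_tokens(m):,} tokens")
```

For 8, 42, and 78 minutes this gives about 15,360, 80,640, and 149,760 tokens, in line with the measured column. Multiplying by the number of re-submissions shows quickly how prompt tuning on a full-length track adds up.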
Audio over 20MB can't be attached directly to a request, so long files are uploaded via the Files API and then referenced. My WAV tracks run to tens of megabytes, so in practice everything goes through the Files API.
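If some of your tracks are short enough to send inline, a size gate can route each file. This is a sketch only: `needs_files_api` is a hypothetical helper, and the 20MB figure is the ceiling described above, treated here as a rough threshold on the file alone with no headroom for the prompt.

```python
import os

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # the 20MB inline ceiling described above


def needs_files_api(path: str, limit_bytes: int = INLINE_LIMIT_BYTES) -> bool:
    """Return True when the file is too large to attach directly to a request."""
    return os.path.getsize(path) > limit_bytes
```

In my case the check is moot, since every WAV exceeds the limit, but it keeps a mixed library of short and long tracks on one code path.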
Step 1: upload via the Files API and wait for ACTIVE
First, upload the file and wait until it becomes processable (ACTIVE). Skip this and you'll submit a file that is still in PROCESSING right after upload, and the request fails with an error. That was my first failure.
import time
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

def upload_and_wait(path: str, timeout_s: int = 300):
    """Upload audio and wait until it is ACTIVE."""
    f = client.files.upload(file=path)
    deadline = time.time() + timeout_s
    while f.state.name == "PROCESSING":
        if time.time() > deadline:
            raise TimeoutError(f"file stuck in PROCESSING: {f.name}")
        time.sleep(2)
        f = client.files.get(name=f.name)
    if f.state.name != "ACTIVE":
        raise RuntimeError(f"upload failed: state={f.state.name}")
    return f

audio = upload_and_wait("relaxing_session_42min.wav")
print("ready:", audio.name)  # e.g. ready: files/abcd1234
Just adding the loop that waits out PROCESSING made the sporadic failures on long files almost vanish. Uploaded files expire server-side (roughly 48 hours), so I do all the chaptering in one pass right after upload.
Step 2: pin timestamped JSON with response_schema
This is the core. Asking for JSON in the prompt alone tends to mix in prose before and after, or let key names drift. Declaring the type with response_schema and setting response_mime_type to JSON gives output the next stage can json.loads without flinching.
from pydantic import BaseModel
from google.genai import types

class Chapter(BaseModel):
    start: str  # "MM:SS"
    end: str    # "MM:SS"
    label: str  # short title (e.g. "ambient intro")
    kind: str   # one of: ambient / music / narration / near_silence

class ChapterList(BaseModel):
    chapters: list[Chapter]

PROMPT = """Listen to this track and split it into chapters at the seams where
the scene changes. Each chapter has a start and end time in MM:SS, a short
English title, and a kind. The kind is one of ambient (mostly ambient sound),
music (a melody present), narration (speech), or near_silence (near-silent
resonance). Do not invent segments that are not there; list what you heard in
chronological order."""

resp = client.models.generate_content(
    model="gemini-flash-latest",
    contents=[audio, PROMPT],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=ChapterList,
        temperature=0.2,  # favor reproducibility for extraction
    ),
)

data = ChapterList.model_validate_json(resp.text)
for c in data.chapters:
    print(f"{c.start}-{c.end} [{c.kind}] {c.label}")
I keep temperature low because if the chapter boundaries shift every time I re-submit the same track, the app's diffing falls apart. This is extraction, not creation, so not letting it wander made operations easier.
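One way to check whether boundaries really stay put is to compare the chapter starts from two runs over the same track. The sketch below assumes chapters as plain dicts with MM:SS strings. `boundary_shift` and its 3-second tolerance are hypothetical choices for illustration, not something the SDK provides.

```python
def to_sec(mmss: str) -> int:
    """Convert an "MM:SS" string to seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)


def boundary_shift(run_a, run_b, tolerance_s: int = 3):
    """Return starts from run_a (in seconds) with no start in run_b within tolerance_s."""
    starts_b = [to_sec(c["start"]) for c in run_b]
    unmatched = []
    for c in run_a:
        a = to_sec(c["start"])
        if not any(abs(a - b) <= tolerance_s for b in starts_b):
            unmatched.append(a)
    return unmatched


run1 = [{"start": "00:00"}, {"start": "04:30"}, {"start": "12:10"}]
run2 = [{"start": "00:00"}, {"start": "04:32"}, {"start": "13:05"}]
print(boundary_shift(run1, run2))  # [730]
```

An empty list means every boundary in the first run has a counterpart in the second. Anything left over is a boundary the app's diff would see as moved.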
Step 3: validate the timestamps before trusting them
Skip this and you'll get burned. Audio understanding is handy, but its timestamps have habits. Three I actually hit:
First, the end time can exceed the track's real length. On a 78-minute track I'd occasionally get a segment like "music until 82:15." Second, chapters overlap or arrive out of order. Third, phantom near_silence — silence that isn't there — which I saw most in the back half of long tracks.
So I take the real duration (via ffprobe or similar) as an upper bound and clamp, sort, and drop mechanically.
def mmss_to_sec(s: str) -> int:
    m, sec = s.split(":")
    return int(m) * 60 + int(sec)

def sec_to_mmss(t: int) -> str:
    return f"{t // 60:02d}:{t % 60:02d}"

def sanitize(chapters, duration_sec: int):
    cleaned = []
    for c in chapters:
        a, b = mmss_to_sec(c.start), mmss_to_sec(c.end)
        a = max(0, min(a, duration_sec))
        b = max(0, min(b, duration_sec))
        if b - a < 5:  # drop absurdly short (<5s) chapters
            continue
        cleaned.append((a, b, c.label, c.kind))
    cleaned.sort(key=lambda x: x[0])  # order by start time

    # resolve overlaps by pushing each start to the previous end
    fixed = []
    prev_end = 0
    for a, b, label, kind in cleaned:
        a = max(a, prev_end)
        if b <= a:
            continue
        fixed.append({"start": sec_to_mmss(a), "end": sec_to_mmss(b),
                      "label": label, "kind": kind})
        prev_end = b
    return fixed

clean = sanitize(data.chapters, duration_sec=78 * 60 + 12)
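The duration upper bound has to come from the file itself, not from the model. ffprobe is the general tool for that. For uncompressed PCM WAV like my tracks, Python's standard `wave` module is enough, and the sketch below assumes that format. `wav_duration_sec` is an illustrative name.

```python
import wave


def wav_duration_sec(path: str) -> int:
    """Length of a PCM WAV file in whole seconds, rounded down."""
    with wave.open(path, "rb") as w:
        return w.getnframes() // w.getframerate()
```

Rounding down keeps the bound conservative: a clamped end time never lands past the last full second of audio. For compressed formats such as MP3 or AAC this function does not apply, and ffprobe or a similar tool is the safer route.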
It helps to measure the drift, too. On my own tracks over 60 minutes, end times that exceeded the real length showed up zero to two per job, and even counting starts that slipped a few seconds, almost nothing inconsistent survived sanitize. Holding the chapter-jump failures to roughly 0% just by adding validation is my honest takeaway from running this in production.
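A small counter over the raw output, run before any clamping, is one way to put numbers on that drift. This is a sketch under assumptions: chapters are plain dicts (for the pydantic objects above, something like `c.model_dump()` would produce them), and `drift_report` with its three categories is my own illustrative breakdown.

```python
def to_sec(mmss: str) -> int:
    """Convert an "MM:SS" string to seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)


def drift_report(chapters, duration_sec: int) -> dict:
    """Count timestamp problems in raw chapters, in the order the model returned them."""
    report = {"over_length": 0, "out_of_order": 0, "overlaps": 0}
    prev_start, prev_end = -1, 0
    for c in chapters:
        a, b = to_sec(c["start"]), to_sec(c["end"])
        if b > duration_sec:
            report["over_length"] += 1   # end past the real track length
        if a < prev_start:
            report["out_of_order"] += 1  # starts before the previous chapter started
        elif a < prev_end:
            report["overlaps"] += 1      # starts before the previous chapter ended
        prev_start, prev_end = a, b
    return report


raw = [
    {"start": "00:00", "end": "04:30"},
    {"start": "04:20", "end": "12:10"},
    {"start": "70:00", "end": "82:15"},
    {"start": "12:10", "end": "70:00"},
]
print(drift_report(raw, 78 * 60 + 12))
# {'over_length': 1, 'out_of_order': 1, 'overlaps': 1}
```

Logging this per job makes it visible whether a model or prompt change shifts the error rate, independent of what sanitize later repairs.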
Unglamorous, but whether you wrap these thirty-odd lines around the output is the difference between "the chapter-jump UI breaks now and then" and "it doesn't." Treat the model's output as raw data and guarantee the final shape in your own code. That is a rule I hold to in any structured extraction, audio or otherwise.
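The same principle applies to string formats. The schema above declares start, end, and kind as plain strings, so "MM:SS" and the four kind labels rest on the prompt alone, and a value like "1:02:15" would make the MM:SS parser raise. A format check ahead of parsing could look like this sketch. `well_formed`, the regex, and the dict input are assumptions for illustration.

```python
import re

MMSS = re.compile(r"\d{1,3}:[0-5]\d")  # minutes may exceed 59 on long tracks
KINDS = {"ambient", "music", "narration", "near_silence"}


def well_formed(chapter: dict) -> bool:
    """True when start/end look like MM:SS and kind is one of the four labels."""
    return (
        MMSS.fullmatch(str(chapter.get("start", ""))) is not None
        and MMSS.fullmatch(str(chapter.get("end", ""))) is not None
        and chapter.get("kind") in KINDS
    )


print(well_formed({"start": "00:00", "end": "04:30", "kind": "ambient"}))    # True
print(well_formed({"start": "1:02:15", "end": "1:04:00", "kind": "music"}))  # False
```

Dropping or logging the chapters that fail this check keeps a single malformed entry from aborting the whole job.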
Step 4: turn near_silence into trimming hints
As a byproduct, the kind == "near_silence" segments became candidates for "where to cut the tail resonance." My tracks often end with a long fade, and trimming store-preview clips by hand was tedious. Pulling near_silence from the validated chapter list gives a draft of the trim points automatically.
duration_sec = 78 * 60 + 12  # real track length in seconds, the same value passed to sanitize
tail_silence = [
    c for c in clean
    if c["kind"] == "near_silence"
    and mmss_to_sec(c["start"]) > duration_sec * 0.8
]
# use as candidate fade-out start points when generating previews
The key is to treat it strictly as a draft. The final cut point I still confirm by ear, once. Rather than automating everything, the realistic target was to drop the human check from fifteen minutes per track to one.
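To hand that single listening check one concrete position, the tail candidates can be reduced to a single number. The sketch below picks the earliest tail near_silence start and falls back to the track end. That selection rule and the name `fade_out_candidate` are my assumptions for illustration, and picking the latest candidate would be just as defensible.

```python
def to_sec(mmss: str) -> int:
    """Convert an "MM:SS" string to seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)


def fade_out_candidate(tail_silence, duration_sec: int) -> int:
    """Earliest tail near_silence start in seconds; the track end if there is none."""
    starts = [to_sec(c["start"]) for c in tail_silence]
    return min(starts, default=duration_sec)


tail = [{"start": "74:40", "end": "78:12", "kind": "near_silence"}]
print(fade_out_candidate(tail, 78 * 60 + 12))  # 4480
```

The returned second is only where to start listening. The cut itself stays a human decision.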
The spots that trip people up
In the order they cost me time: missing the PROCESSING state and submitting too early; forgetting response_schema and getting prose mixed in; and above all, piping timestamps to the app without validation and having chapter jumps exceed the real length. That last one hid during testing because I used short clips, and only surfaced on production-length tracks. Long-file behavior doesn't show up unless you test with long files.
A word on token billing for audio input, too. During prompt tuning, re-submitting the same 78-minute track stacks roughly 150k input tokens per pass. I now tune prompts on an 8-minute excerpt and run the full-length track in a single pass once the prompt is settled. You can reuse the uploaded file, but the audio's input tokens are counted again on every generation call.
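Cutting the tuning excerpt does not need an audio editor. For PCM WAV input, the standard `wave` module can copy the opening minutes into a new file. This is a sketch under that assumption, with `write_excerpt` as an illustrative name. It always takes the start of the track, which may not include the seams you most want to test.

```python
import wave


def write_excerpt(src: str, dst: str, seconds: int = 8 * 60) -> None:
    """Copy the first `seconds` of a PCM WAV file into a new file."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        # Reads the whole excerpt into memory; fine for a few minutes of audio.
        frames = r.readframes(min(r.getnframes(), seconds * r.getframerate()))
    with wave.open(dst, "wb") as w:
        w.setparams(params)    # the frame count in the header is corrected on close
        w.writeframes(frames)
```

Uploading the excerpt instead of the full track cuts the input tokens per tuning pass roughly tenfold for a 78-minute source.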
Next step
Take one longer track you have and run it through the minimal upload_and_wait plus response_schema code. Pipe the returned JSON through sanitize and just check that chapter starts and ends fall within the real duration — fifteen minutes is enough to judge whether this works on your own data. Catching "the seams between wordless scenes," which transcription misses, is the real reason to reach for audio understanding.