API / SDK · 2026-06-30 · Advanced

Letting Gemini Listen to a Long Track and Build Its Chapters — Timestamped Structured Extraction

How I replaced hours of hand-chaptering long healing-audio tracks with Gemini's audio understanding: uploading long files via the Files API, pinning JSON output with response_schema, and the validation code that catches audio-specific quirks like timestamp drift and phantom silence.

In a healing-sound app I run as an indie developer, marking "chapters" on long tracks (40 to 80 minutes each) had always been manual work. The seam where the waves recede and a piano enters, the stretch of near-silent resonance, the spot where narration begins: I'd listen, take notes, and copy the playback seconds out by hand. At fifteen minutes per track, a week with a batch of new releases ate half a day.

This is a record of trying to hand that "a human picks positions by ear" step straight to Gemini's audio understanding. It is not about transcription. The goal was to play the audio and get back structured data where playback position (a timestamp) is tied to content — things like "00:00–04:30 ambient intro," "near-silent resonance from 12:10."

The short version: it became usable. But timestamps derived from audio have a few quirks you must not trust blindly, and the pipeline only made it into production once I wrapped it in validation. Here's the whole path.

Why audio understanding rather than a transcription tool

My first thought was to pair a dedicated transcription API with silence detection. I dropped it for a simple reason: what I want is not "the words" but "the seams between scenes." Healing tracks are mostly stretches with no speech at all, so transcription comes up empty. Gemini's audio understanding, on the other hand, takes the sound itself as context and returns non-verbal descriptions like "ambient-dominant" or "a repeating piano motif." That was the turning point.

On top of that, because I can lock the output with response_schema, the downstream app (chapter-jump UI, silence trimming) gets JSON it can consume safely. The pipeline ended up far shorter than a two-stage transcription-plus-heuristics approach.

Prerequisites and a feel for the cost

I use the newer google-genai SDK. Audio is billed at roughly 32 tokens per second, so an 80-minute track is about 150k input tokens before anything else. That is not negligible, and re-submitting many times quietly adds up. I run exploration and chaptering on gemini-flash-latest (an alias that points to 3.5 Flash as of June 2026) and pin a dated model in production. Aliases swap underneath you, so for steps that need stable output, pinning is the safe choice.

Here are tokens and rough latencies I measured on my own tracks. Model pricing shifts, so read this for the "it scales with length" feel rather than absolute cost.

Track length	Input tokens (measured)	Latency per chaptering pass	Notes
8 min	~15,000	6–9 s	fits even as an inline send
42 min	~80,000	18–26 s	Files API recommended
78 min	~150,000	30–48 s	Files API required; re-sends hurt
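
The token column follows closely from the roughly 32 tokens per second figure mentioned earlier. As a rough planning aid, a sketch like the following estimates input tokens from track length. The function name and the `passes` parameter are my own illustration, and the rate is the approximate figure quoted above, not an exact billing formula.

```python
AUDIO_TOKENS_PER_SEC = 32  # approximate rate quoted above; treat as an estimate

def estimate_audio_tokens(duration_sec: float, passes: int = 1) -> int:
    """Rough input-token estimate for sending the same track `passes` times."""
    return int(duration_sec * AUDIO_TOKENS_PER_SEC) * passes

for minutes in (8, 42, 78):
    print(minutes, "min ->", estimate_audio_tokens(minutes * 60), "tokens")
```

Passing `passes=3` for a 78-minute track shows quickly why repeated re-submissions during prompt tuning are worth avoiding.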

Audio over 20MB can't be attached directly to a request, so long files are uploaded via the Files API and then referenced. My WAV tracks run to tens of megabytes, so in practice everything goes through the Files API.


Step 1: hand the long file over via the Files API

First, upload and wait until the file becomes processable (ACTIVE). Skip this and you'll submit a file still in PROCESSING right after upload and get an error. That was my first failure.

import time
from google import genai
 
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
def upload_and_wait(path: str, timeout_s: int = 300):
    """Upload audio and wait until it is ACTIVE."""
    f = client.files.upload(file=path)
    deadline = time.time() + timeout_s
    while f.state.name == "PROCESSING":
        if time.time() > deadline:
            raise TimeoutError(f"file stuck in PROCESSING: {f.name}")
        time.sleep(2)
        f = client.files.get(name=f.name)
    if f.state.name != "ACTIVE":
        raise RuntimeError(f"upload failed: state={f.state.name}")
    return f
 
audio = upload_and_wait("relaxing_session_42min.wav")
print("ready:", audio.name)  # e.g. ready: files/abcd1234

Just adding the loop that waits out PROCESSING made the sporadic failures on long files almost vanish. Uploaded files expire server-side (roughly 48 hours), so I do all the chaptering in one pass right after upload.
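
If a batch job might run long after the upload, one option is to record when each upload finished and check that before reusing the file handle. This is a minimal sketch under the assumption that you store the upload time yourself; `still_usable`, the 48-hour constant, and the one-hour safety margin are hypothetical names and values for illustration, not part of the SDK.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FILE_TTL = timedelta(hours=48)      # approximate lifetime mentioned above
SAFETY_MARGIN = timedelta(hours=1)  # hypothetical margin; tune to your job length

def still_usable(uploaded_at: datetime, now: Optional[datetime] = None) -> bool:
    """True while the upload is comfortably inside its expected lifetime."""
    now = now or datetime.now(timezone.utc)
    return now - uploaded_at < FILE_TTL - SAFETY_MARGIN
```

When this returns False, re-uploading is the simple recovery path.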

Step 2: pin timestamped JSON with response_schema

This is the core. Asking for JSON in the prompt alone tends to mix in prose before and after, or let key names drift. Declaring the type with response_schema and setting response_mime_type to JSON gives output the next stage can json.loads without flinching.

from pydantic import BaseModel
from google.genai import types
 
class Chapter(BaseModel):
    start: str   # "MM:SS"
    end: str     # "MM:SS"
    label: str   # short title (e.g. "ambient intro")
    kind: str    # one of: ambient / music / narration / near_silence
 
class ChapterList(BaseModel):
    chapters: list[Chapter]
 
PROMPT = """Listen to this track and split it into chapters at the seams where
the scene changes. Each chapter has a start and end time in MM:SS, a short
English title, and a kind. The kind is one of ambient (mostly ambient sound),
music (a melody present), narration (speech), or near_silence (near-silent
resonance). Do not invent segments that are not there; list what you heard in
chronological order."""
 
resp = client.models.generate_content(
    model="gemini-flash-latest",
    contents=[audio, PROMPT],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=ChapterList,
        temperature=0.2,  # favor reproducibility for extraction
    ),
)
 
data = ChapterList.model_validate_json(resp.text)
for c in data.chapters:
    print(f"{c.start}-{c.end} [{c.kind}] {c.label}")

I keep temperature low because if the chapter boundaries shift every time I re-submit the same track, the app's diffing falls apart. This is extraction, not creation, so not letting it wander made operations easier.
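
The schema fixes the shape, but `start`, `end`, and `kind` are declared as plain strings, so their values can still stray from what the prompt asked for. A small value check can be layered on after parsing. This is a sketch with names of my own choosing (`field_problems`, `ALLOWED_KINDS`), assuming the four kinds listed in the prompt.

```python
import re

ALLOWED_KINDS = {"ambient", "music", "narration", "near_silence"}
MMSS = re.compile(r"\d{1,3}:[0-5]\d")  # minutes may exceed 59 on long tracks

def field_problems(chapter: dict) -> list[str]:
    """Return human-readable problems for one chapter; empty list means OK."""
    problems = []
    for key in ("start", "end"):
        value = chapter.get(key)
        if not isinstance(value, str) or not MMSS.fullmatch(value):
            problems.append(f"{key} is not MM:SS: {value!r}")
    if chapter.get("kind") not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {chapter.get('kind')!r}")
    return problems
```

With the Pydantic models above, each chapter can be checked via `field_problems(c.model_dump())`, and anything with a non-empty result can be logged or dropped before the timestamp validation in the next step.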

Step 3: validate the timestamps before trusting them

Skip this and you'll get burned. Audio understanding is handy, but its timestamps have quirks. Three I actually hit:

First, the end time can exceed the track's real length. On a 78-minute track I'd occasionally get a segment like "music until 82:15." Second, chapters overlap or arrive out of order. Third, phantom near_silence — silence that isn't there — which I saw most in the back half of long tracks.

So I take the real duration (via ffprobe or similar) as an upper bound and clamp, sort, and drop mechanically.
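
For the real duration, one way is to ask ffprobe for the container duration and floor it to whole seconds. The sketch below assumes ffprobe is installed and on PATH; the helper names are mine, and the parsing is split out so it can be checked without running ffprobe.

```python
import subprocess

def parse_ffprobe_duration(output: str) -> int:
    """Turn ffprobe's duration output (seconds as a decimal string) into whole seconds."""
    return int(float(output.strip()))

def probe_duration_sec(path: str) -> int:
    """Ask ffprobe for the container duration; requires ffprobe on PATH."""
    out = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ffprobe_duration(out)
```

Flooring rather than rounding keeps the upper bound from ever landing past the end of the file.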

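def mmss_to_sec(s: str) -> int:
    # Accepts "MM:SS"; also tolerates "H:MM:SS" in case the model returns it,
    # which would otherwise raise ValueError on unpacking.
    total = 0
    for part in s.strip().split(":"):
        total = total * 60 + int(part)
    return total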
 
def sec_to_mmss(t: int) -> str:
    return f"{t // 60:02d}:{t % 60:02d}"
 
def sanitize(chapters, duration_sec: int):
    cleaned = []
    for c in chapters:
        a, b = mmss_to_sec(c.start), mmss_to_sec(c.end)
        a = max(0, min(a, duration_sec))
        b = max(0, min(b, duration_sec))
        if b - a < 5:           # drop absurdly short (<5s) chapters
            continue
        cleaned.append((a, b, c.label, c.kind))
    cleaned.sort(key=lambda x: x[0])     # order by start time
    # resolve overlaps by pushing each start to the previous end
    fixed = []
    prev_end = 0
    for a, b, label, kind in cleaned:
        a = max(a, prev_end)
        if b <= a:
            continue
        fixed.append({"start": sec_to_mmss(a), "end": sec_to_mmss(b),
                      "label": label, "kind": kind})
        prev_end = b
    return fixed
 
duration_sec = 78 * 60 + 12  # real track length in seconds (e.g. from ffprobe)
clean = sanitize(data.chapters, duration_sec=duration_sec)

It helps to measure the drift, too. On my own tracks over 60 minutes, the raw output contained zero to two end times past the real length per job. Even counting starts that had slipped by a few seconds, almost nothing inconsistent survived sanitize. Adding validation alone held chapter-jump failures to roughly 0%, which is my honest takeaway from running this in production.
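
One way to do that measuring is to count problems in the raw output before any cleanup and log the counts per job. This is a sketch, not the exact bookkeeping behind the numbers above; `drift_report` and its three counters are names I made up, and it expects chapters as dicts with MM:SS `start` and `end`.

```python
def to_sec(mmss: str) -> int:
    """Convert "MM:SS" to seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

def drift_report(chapters: list[dict], duration_sec: int) -> dict:
    """Count raw-output problems, in the order the model returned the chapters."""
    report = {"over_length": 0, "out_of_order": 0, "overlaps": 0}
    prev_start, prev_end = 0, 0
    for c in chapters:
        start, end = to_sec(c["start"]), to_sec(c["end"])
        if end > duration_sec:
            report["over_length"] += 1      # end past the real track length
        if start < prev_start:
            report["out_of_order"] += 1     # starts before the previous chapter starts
        elif start < prev_end:
            report["overlaps"] += 1         # starts before the previous chapter ends
        prev_start, prev_end = start, end
    return report
```

Logged next to the track length, these counts make it easy to see whether drift really concentrates in longer tracks on your own data.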

Unglamorous, but wrapping these thirty-odd lines around the output is the difference between "the chapter-jump UI breaks now and then" and "it doesn't." Treat the model's output as raw data and guarantee the final shape in your own code. That is a rule I try to hold to in any structured extraction, audio or otherwise.

Step 4: turn near_silence into trimming hints

As a byproduct, the kind == "near_silence" segments became candidates for "where to cut the tail resonance." My tracks often end with a long fade, and trimming store-preview clips by hand was tedious. Pulling near_silence from the validated chapter list gives a draft of the trim points automatically.

tail_silence = [c for c in clean if c["kind"] == "near_silence"
                and mmss_to_sec(c["start"]) > duration_sec * 0.8]
# use as candidate fade-out start points when generating previews

The key is to treat it strictly as a draft. The final cut point I still confirm by ear, once. Rather than automating everything, the realistic target was to drop the human check from fifteen minutes per track to one.
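
To turn those candidates into a single draft value, one possible rule is to take the earliest near_silence start in the final stretch of the track. This is only a sketch of that rule; `suggest_fade_start` and the `tail_ratio` default are illustrative choices, and the result is still a draft to confirm by ear.

```python
from typing import Optional

def to_sec(mmss: str) -> int:
    """Convert "MM:SS" to seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

def suggest_fade_start(chapters: list[dict], duration_sec: int,
                       tail_ratio: float = 0.8) -> Optional[int]:
    """Earliest near_silence start (in seconds) after tail_ratio of the track, or None."""
    starts = [
        to_sec(c["start"]) for c in chapters
        if c["kind"] == "near_silence" and to_sec(c["start"]) > duration_sec * tail_ratio
    ]
    return min(starts) if starts else None
```

A None result simply means the validated list has no tail silence, in which case the preview falls back to a manual cut.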

The spots that trip people up

In the order they cost me time: missing the PROCESSING state and submitting too early; forgetting response_schema and getting prose mixed in; and above all, piping timestamps to the app without validation and having chapter jumps exceed the real length. That last one hid during testing because I used short clips, and only surfaced on production-length tracks. Long-file behavior doesn't show up unless you test with long files.

A word on token billing for audio input, too. During prompt tuning, every re-submission of the same 78-minute track costs another ~150k input tokens. I now tune prompts on an 8-minute excerpt and run the full-length track only once. You can reuse the uploaded file, but the audio's input tokens are counted again on every call.
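
Cutting the excerpt does not need extra tooling if the source is an uncompressed PCM WAV. The sketch below uses Python's standard `wave` module to copy the first few minutes; `write_excerpt` is a name I chose, it reads the excerpt into memory in one go, and it will not work on compressed or float-format WAV files, which `wave` does not read.

```python
import wave

def write_excerpt(src_path: str, dst_path: str, seconds: int = 8 * 60) -> float:
    """Copy the first `seconds` of a PCM WAV file; returns the excerpt length in seconds."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Never ask for more frames than the file actually has
        n_frames = min(src.getnframes(), seconds * src.getframerate())
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)      # same channels, sample width, and rate
        dst.writeframes(frames)    # header frame count is corrected on close
    return n_frames / params.framerate
```

The excerpt then goes through the same upload and chaptering code as a full track, at a fraction of the input tokens.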

Next step

Take one longer track you have and run it through the minimal upload_and_wait plus response_schema code. Pipe the returned JSON through sanitize and just check that chapter starts and ends fall within the real duration — fifteen minutes is enough to judge whether this works on your own data. Catching "the seams between wordless scenes," which transcription misses, is the real reason to reach for audio understanding.

If you want to push timestamp design further, reading video by frames is a close cousin: see "Controlling timestamp queries and FPS in Gemini API video understanding." For stabilizing structured output itself, "Guarding Gemini API structured output with schema validation" goes into detail.
