API / SDK · 2026-06-30 · Advanced

Letting Gemini Listen to a Long Track and Build Its Chapters — Timestamped Structured Extraction

How I replaced hours of hand-chaptering long healing-audio tracks with Gemini's audio understanding: uploading long files via the Files API, pinning JSON output with response_schema, and the validation code that catches audio-specific quirks like timestamp drift and phantom silence.

Tags: gemini-api, audio-understanding, structured-output, indie-dev, files-api


In a healing-sound app I run as an indie developer, marking "chapters" on long tracks (40 to 80 minutes each) had always been manual work. The seam where the waves recede and a piano enters, the stretch of near-silent resonance, the spot where narration begins: I'd listen, take notes, and copy the playback seconds out by hand. At roughly fifteen minutes per track, a week with a batch of new releases ate half a day.

This is a record of trying to hand that "a human picks positions by ear" step straight to Gemini's audio understanding. It is not about transcription. The goal was to play the audio and get back structured data where playback position (a timestamp) is tied to content — things like "00:00–04:30 ambient intro," "near-silent resonance from 12:10."
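To make that target concrete, here is a minimal sketch of the record I mean. The `Chapter` name, its fields, and the `mmss_to_seconds` helper are illustrative assumptions, not the app's actual types. Each entry ties a start and end position to a non-verbal description, and the MM:SS strings are converted to seconds for the player.

```python
from dataclasses import dataclass


def mmss_to_seconds(stamp: str) -> int:
    """Convert an 'MM:SS' string such as '12:10' to whole seconds."""
    minutes, seconds = stamp.strip().split(":")
    m, s = int(minutes), int(seconds)
    if m < 0 or not 0 <= s < 60:
        raise ValueError(f"invalid MM:SS timestamp: {stamp!r}")
    return m * 60 + s


@dataclass(frozen=True)
class Chapter:
    """One chapter: a playback range tied to a description (hypothetical shape)."""

    start: str  # 'MM:SS'; minutes may exceed 59 on long tracks
    end: str    # 'MM:SS'
    label: str  # e.g. 'ambient intro', 'near-silent resonance'

    @property
    def start_seconds(self) -> int:
        return mmss_to_seconds(self.start)

    @property
    def end_seconds(self) -> int:
        return mmss_to_seconds(self.end)


intro = Chapter(start="00:00", end="04:30", label="ambient intro")
```

Keeping the strings as the model returned them and deriving seconds on demand makes it easier to log exactly what came back when a timestamp looks wrong.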

The short version: it became usable. But the model's audio output has a few quirks you must not trust blindly, and it only made it into production once I wrapped it in validation. Here's the whole path.

Why audio understanding rather than a transcription tool

My first thought was to pair a dedicated transcription API with silence detection. I dropped it for a simple reason: what I want is not "the words" but "the seams between scenes." Healing tracks are mostly stretches with no speech at all, so transcription comes up empty. Gemini's audio understanding, on the other hand, takes the sound itself as context and returns non-verbal descriptions like "ambient-dominant" or "a repeating piano motif." That was the turning point.

On top of that, because I can lock the output with response_schema, the downstream app (chapter-jump UI, silence trimming) gets JSON it can eat safely. It ended up far shorter than a two-stage transcription-plus-heuristics approach.
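A schema fixes the shape of the output, not its values, so a range check still belongs downstream. The following is a sketch under assumptions: the field names (`start`, `end`, `label`) and the helper names are hypothetical, and the dict is written in the form the google-genai SDK accepts for `response_schema` alongside `response_mime_type`. The validator only covers ordering and track-length bounds, not every audio-specific failure.

```python
import json

# Hypothetical schema: field names are illustrative, not taken from the app.
CHAPTER_SCHEMA = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "start": {"type": "STRING", "description": "MM:SS"},
            "end": {"type": "STRING", "description": "MM:SS"},
            "label": {"type": "STRING"},
        },
        "required": ["start", "end", "label"],
    },
}

# Passed as the `config` argument of generate_content.
GENERATION_CONFIG = {
    "response_mime_type": "application/json",
    "response_schema": CHAPTER_SCHEMA,
}


def _seconds(stamp: str) -> int:
    """Parse 'MM:SS' into seconds; raises ValueError on malformed input."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def validate_chapters(raw_json: str, duration_s: int) -> list:
    """Parse the model's JSON and reject chapters that cannot be real.

    Each chapter must start at or after the previous one ends, must have
    positive length, and must end within the track duration.
    """
    chapters = json.loads(raw_json)
    previous_end = 0
    for chapter in chapters:
        start, end = _seconds(chapter["start"]), _seconds(chapter["end"])
        if not (previous_end <= start < end <= duration_s):
            raise ValueError(f"implausible chapter: {chapter}")
        previous_end = end
    return chapters
```

Raising on the first bad chapter is a deliberate choice for this sketch: a drifted timestamp usually means the whole pass is suspect, so re-running is safer than patching one entry.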

Prerequisites and a feel for the cost

I use the newer google-genai SDK. Audio is billed at roughly 32 tokens per second, so an 80-minute track is about 150k input tokens before anything else. That is not negligible, and re-submitting many times quietly adds up. I run exploration and chaptering on gemini-flash-latest (an alias that points to 3.5 Flash as of June 2026) and pin a dated model in production. Aliases swap underneath you, so for steps that need stable output, pinning is the safe choice.
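The arithmetic behind that estimate, as a small sketch. The 32 tokens per second figure is the approximate rate quoted above; the function names are mine, and prompt and output tokens are not included.

```python
AUDIO_TOKENS_PER_SECOND = 32  # approximate rate quoted in the text


def estimate_audio_tokens(minutes: float) -> int:
    """Rough input-token count for the audio alone."""
    return round(minutes * 60 * AUDIO_TOKENS_PER_SECOND)


def estimate_total_tokens(minutes: float, passes: int) -> int:
    """Every re-submission pays for the full audio again."""
    return estimate_audio_tokens(minutes) * passes


print(estimate_audio_tokens(80))     # 153600
print(estimate_total_tokens(80, 3))  # 460800
```

Three exploratory passes over one 80-minute track already approach half a million input tokens, which is why re-sends are worth counting.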

Here are the token counts and rough latencies I measured on my own tracks. Model pricing shifts, so read this for the "it scales with length" feel rather than absolute cost.

Track length | Input tokens (measured) | One chaptering pass | Notes
8 min        | ~15,000                 | 6–9 s               | Fits even as an inline send
42 min       | ~80,000                 | 18–26 s             | Files API recommended
78 min       | ~150,000                | 30–48 s             | Files API required; re-sends hurt

Audio over 20MB can't be attached directly to a request, so long files are uploaded via the Files API and then referenced. My WAV tracks run to tens of megabytes, so in practice everything goes through the Files API.
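A minimal sketch of that flow with the google-genai SDK follows. The function names and the `prompt` and `model` arguments are placeholders of mine. Treating the 20MB ceiling as a plain byte threshold on the file is a simplification, since the prompt also counts toward the request size. The SDK import is deferred so the size check runs without the package installed, and the client is assumed to pick up the API key from the environment.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # the 20MB ceiling mentioned above


def needs_files_api(size_bytes: int, limit: int = INLINE_LIMIT_BYTES) -> bool:
    """True when the audio is too large to attach inline."""
    return size_bytes > limit


def chapter_track(path: str, model: str, prompt: str) -> str:
    """Upload a long track via the Files API and ask the model about it.

    Requires the google-genai package and an API key in the environment.
    Returns the raw response text; parsing and validation happen elsewhere.
    """
    from google import genai  # deferred: only needed when actually calling

    client = genai.Client()
    uploaded = client.files.upload(file=path)  # returns a file reference
    response = client.models.generate_content(
        model=model,  # pin a dated model name here in production
        contents=[prompt, uploaded],
    )
    return response.text
```

Because the uploaded file is passed by reference, a second pass over the same track can reuse `uploaded` instead of sending tens of megabytes again, though the audio tokens are still billed on each call.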

