API / SDK | 2026-05-14 | Intermediate

Integrating Gemini TTS API into SwiftUI — Two AVAudioEngine Pitfalls I Hit

A practical guide to playing Gemini TTS API's raw PCM audio in SwiftUI using AVAudioEngine. Covers the two hidden pitfalls around PCM format handling and AVAudioSession timing that the official docs don't mention.

Tags: gemini-api, tts, swiftui, ios, avaudioengine, text-to-speech

I've been building iOS apps since 2014. Across wallpaper apps, wellness apps, and attraction-focused apps — eventually passing 50 million total downloads — I kept pushing one feature to the backlog: natural-sounding text-to-speech. Apple's AVSpeechSynthesizer works fine for simple use cases, but the voice quality has a ceiling that's hard to ignore when your users are listening for more than a few seconds.

When I tried wiring up Gemini's TTS API to a SwiftUI app, I ran into two issues that weren't in any documentation I could find. This post covers exactly those two problems and how I resolved them.

What Gemini TTS API Actually Returns

The API returns audio in raw PCM format: signed 16-bit, 24000 Hz, mono. Not MP3. Not AAC. Raw PCM bytes, Base64-encoded inside the response JSON.

You can confirm this quickly with a Python test:

from google import genai  # google-genai SDK (pip install google-genai)
 
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Thank you for using our app today.",
    config={
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {
                    "voice_name": "Aoede"
                }
            }
        }
    }
)
 
audio = response.candidates[0].content.parts[0].inline_data.data
# Raw PCM (16-bit signed, 24000 Hz, mono). The Python SDK returns bytes here;
# the same field is a Base64 string in the raw REST JSON.
print(f"Audio data length: {len(audio)} bytes")

Knowing the exact format — 16-bit signed, 24000 Hz, mono — is essential before you touch a single line of Swift.
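The format also fixes the arithmetic between payload size and playback time, which is handy for sanity-checking a response. Here is a minimal sketch; the function and constant names are hypothetical, and the only inputs are the format values stated above.

```swift
import Foundation

// Format values taken from the article: 24000 Hz, mono, signed 16-bit.
let sampleRate = 24_000.0   // frames per second
let bytesPerFrame = 2       // 16-bit mono = 2 bytes per frame

/// Playback duration in seconds for a given number of raw PCM bytes.
func pcmDuration(byteCount: Int) -> Double {
    Double(byteCount / bytesPerFrame) / sampleRate
}

/// Length of the Base64 text that encodes `byteCount` raw bytes
/// (4 characters per 3 bytes, padded up).
func base64Length(byteCount: Int) -> Int {
    ((byteCount + 2) / 3) * 4
}

let threeSeconds = 3 * 24_000 * 2              // 144000 bytes of PCM
print(pcmDuration(byteCount: threeSeconds))    // 3.0
print(base64Length(byteCount: threeSeconds))   // 192000
```

If the decoded byte count implies a duration far from what the text should take to speak, the format assumption is the first thing to recheck.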

Pitfall #1: AVAudioPlayer Won't Accept Raw PCM

My first instinct was to grab the Base64 string, decode it to Data, and feed it into AVAudioPlayer(data:). That's the simplest path, and it fails silently (or throws an unhelpful error).

// ❌ This does not work — AVAudioPlayer expects a container format
let audioData = Data(base64Encoded: base64String)!
let player = try AVAudioPlayer(data: audioData)  // Error: OSStatus -50
player.play()

AVAudioPlayer expects file-format containers like WAV or MP3. It doesn't know how to interpret headerless PCM.

You have two options:

  • Option A: Prepend a WAV header to the raw PCM data and use AVAudioPlayer
  • Option B: Use AVAudioEngine + AVAudioPlayerNode to play the PCM buffer directly

I went with Option B. It takes a few more lines upfront, but it gives you more control — useful when you want to add effects or mix audio later.
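For completeness, Option A can be sketched as follows. This is not the route the article takes; it is a minimal illustration that assumes the standard 44-byte RIFF/WAVE header for linear PCM, and `makeWAV` is a hypothetical helper name.

```swift
import Foundation

/// Wraps headerless PCM in a minimal 44-byte RIFF/WAVE header.
/// The defaults mirror the format described in this article.
func makeWAV(fromPCM pcm: Data,
             sampleRate: UInt32 = 24_000,
             channels: UInt16 = 1,
             bitsPerSample: UInt16 = 16) -> Data {
    let blockAlign = channels * (bitsPerSample / 8)   // bytes per frame
    let byteRate = sampleRate * UInt32(blockAlign)    // bytes per second
    let dataSize = UInt32(pcm.count)

    var wav = Data()
    // Appends an integer in little-endian byte order, as WAV requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { wav.append(contentsOf: $0) }
    }

    wav.append(contentsOf: "RIFF".utf8)
    append(36 + dataSize)          // size of everything after this field
    wav.append(contentsOf: "WAVE".utf8)
    wav.append(contentsOf: "fmt ".utf8)
    append(UInt32(16))             // fmt chunk size for linear PCM
    append(UInt16(1))              // audio format 1 = linear PCM
    append(channels)
    append(sampleRate)
    append(byteRate)
    append(blockAlign)
    append(bitsPerSample)
    wav.append(contentsOf: "data".utf8)
    append(dataSize)
    wav.append(pcm)
    return wav
}
```

On iOS the result could then be passed to AVAudioPlayer(data:), for example `try AVAudioPlayer(data: makeWAV(fromPCM: pcmData))`, keeping a strong reference to the player for as long as it plays.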

The Working Implementation with AVAudioEngine

import SwiftUI
import AVFoundation
 
class GeminiTTSPlayer: ObservableObject {
    private var engine = AVAudioEngine()
    private var playerNode = AVAudioPlayerNode()
 
    // Match Gemini TTS API output: 24000 Hz, mono, signed 16-bit PCM
    private let geminiFormat = AVAudioFormat(
        commonFormat: .pcmFormatInt16,
        sampleRate: 24000,
        channels: 1,
        interleaved: true
    )!
 
    init() {
        engine.attach(playerNode)
        engine.connect(playerNode, to: engine.mainMixerNode, format: geminiFormat)
    }
 
    func play(pcmData: Data) throws {
        // Configure audio session right before playback (see Pitfall #2)
        try configureAudioSession()
 
        if !engine.isRunning {
            try engine.start()
        }
 
        // Build AVAudioPCMBuffer from raw PCM bytes
        let frameCount = UInt32(pcmData.count) / 2  // 16-bit = 2 bytes per frame
        guard let buffer = AVAudioPCMBuffer(
            pcmFormat: geminiFormat,
            frameCapacity: frameCount
        ) else {
            throw TTSError.bufferCreationFailed
        }
 
        buffer.frameLength = frameCount
 
        // Copy the 16-bit samples into the buffer's single (mono) channel.
        // copyBytes avoids assuming the Data storage is 2-byte aligned.
        if let dst = buffer.int16ChannelData?[0] {
            let target = UnsafeMutableBufferPointer(start: dst, count: Int(frameCount))
            _ = pcmData.copyBytes(to: target)
        }
 
        playerNode.scheduleBuffer(buffer, completionHandler: nil)
        playerNode.play()
    }
 
    private func configureAudioSession() throws {
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.playback, mode: .spokenAudio)
        try session.setActive(true)
    }
 
    enum TTSError: Error {
        case bufferCreationFailed
    }
}
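One caveat worth planning for: there are reports of AVAudioEngine rejecting 16-bit integer formats on the connection to the mixer in some environments. Treat that as an assumption to verify on your own targets rather than a documented rule. If it happens, a common workaround is to convert the samples to Float32 and use a float format instead. The conversion itself needs only Foundation; `floatSamples(fromInt16PCM:)` is a hypothetical helper, and it assumes the PCM is little-endian.

```swift
import Foundation

/// Converts little-endian signed 16-bit PCM bytes to Float samples
/// in the range [-1.0, 1.0). A trailing odd byte, if any, is ignored.
func floatSamples(fromInt16PCM pcm: Data) -> [Float] {
    let bytes = [UInt8](pcm)
    let sampleCount = bytes.count / 2
    return (0..<sampleCount).map { i -> Float in
        // Reassemble each sample from its low and high byte.
        let bits = UInt16(bytes[2 * i]) | (UInt16(bytes[2 * i + 1]) << 8)
        return Float(Int16(bitPattern: bits)) / 32768.0
    }
}
```

The resulting samples would be copied into `floatChannelData` of an AVAudioPCMBuffer created with `AVAudioFormat(standardFormatWithSampleRate: 24000, channels: 1)`, and the player node connected with that same format.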

Pitfall #2: AVAudioSession Timing on Real Devices

If you move configureAudioSession() into init() (which feels natural for setup code), you'll get a frustrating bug: it works in the Simulator, but there's no audio on a real device.

The issue is a race between your app's audio session initialization and the system's audio routing at startup. The fix: always call setCategory and setActive(true) immediately before the playback call, not during initialization.

The rule I follow:

  1. Call AVAudioSession.setCategory and setActive(true)
  2. Then call AVAudioEngine.start()
  3. Then schedule your buffer and call play()

This ordering is exactly what the code above does — configureAudioSession() runs at the top of play(pcmData:), guaranteeing the correct sequence every time.

Connecting to SwiftUI

struct ContentView: View {
    @StateObject private var ttsPlayer = GeminiTTSPlayer()
    @State private var isLoading = false
 
    var body: some View {
        VStack(spacing: 24) {
            Text("Gemini TTS Demo")
                .font(.title2)
 
            Button(action: {
                Task { await speakWithGemini("Thank you for visiting today. We hope you enjoy the experience.") }
            }) {
                Label(
                    isLoading ? "Generating..." : "Read Aloud",
                    systemImage: "speaker.wave.2"
                )
                .padding()
                .frame(maxWidth: .infinity)
                .background(Color.blue)
                .foregroundColor(.white)
                .cornerRadius(12)
            }
            .disabled(isLoading)
        }
        .padding()
    }
 
    func speakWithGemini(_ text: String) async {
        isLoading = true
        defer { isLoading = false }
 
        do {
            let pcmData = try await GeminiTTSAPI.generateAudio(text: text)
            try ttsPlayer.play(pcmData: pcmData)
        } catch {
            print("TTS error: \(error)")
        }
    }
}

The GeminiTTSAPI.generateAudio call hits Gemini's REST endpoint via URLSession, extracts the Base64 audio string from the response JSON, decodes it, and returns the raw Data. As of this writing, the official Gemini Swift SDK doesn't have first-party TTS support, so a direct REST call is the way to go.
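The article does not show GeminiTTSAPI itself, so the following is a hypothetical sketch of what such a helper might look like, not the author's code. It assumes the public generateContent REST endpoint, the x-goog-api-key header, and the camelCase JSON field names used by the REST API; the type names, the default voice, and the error cases are illustrative.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives here on non-Apple platforms
#endif

/// Hypothetical sketch of the GeminiTTSAPI helper referenced above.
enum GeminiTTSAPI {
    enum APIError: Error { case badStatus(Int), noAudio }

    // Only the response fields needed to reach the Base64 audio payload.
    struct Response: Decodable {
        struct Candidate: Decodable { let content: Content }
        struct Content: Decodable { let parts: [Part] }
        struct Part: Decodable { let inlineData: InlineData? }
        struct InlineData: Decodable { let data: String }
        let candidates: [Candidate]
    }

    /// Builds the JSON request body for a single-speaker TTS call.
    static func requestBody(text: String, voice: String) throws -> Data {
        let generationConfig: [String: Any] = [
            "responseModalities": ["AUDIO"],
            "speechConfig": ["voiceConfig": ["prebuiltVoiceConfig": ["voiceName": voice]]]
        ]
        let body: [String: Any] = [
            "contents": [["parts": [["text": text]]]],
            "generationConfig": generationConfig
        ]
        return try JSONSerialization.data(withJSONObject: body)
    }

    /// Extracts and Base64-decodes the first audio part of a response.
    static func decodeAudio(fromResponseJSON json: Data) throws -> Data {
        let response = try JSONDecoder().decode(Response.self, from: json)
        guard let b64 = response.candidates.first?.content.parts
                .compactMap({ $0.inlineData?.data }).first,
              let pcm = Data(base64Encoded: b64) else {
            throw APIError.noAudio
        }
        return pcm
    }

    /// Sends the request and returns raw PCM bytes.
    static func generateAudio(text: String,
                              voice: String = "Aoede",
                              apiKey: String = "YOUR_GEMINI_API_KEY") async throws -> Data {
        let url = URL(string: "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent")!
        var request = URLRequest(url: url)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.setValue(apiKey, forHTTPHeaderField: "x-goog-api-key")
        request.httpBody = try requestBody(text: text, voice: voice)

        let (data, response) = try await URLSession.shared.data(for: request)
        if let http = response as? HTTPURLResponse, http.statusCode != 200 {
            throw APIError.badStatus(http.statusCode)
        }
        return try decodeAudio(fromResponseJSON: data)
    }
}
```

Splitting the body builder and the decoder out of the network call keeps both testable without hitting the API. In a shipping app the key would normally not be embedded in the binary; routing the request through your own backend is the usual alternative.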

Honest Impressions After Using It in Production

I wired this up to read aloud the "today's featured wallpaper" description text in one of my wallpaper apps. Compared to AVSpeechSynthesizer, the voice quality difference is genuinely noticeable — especially for longer Japanese sentences, where Gemini's prosody stays natural much longer before it starts sounding robotic.

The tradeoff is API latency. For 50 characters of Japanese text, I'm seeing 500 ms to 1 second of delay before audio starts. For non-real-time use cases (reading out article summaries, descriptions, tips) that's acceptable. For anything conversational or reactive, you'd want to look at the Gemini Live API and its streaming audio approach instead.

As an indie developer, this kind of "extra effort" feature is often where the real differentiation happens. The SwiftUI integration itself is straightforward once you know the two pitfalls — everything else falls into place.

Key Takeaways

  • Gemini TTS API outputs raw PCM: 16-bit signed, 24000 Hz, mono — no container format
  • AVAudioPlayer(data:) won't work — use AVAudioEngine + AVAudioPCMBuffer
  • Call AVAudioSession.setActive(true) immediately before AVAudioEngine.start(), not in init()
  • Test on a real device early — Simulator behavior diverges from hardware for audio

Start with a short piece of text to verify your setup, then test across different audio scenarios: speaker, wired headphones, Bluetooth. Audio routing on iOS has edge cases that only show up on real hardware.
