Integrating Gemini TTS API into SwiftUI — Two AVAudioEngine Pitfalls I Hit

I've been building iOS apps since 2014. Across wallpaper apps, wellness apps, and attraction-focused apps — eventually passing 50 million total downloads — I kept pushing one feature to the backlog: natural-sounding text-to-speech. Apple's AVSpeechSynthesizer works fine for simple use cases, but the voice quality has a ceiling that's hard to ignore when your users are listening for more than a few seconds.

When I tried wiring up Gemini's TTS API to a SwiftUI app, I ran into two issues that weren't in any documentation I could find. This post covers exactly those two problems and how I resolved them.

What the Gemini TTS API Actually Returns

The API returns audio in raw PCM format: signed 16-bit, 24000 Hz, mono. Not MP3. Not AAC. Raw PCM bytes, Base64-encoded inside the response JSON.

You can confirm this quickly with a Python test:

from google import genai

# Uses the google-genai SDK (pip install google-genai); the key is passed to the client directly
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Thank you for using our app today.",
    config={
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {
                    "voice_name": "Aoede"
                }
            }
        }
    }
)
 
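pcm_bytes = response.candidates[0].content.parts[0].inline_data.data
# The Python SDK returns this field already decoded: raw PCM bytes
# (16-bit signed, 24000 Hz, mono). Over REST the same field is a Base64 string.
print(f"Audio data length: {len(pcm_bytes)} bytes (~{len(pcm_bytes) / 48000:.2f} s)")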

Knowing the exact format — 16-bit signed, 24000 Hz, mono — is essential before you touch a single line of Swift.
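
A quick sanity check follows from that format: at 24000 frames per second and 2 bytes per frame, one second of audio is 48,000 bytes. The helper below is a minimal sketch of that arithmetic; `pcmDuration` is a hypothetical name, not part of any SDK.

```swift
import Foundation

// Format constants taken from the description above: 24000 Hz, mono, signed 16-bit PCM.
let sampleRate = 24_000.0
let bytesPerFrame = 2  // one Int16 sample per frame, because the audio is mono

/// Returns the playback duration in seconds for a raw PCM payload.
/// Hypothetical helper for illustration only.
func pcmDuration(byteCount: Int) -> TimeInterval {
    let frames = byteCount / bytesPerFrame  // a trailing odd byte is ignored
    return Double(frames) / sampleRate
}

// 48,000 bytes -> 24,000 frames -> one second of audio
print(pcmDuration(byteCount: 48_000))
```

If the duration you compute is wildly off from what you hear, the sample rate or bit depth assumed on the Swift side probably doesn't match the payload.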

Pitfall #1: AVAudioPlayer Won't Accept Raw PCM

My first instinct was to grab the Base64 string, decode it to Data, and feed it into AVAudioPlayer(data:). That's the simplest path, and it fails silently (or throws an unhelpful error).

// ❌ This does not work — AVAudioPlayer expects a container format
let audioData = Data(base64Encoded: base64String)!
let player = try AVAudioPlayer(data: audioData)  // Error: OSStatus -50
player.play()

AVAudioPlayer expects file-format containers like WAV or MP3. It doesn't know how to interpret headerless PCM.

You have two options:

Option A: Prepend a WAV header to the raw PCM data and use AVAudioPlayer
Option B: Use AVAudioEngine + AVAudioPlayerNode to play the PCM buffer directly

I went with Option B. It takes a few more lines upfront, but it gives you more control — useful when you want to add effects or mix audio later.
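
For completeness, Option A can be sketched as follows. This is a minimal example that assumes the standard 44-byte RIFF/WAVE header for linear PCM and little-endian samples; `wavData(fromPCM:)` is a hypothetical helper name, and the default parameters mirror the format described earlier.

```swift
import Foundation

/// Wraps headerless PCM in a minimal 44-byte RIFF/WAVE header.
/// Hypothetical helper; defaults match 24000 Hz, mono, 16-bit.
func wavData(fromPCM pcm: Data,
             sampleRate: UInt32 = 24_000,
             channels: UInt16 = 1,
             bitsPerSample: UInt16 = 16) -> Data {
    let blockAlign = channels * (bitsPerSample / 8)   // bytes per frame
    let byteRate = sampleRate * UInt32(blockAlign)    // bytes per second
    let dataSize = UInt32(pcm.count)

    var wav = Data()
    // WAV stores every multi-byte integer in little-endian order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { wav.append(contentsOf: $0) }
    }

    wav.append(contentsOf: Array("RIFF".utf8))
    append(36 + dataSize)          // size of everything after this field
    wav.append(contentsOf: Array("WAVE".utf8))
    wav.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))             // fmt chunk size for plain PCM
    append(UInt16(1))              // audio format 1 = linear PCM
    append(channels)
    append(sampleRate)
    append(byteRate)
    append(blockAlign)
    append(bitsPerSample)
    wav.append(contentsOf: Array("data".utf8))
    append(dataSize)
    wav.append(pcm)
    return wav
}
```

With the header in place, the result can be passed to AVAudioPlayer(data:) in place of the raw bytes.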

The Working Implementation with AVAudioEngine

import SwiftUI
import AVFoundation
 
class GeminiTTSPlayer: ObservableObject {
    private var engine = AVAudioEngine()
    private var playerNode = AVAudioPlayerNode()
 
    // Match Gemini TTS API output: 24000 Hz, mono, signed 16-bit PCM
    private let geminiFormat = AVAudioFormat(
        commonFormat: .pcmFormatInt16,
        sampleRate: 24000,
        channels: 1,
        interleaved: true
    )!
 
    init() {
        engine.attach(playerNode)
        engine.connect(playerNode, to: engine.mainMixerNode, format: geminiFormat)
    }
 
    func play(pcmData: Data) throws {
        // Configure audio session right before playback (see Pitfall #2)
        try configureAudioSession()
 
        if !engine.isRunning {
            try engine.start()
        }
 
        // Build AVAudioPCMBuffer from raw PCM bytes
        let frameCount = UInt32(pcmData.count) / 2  // 16-bit = 2 bytes per frame
        guard let buffer = AVAudioPCMBuffer(
            pcmFormat: geminiFormat,
            frameCapacity: frameCount
        ) else {
            throw TTSError.bufferCreationFailed
        }
 
        buffer.frameLength = frameCount
 
        pcmData.withUnsafeBytes { rawPtr in
            if let src = rawPtr.baseAddress?.assumingMemoryBound(to: Int16.self),
               let dst = buffer.int16ChannelData?[0] {
                dst.initialize(from: src, count: Int(frameCount))
            }
        }
 
        playerNode.scheduleBuffer(buffer, completionHandler: nil)
        playerNode.play()
    }
 
    private func configureAudioSession() throws {
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.playback, mode: .spokenAudio)
        try session.setActive(true)
    }
 
    enum TTSError: Error {
        case bufferCreationFailed
    }
}
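
One caveat worth checking on your own target: AVAudioEngine generally processes audio as Float32, and it is not confirmed here that every OS version accepts an Int16 connection format like the one above. If connecting or scheduling raises a format exception for you, a possible fallback (an assumption, not something the class above requires) is to convert the samples to Float32 first. `floatSamples(fromInt16PCM:)` is a hypothetical helper name.

```swift
import Foundation

/// Converts little-endian signed 16-bit PCM bytes to Float32 samples in -1.0..<1.0.
/// Hypothetical helper for illustration; not part of GeminiTTSPlayer.
func floatSamples(fromInt16PCM pcm: Data) -> [Float] {
    let sampleCount = pcm.count / 2  // a trailing odd byte is ignored
    var samples = [Float](repeating: 0, count: sampleCount)
    pcm.withUnsafeBytes { (raw: UnsafeRawBufferPointer) in
        for i in 0..<sampleCount {
            // Assemble each sample byte by byte so no alignment is assumed.
            let low = UInt16(raw[2 * i])
            let high = UInt16(raw[2 * i + 1])
            let value = Int16(bitPattern: (high << 8) | low)
            samples[i] = Float(value) / 32768.0
        }
    }
    return samples
}
```

The resulting array can be copied into an AVAudioPCMBuffer created with a `.pcmFormatFloat32` format through its `floatChannelData` pointer, with the player node connected using that same Float32 format.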

Pitfall #2: AVAudioSession Timing on Real Devices

If you move configureAudioSession() into init() (which feels natural for setup code), you'll get a frustrating bug: it works in the Simulator, but there's no audio on a real device.

The issue is a race between your app's audio session initialization and the system's audio routing at startup. The fix: always call setCategory and setActive(true) immediately before the playback call, not during initialization.

The rule I follow:

Call AVAudioSession.setCategory and setActive(true)
Then call AVAudioEngine.start()
Then schedule your buffer and call play()

This ordering is exactly what the code above does — configureAudioSession() runs at the top of play(pcmData:), guaranteeing the correct sequence every time.

Connecting to SwiftUI

struct ContentView: View {
    @StateObject private var ttsPlayer = GeminiTTSPlayer()
    @State private var isLoading = false
 
    var body: some View {
        VStack(spacing: 24) {
            Text("Gemini TTS Demo")
                .font(.title2)
 
            Button(action: {
                Task { await speakWithGemini("Thank you for visiting today. We hope you enjoy the experience.") }
            }) {
                Label(
                    isLoading ? "Generating..." : "Read Aloud",
                    systemImage: "speaker.wave.2"
                )
                .padding()
                .frame(maxWidth: .infinity)
                .background(Color.blue)
                .foregroundColor(.white)
                .cornerRadius(12)
            }
            .disabled(isLoading)
        }
        .padding()
    }
 
    // Keep state updates and player calls on the main actor
    @MainActor
    func speakWithGemini(_ text: String) async {
        isLoading = true
        defer { isLoading = false }
 
        do {
            let pcmData = try await GeminiTTSAPI.generateAudio(text: text)
            try ttsPlayer.play(pcmData: pcmData)
        } catch {
            print("TTS error: \(error)")
        }
    }
}

The GeminiTTSAPI.generateAudio call hits Gemini's REST endpoint via URLSession, extracts the Base64 audio string from the response JSON, decodes it, and returns the raw Data. As of this writing, the official Gemini Swift SDK doesn't have first-party TTS support, so a direct REST call is the way to go.
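
Since GeminiTTSAPI isn't shown, here is a minimal sketch of what such a helper might look like. Everything in it is an assumption to verify against the current REST documentation: the endpoint path, the `x-goog-api-key` header, and the JSON field names (`generationConfig`, `inlineData`, and so on). The `requestBody` and `decodeAudio` helpers and the `APIError` cases are hypothetical names introduced for illustration.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // only needed when compiling off Apple platforms
#endif

enum GeminiTTSAPI {
    enum APIError: Error { case badStatus(Int), noAudio }

    // Assumed endpoint; check the model name and API version before relying on it.
    static let endpoint = URL(string:
        "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent")!

    /// Builds the JSON request body asking for audio output with a prebuilt voice.
    static func requestBody(text: String, voice: String = "Aoede") throws -> Data {
        let body: [String: Any] = [
            "contents": [["parts": [["text": text]]]],
            "generationConfig": [
                "responseModalities": ["AUDIO"],
                "speechConfig": ["voiceConfig": ["prebuiltVoiceConfig": ["voiceName": voice]]]
            ]
        ]
        return try JSONSerialization.data(withJSONObject: body)
    }

    /// Pulls the Base64 audio string out of the response JSON and decodes it to raw PCM.
    static func decodeAudio(from json: Data) throws -> Data {
        struct Response: Decodable {
            struct Candidate: Decodable { let content: Content }
            struct Content: Decodable { let parts: [Part] }
            struct Part: Decodable { let inlineData: InlineData? }
            struct InlineData: Decodable { let data: String }
            let candidates: [Candidate]
        }
        let response = try JSONDecoder().decode(Response.self, from: json)
        let audioPart = response.candidates.first?.content.parts.first { $0.inlineData != nil }
        guard let b64 = audioPart?.inlineData?.data,
              let pcm = Data(base64Encoded: b64) else {
            throw APIError.noAudio
        }
        return pcm
    }

    /// Sends the request and returns raw PCM bytes. The default key is a placeholder.
    static func generateAudio(text: String, apiKey: String = "YOUR_GEMINI_API_KEY") async throws -> Data {
        var request = URLRequest(url: endpoint)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.setValue(apiKey, forHTTPHeaderField: "x-goog-api-key")
        request.httpBody = try requestBody(text: text)

        let (data, response) = try await URLSession.shared.data(for: request)
        if let http = response as? HTTPURLResponse, http.statusCode != 200 {
            throw APIError.badStatus(http.statusCode)
        }
        return try decodeAudio(from: data)
    }
}
```

Splitting body construction and decoding out of the network call keeps both testable without hitting the API.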

Honest Impressions After Using It in Production

I wired this up to read aloud the "today's featured wallpaper" description text in one of my wallpaper apps. Compared to AVSpeechSynthesizer, the voice quality difference is genuinely noticeable — especially for longer Japanese sentences, where Gemini's prosody stays natural much longer before it starts sounding robotic.

The tradeoff is API latency. For 50 characters of Japanese text, I'm seeing 500 ms to 1 second of delay before audio starts. For non-real-time use cases (reading out article summaries, descriptions, tips), that's acceptable. For anything conversational or reactive, you'd want to look at the Gemini Live API and its streaming audio approach instead.

As an indie developer, this kind of "extra effort" feature is often where the real differentiation happens. The SwiftUI integration itself is straightforward once you know the two pitfalls — everything else falls into place.

Key Takeaways

Gemini TTS API outputs raw PCM: 16-bit signed, 24000 Hz, mono — no container format
AVAudioPlayer(data:) won't accept headerless PCM: either prepend a WAV header or play the buffer through AVAudioEngine + AVAudioPlayerNode
Call AVAudioSession.setCategory and setActive(true) immediately before AVAudioEngine.start(), not in init()
Test on a real device early — Simulator behavior diverges from hardware for audio

Start with a short piece of text to verify your setup, then test across different audio scenarios: speaker, wired headphones, Bluetooth. Audio routing on iOS has edge cases that only show up on real hardware.