API / SDK · 2026-04-03 · Advanced

Gemini API × SwiftUI in Production: Streaming, Multimodal, Error Handling, and App Store Submission

A guide to integrating the Gemini API into SwiftUI apps at production quality. Covers streaming responses, multimodal input, error handling, test strategies, and App Store submission requirements.

gemini-api swift swiftui ios mobile streaming multimodal app-store

When you move beyond prototyping and start shipping a Gemini-powered iOS app to real users, the challenges multiply quickly. Streams that cut off unexpectedly, memory spikes from full-resolution images, App Store rejections over API key handling — these are the problems that separate hobbyist experiments from production-ready products. The gap between "it works on my simulator" and "it works for thousands of users on diverse networks and devices" is wide, and bridging it requires both technical depth and hard-won operational knowledge.

This guide tackles all of it. Rather than relying on the Firebase AI Logic SDK, we build directly on URLSession and Swift's async/await, giving you full control over every request and every failure mode. You will walk away with production-ready patterns for streaming, multimodal inputs, caching, testing, and App Store compliance — all backed by working code that you can drop into a real project today.

For the foundational Firebase-based approach, see our free guide on Integrating Gemini API into iOS Apps with Firebase AI Logic SDK. This article is the deep-dive that comes next, building on the concepts there and pushing them toward production quality.

Why Skip Firebase? The Case for Direct URLSession Integration

Firebase AI Logic SDK is an excellent starting point. It handles authentication, SDK initialization, and provides a clean Swift interface over the Gemini API. For apps already invested in Firebase's ecosystem, it makes perfect sense.

But many iOS apps do not need Firebase. Adding it introduces a substantial dependency graph — multiple frameworks, a Firebase project to maintain, Google Analytics initialization in your app delegate, and roughly 20MB added to your binary. For apps where Gemini API is the only Google service in use, that overhead is hard to justify.

The direct URLSession approach has a very different profile. Your only dependency is the iOS SDK itself. You have complete visibility into every HTTP request and response. You can tune headers, timeouts, and retry behavior to exactly match your needs. And you eliminate an entire layer of abstraction that could obscure the source of bugs in production.

The tradeoff is that you write more infrastructure code upfront. This guide gives you that infrastructure, polished and ready to adapt.

Environment Setup: Safe API Key Management

The first and most important architectural decision is how to store your API key. This choice has direct implications for App Store approval, security, and long-term maintainability.

Never store your API key in Info.plist. Info.plist ships inside the app bundle as a readable property list, so anyone with a copy of the bundle can open it and read the value; no jailbreak is required. Automated checks during App Store review may also flag embedded credentials. Obfuscating the value only slows an attacker down and is not a reliable defense.

The Keychain is the correct home for sensitive credentials on iOS. It stores values in hardware-encrypted storage, isolated per app, and protected by the device's secure enclave where available.

// APIKeychain.swift — Keychain-based API key storage
import Foundation
import Security

final class APIKeychain {
    static let shared = APIKeychain()
    private let service = "net.gemilab.gemini-api-key"

    func save(key: String) throws {
        // Identify the item by class and service only. kSecValueData is the
        // value to store, not a search attribute, so it stays out of the
        // delete query to make sure the existing item is matched.
        let baseQuery: [String: Any] = [
            kSecClass as String: kSecClassGenericPassword,
            kSecAttrService as String: service
        ]
        // Delete before insert to avoid duplicate item errors
        SecItemDelete(baseQuery as CFDictionary)

        var addQuery = baseQuery
        addQuery[kSecValueData as String] = Data(key.utf8)
        let status = SecItemAdd(addQuery as CFDictionary, nil)
        guard status == errSecSuccess else {
            throw KeychainError.saveFailed(status)
        }
    }
 
    func load() throws -> String {
        let query: [String: Any] = [
            kSecClass as String: kSecClassGenericPassword,
            kSecAttrService as String: service,
            kSecMatchLimit as String: kSecMatchLimitOne,
            kSecReturnData as String: true
        ]
        var result: AnyObject?
        let status = SecItemCopyMatching(query as CFDictionary, &result)
        guard status == errSecSuccess,
              let data = result as? Data,
              let key = String(data: data, encoding: .utf8) else {
            throw KeychainError.loadFailed(status)
        }
        return key
    }
 
    enum KeychainError: Error {
        case saveFailed(OSStatus)
        case loadFailed(OSStatus)
    }
}

In practice, for widely distributed apps, even Keychain storage has limits. A determined attacker with a jailbroken device and physical access can extract Keychain contents. For production apps serving many users, the stronger solution is a backend proxy: your app authenticates to your own server, and only the server holds the Gemini API key. The client never sees it. We will cover this pattern in the App Store section.
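As a rough illustration of the client side of that proxy pattern, the sketch below builds a request to your own server instead of to Google. Everything specific here is an assumption made for the example: the `api.example.com` host, the `v1/generate` path, the bearer session token, and the `ProxyGenerateRequest` body shape are hypothetical and would be defined by your backend.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Hypothetical request body understood by your own backend, not by Google.
struct ProxyGenerateRequest: Codable {
    let prompt: String
}

/// Builds a request to a hypothetical proxy endpoint. The Gemini API key never
/// appears on the device; the server attaches it before forwarding to Google.
func makeProxyRequest(prompt: String,
                      sessionToken: String,
                      baseURL: URL = URL(string: "https://api.example.com")!) throws -> URLRequest {
    var request = URLRequest(url: baseURL.appendingPathComponent("v1/generate"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    // The session token identifies the user to your server; it is not a Google credential.
    request.setValue("Bearer \(sessionToken)", forHTTPHeaderField: "Authorization")
    request.httpBody = try JSONEncoder().encode(ProxyGenerateRequest(prompt: prompt))
    return request
}
```

Because the device only ever holds a per-user session token, a leaked token can be revoked for one user without rotating the Gemini key for everyone.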

Streaming Responses: AsyncStream × SwiftUI

The Gemini API streaming endpoint returns responses as Server-Sent Events (SSE) — a sequence of data: prefixed JSON fragments that arrive incrementally over a long-lived HTTP connection. Presenting this in real time in a SwiftUI interface requires coordinating the network layer, a background task, and the main thread in a way that is both efficient and safe.
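To make the wire format concrete, here is a minimal sketch of decoding a single SSE line with Codable. The `StreamChunk` type and the `textFragment(fromSSELine:)` helper are names invented for this example, and the struct models only the `candidates → content → parts → text` path that this guide reads. Unlike the client shown later, it joins the text of every part rather than taking only the first.

```swift
import Foundation

/// Only the fields this guide reads; everything else in a chunk is ignored.
struct StreamChunk: Decodable {
    struct Candidate: Decodable {
        struct Content: Decodable {
            struct Part: Decodable { let text: String? }
            let parts: [Part]?
        }
        let content: Content?
    }
    let candidates: [Candidate]?
}

/// Returns the text carried by one SSE line, or nil for blank lines,
/// comment lines, and payloads that carry no text.
func textFragment(fromSSELine line: String) -> String? {
    guard line.hasPrefix("data:") else { return nil }
    let payload = line.dropFirst(5).trimmingCharacters(in: .whitespaces)
    guard let data = payload.data(using: .utf8),
          let chunk = try? JSONDecoder().decode(StreamChunk.self, from: data) else {
        return nil
    }
    let parts = chunk.candidates?.first?.content?.parts ?? []
    let text = parts.compactMap(\.text).joined()
    return text.isEmpty ? nil : text
}
```

Keeping the parsing in a pure function like this makes it easy to unit test without a network connection.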

Swift's AsyncStream and actor model are the right tools for this. The actor keyword provides serialized access to shared mutable state without manual locking, preventing the subtle race conditions that plague multi-request streaming implementations.

// GeminiStreamingClient.swift
import Foundation

actor GeminiStreamingClient {
    private let baseURL = "https://generativelanguage.googleapis.com/v1beta/models"
    // Preview model IDs are retired over time; check the current model list before shipping
    private let model = "gemini-2.5-flash-preview-04-17"

    func stream(prompt: String) -> AsyncStream<String> {
        AsyncStream { continuation in
            let task = Task {
                do {
                    let apiKey = try APIKeychain.shared.load()
                    let url = URL(string: "\(baseURL)/\(model):streamGenerateContent?alt=sse")!
                    var request = URLRequest(url: url)
                    request.httpMethod = "POST"
                    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
                    // Send the key as a header so it stays out of URLs that proxies and logs may record
                    request.setValue(apiKey, forHTTPHeaderField: "x-goog-api-key")
                    // Idle timeout: the 30-second clock restarts whenever data arrives
                    request.timeoutInterval = 30

                    let body: [String: Any] = [
                        "contents": [["parts": [["text": prompt]]]],
                        "generationConfig": [
                            "temperature": 0.7,
                            "maxOutputTokens": 2048
                        ]
                    ]
                    request.httpBody = try JSONSerialization.data(withJSONObject: body)

                    let (asyncBytes, response) = try await URLSession.shared.bytes(for: request)
                    guard let httpResponse = response as? HTTPURLResponse,
                          httpResponse.statusCode == 200 else {
                        throw GeminiError.invalidResponse
                    }

                    for try await line in asyncBytes.lines {
                        guard line.hasPrefix("data: ") else { continue }
                        let jsonStr = String(line.dropFirst(6))
                        // The [DONE] check is defensive; lines that fail to parse are skipped
                        guard jsonStr != "[DONE]",
                              let data = jsonStr.data(using: .utf8),
                              let json = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
                              let candidates = json["candidates"] as? [[String: Any]],
                              let content = candidates.first?["content"] as? [String: Any],
                              let parts = content["parts"] as? [[String: Any]],
                              let text = parts.first?["text"] as? String else { continue }
                        continuation.yield(text)
                    }
                    continuation.finish()
                } catch {
                    // Failures end the stream without an error value;
                    // the consumer only sees the stream finish early
                    continuation.finish()
                }
            }
            // Stop the network request when the consumer stops iterating
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}
 
// ChatViewModel.swift — @MainActor ensures all @Published updates occur on the main thread
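import SwiftUI

@MainActor
final class ChatViewModel: ObservableObject {
    @Published var messages: [Message] = []
    @Published var currentStreamText = ""
    @Published var isStreaming = false
    @Published var errorMessage: String?

    private let client = GeminiStreamingClient()

    func send(prompt: String) async {
        // Ignore taps while a response is still streaming so chunks from two
        // requests cannot interleave in currentStreamText
        guard !isStreaming else { return }
        isStreaming = true
        errorMessage = nil
        currentStreamText = ""
        messages.append(Message(role: .user, text: prompt))

        for await chunk in await client.stream(prompt: prompt) {
            currentStreamText += chunk
        }

        if currentStreamText.isEmpty {
            // The client ends its stream silently on failure, so an empty
            // result is the only failure signal available here
            errorMessage = "No response received. Please try again."
        } else {
            messages.append(Message(role: .assistant, text: currentStreamText))
        }
        currentStreamText = ""
        isStreaming = false
    }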
}

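One subtlety worth highlighting: the actor on GeminiStreamingClient protects its internal state from data races, but it does not queue requests. Swift actors are reentrant, and stream(prompt:) returns its AsyncStream immediately, so two rapid taps on "send" start two network requests that run concurrently. Each request yields into its own stream, so SSE fragments never mix at the client level. The risk sits in the view model, where two concurrent send calls would both append to currentStreamText. Guard against that there, for example by ignoring send while isStreaming is true or by cancelling the previous task before starting a new one.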
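The client above ends its stream silently when a request fails, so the view model cannot tell an error from an empty answer. One way to surface failures is AsyncThrowingStream, sketched below. The `StreamFailure` type and the `chunks(statusCode:lines:)` and `collect(_:)` helpers are hypothetical; the status code and line array stand in for the URLSession response so that the example runs without a network.

```swift
import Foundation

enum StreamFailure: Error, Equatable {
    case http(Int)
}

/// Stand-in for the network loop: fails for non-200 statuses, otherwise yields each line.
func chunks(statusCode: Int, lines: [String]) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        let task = Task {
            guard statusCode == 200 else {
                // Forward the failure instead of finishing silently.
                continuation.finish(throwing: StreamFailure.http(statusCode))
                return
            }
            for line in lines {
                if Task.isCancelled { break }
                continuation.yield(line)
            }
            continuation.finish()
        }
        // Cancel the producer when the consumer stops iterating.
        continuation.onTermination = { _ in task.cancel() }
    }
}

/// Consumer side: `for try await` lets the caller see both the text and the error.
func collect(_ stream: AsyncThrowingStream<String, Error>) async -> (text: String, error: Error?) {
    var text = ""
    do {
        for try await chunk in stream { text += chunk }
        return (text, nil)
    } catch {
        return (text, error)
    }
}
```

In a view model, the catch branch is where errorMessage would be set, and any partial text received before the failure can still be shown or discarded as the product requires.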

For more background on streaming architecture and multi-turn conversations, see Streaming Responses and Multi-Turn Conversations with the Gemini API.

Multimodal Input: Camera and Photo Library Integration

One of Gemini's most powerful capabilities is understanding images alongside text. From a product perspective, this unlocks entire categories of features: scan a receipt and summarize expenses, photograph a plant and identify it, take a screenshot of an error and get debugging advice. From an engineering perspective, integrating this in iOS requires careful attention to memory management.

Modern iPhones produce photos with resolutions up to 48MP — files that can exceed 15MB. Sending this as a base64-encoded string inline in a JSON request body would create a payload approaching 20MB, drastically slow network transmission, and risk out-of-memory conditions on lower-end devices. The solution is mandatory resizing before encoding.
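The payload figures follow from how base64 works: every 3 raw bytes become 4 encoded characters, so the encoded size is about one third larger than the file. A small helper (the name `base64EncodedSize(ofByteCount:)` is made up for this example) makes the arithmetic explicit:

```swift
import Foundation

/// Size in bytes of the padded base64 encoding of `byteCount` raw bytes.
func base64EncodedSize(ofByteCount byteCount: Int) -> Int {
    // Each group of up to 3 input bytes becomes exactly 4 output characters.
    ((byteCount + 2) / 3) * 4
}

let fullResolution = base64EncodedSize(ofByteCount: 15_000_000) // 20_000_000
let resized = base64EncodedSize(ofByteCount: 180_000)           // 240_000
```

A 15 MB photo therefore becomes a 20 MB JSON string before any other request fields are added, while a 180 KB resized JPEG stays around 240 KB.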

// MultimodalGeminiClient.swift
import SwiftUI
import UIKit
import PhotosUI

actor MultimodalGeminiClient {
    private let baseURL = "https://generativelanguage.googleapis.com/v1beta/models"
    // Preview model IDs are retired over time; check the current model list before shipping
    private let model = "gemini-2.5-flash-preview-04-17"

    /// Resize to 1024px max dimension before encoding.
    /// Reduces a 48MP image payload from ~18MB to ~180KB (99% reduction).
    private func prepareImage(_ image: UIImage, maxDimension: CGFloat = 1024) -> String? {
        guard image.size.width > 0, image.size.height > 0 else { return nil }
        let scale = min(maxDimension / image.size.width, maxDimension / image.size.height, 1.0)
        let newSize = CGSize(width: image.size.width * scale, height: image.size.height * scale)

        // UIGraphicsImageRenderer replaces the deprecated UIGraphicsBeginImageContext APIs.
        // scale = 1 renders exactly newSize pixels instead of multiplying by the screen scale.
        let format = UIGraphicsImageRendererFormat()
        format.scale = 1
        let renderer = UIGraphicsImageRenderer(size: newSize, format: format)
        let jpegData = renderer.jpegData(withCompressionQuality: 0.8) { _ in
            image.draw(in: CGRect(origin: .zero, size: newSize))
        }
        return jpegData.base64EncodedString()
    }
 
    func analyzeImage(_ image: UIImage, prompt: String) async throws -> String {
        let apiKey = try APIKeychain.shared.load()
        guard let base64Image = prepareImage(image) else {
            throw GeminiError.imageProcessingFailed
        }
 
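        let url = URL(string: "\(baseURL)/\(model):generateContent")!
        var request = URLRequest(url: url)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        // Send the key as a header so it stays out of URLs that proxies and logs may record
        request.setValue(apiKey, forHTTPHeaderField: "x-goog-api-key")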
 
        let body: [String: Any] = [
            "contents": [[
                "parts": [
                    ["inline_data": ["mime_type": "image/jpeg", "data": base64Image]],
                    ["text": prompt]
                ]
            ]],
            "generationConfig": ["maxOutputTokens": 1024]
        ]
        request.httpBody = try JSONSerialization.data(withJSONObject: body)
 
        let (data, response) = try await URLSession.shared.data(for: request)
        guard let httpResponse = response as? HTTPURLResponse,
              httpResponse.statusCode == 200 else {
            throw GeminiError.invalidResponse
        }
 
        guard let json = try JSONSerialization.jsonObject(with: data) as? [String: Any],
              let candidates = json["candidates"] as? [[String: Any]],
              let content = candidates.first?["content"] as? [String: Any],
              let parts = content["parts"] as? [[String: Any]],
              let text = parts.first?["text"] as? String else {
            throw GeminiError.parsingFailed
        }
        return text
    }
}
 
// SwiftUI view with PhotosPicker
struct ImageAnalysisView: View {
    @State private var selectedItem: PhotosPickerItem?
    @State private var selectedImage: UIImage?
    @State private var analysisResult = ""
    @State private var isAnalyzing = false
    @State private var promptText = "Describe this image in detail."
 
    private let client = MultimodalGeminiClient()
 
    var body: some View {
        ScrollView {
            VStack(spacing: 20) {
                PhotosPicker(selection: $selectedItem, matching: .images) {
                    Group {
                        if let image = selectedImage {
                            Image(uiImage: image)
                                .resizable().scaledToFit()
                                .frame(maxHeight: 300)
                                .clipShape(RoundedRectangle(cornerRadius: 12))
                        } else {
                            RoundedRectangle(cornerRadius: 12)
                                .fill(Color(.systemGray5)).frame(height: 200)
                                .overlay(Image(systemName: "photo.badge.plus").font(.largeTitle))
                        }
                    }
                }
                .onChange(of: selectedItem) { loadImage() }
 
                TextField("Analysis prompt", text: $promptText, axis: .vertical)
                    .textFieldStyle(.roundedBorder).lineLimit(3)
 
                Button {
                    Task { await analyze() }
                } label: {
                    Label(isAnalyzing ? "Analyzing…" : "Analyze with AI",
                          systemImage: "sparkles")
                        .frame(maxWidth: .infinity).padding()
                        .background(Color.blue).foregroundColor(.white)
                        .clipShape(RoundedRectangle(cornerRadius: 12))
                }
                .disabled(selectedImage == nil || isAnalyzing)
 
                if !analysisResult.isEmpty {
                    Text(analysisResult).padding()
                        .background(Color(.systemGray6))
                        .clipShape(RoundedRectangle(cornerRadius: 12))
                }
            }
            .padding()
        }
    }
 
    private func loadImage() {
        Task {
            if let data = try? await selectedItem?.loadTransferable(type: Data.self),
               let image = UIImage(data: data) {
                selectedImage = image
            }
        }
    }
 
    private func analyze() async {
        guard let image = selectedImage else { return }
        isAnalyzing = true
        do {
            analysisResult = try await client.analyzeImage(image, prompt: promptText)
        } catch {
            analysisResult = "Error: \(error.localizedDescription)"
        }
        isAnalyzing = false
    }
}

The compressionQuality: 0.8 value is a calibrated choice. Values below 0.7 introduce visible artifacts in photos with fine detail — not ideal if users might question AI responses based on image quality. Values above 0.85 provide diminishing fidelity returns while noticeably increasing file size. For analytical tasks (document scanning, object recognition), 0.8 is the sweet spot.

Error Handling and Exponential Backoff

Rate limits (HTTP 429) and transient server errors (HTTP 503) are facts of life when calling any external API at scale. Users should never see these as failures — they should see a brief delay followed by a successful response, thanks to automatic retry logic built into your networking layer.

The key insight in designing a retry system is that naive retries can make the problem worse. If a thousand devices all encounter a rate limit and all retry at the same instant, you create a burst that is as bad as the original one. The solution is exponential backoff with jitter: delays that grow exponentially with each retry, plus a random offset that staggers the retries across the population of affected devices.

// RetryConfig.swift
import Foundation

struct RetryConfig {
    var maxAttempts: Int = 3
    var initialDelay: TimeInterval = 1.0
    var multiplier: Double = 2.0
    var maxDelay: TimeInterval = 30.0
    var jitterFactor: Double = 0.1 // ±10% randomness
 
    func delay(for attempt: Int) -> TimeInterval {
        let exponential = initialDelay * pow(multiplier, Double(attempt - 1))
        let capped = min(exponential, maxDelay)
        let jitter = capped * jitterFactor * Double.random(in: -1...1)
        return capped + jitter
    }
    // Attempt 1: 0.9–1.1s, Attempt 2: 1.8–2.2s, Attempt 3: 3.6–4.4s
}
 
// GeminiError.swift
enum GeminiError: LocalizedError {
    case rateLimitExceeded
    case serverError(Int)
    case invalidResponse
    case imageProcessingFailed
    case parsingFailed
    case apiKeyMissing
 
    var errorDescription: String? {
        switch self {
        case .rateLimitExceeded: return "Too many requests. Please wait a moment."
        case .serverError(let code): return "Server error (\(code)). Please try again."
        case .invalidResponse: return "Received an invalid response from the AI."
        case .imageProcessingFailed: return "Failed to process the image."
        case .parsingFailed: return "Failed to parse the AI response."
        case .apiKeyMissing: return "API key is not configured."
        }
    }
 
    // Only network-side errors are worth retrying; logic errors are not
    var isRetryable: Bool {
        switch self {
        case .rateLimitExceeded, .serverError: return true
        default: return false
        }
    }
}
 
// Generic retry wrapper usable with any async throwing operation
func withRetry<T>(
    config: RetryConfig = RetryConfig(),
    operation: () async throws -> T
) async throws -> T {
    var lastError: Error?
    for attempt in 1...config.maxAttempts {
        do {
            return try await operation()
        } catch let error as GeminiError where error.isRetryable {
            lastError = error
            if attempt < config.maxAttempts {
                let delay = config.delay(for: attempt)
                try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
            }
        } catch {
            throw error // Non-retryable errors propagate immediately
        }
    }
    throw lastError ?? GeminiError.invalidResponse
}
 
// Usage in a service layer
func generateWithRetry(prompt: String) async throws -> String {
    try await withRetry {
        try await geminiClient.generate(prompt: prompt)
    }
}

Notice that non-retryable errors — like a malformed response or a missing API key — are re-thrown immediately without retry. Retrying logic errors wastes time and frustrates users. The distinction between transient failures (worth retrying) and permanent failures (not worth retrying) is fundamental to resilient API client design.
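For the retry wrapper to do anything, the networking code has to translate HTTP status codes into retryable and non-retryable outcomes; the clients shown earlier throw invalidResponse for every non-200 status, which is never retried. A minimal sketch of that classification follows. The `RetryDecision` type and `retryDecision(statusCode:retryAfterHeader:)` function are hypothetical names, treating 500, 502, and 504 as retryable is an assumption beyond the 429 and 503 cases discussed above, and whether the server sends a Retry-After header at all should be verified against real responses.

```swift
import Foundation

/// Hypothetical classification of an HTTP status for retry purposes.
enum RetryDecision: Equatable {
    case success
    case retry(after: TimeInterval?) // nil means "fall back to the backoff schedule"
    case fail
}

func retryDecision(statusCode: Int, retryAfterHeader: String?) -> RetryDecision {
    switch statusCode {
    case 200..<300:
        return .success
    case 429, 500, 502, 503, 504:
        // Retry-After may be absent or an HTTP date; only the seconds form is parsed here.
        let seconds = retryAfterHeader.flatMap {
            TimeInterval($0.trimmingCharacters(in: .whitespaces))
        }
        return .retry(after: seconds)
    default:
        // 4xx errors such as 400 or 403 will not succeed on a second attempt.
        return .fail
    }
}
```

A client could map `.retry` onto rateLimitExceeded or serverError so that isRetryable becomes reachable, and prefer the server-provided delay over the computed backoff when one is present.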

Local Caching and Offline Resilience

Even with reliable network connectivity and excellent retry logic, API calls take time and cost money. A caching layer that returns stored responses for repeated prompts delivers two benefits simultaneously: faster responses for users and lower API costs for you. For certain use cases — think FAQ bots, reference apps, or content with low freshness requirements — caching can reduce API call volume by 50% or more.

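NSCache is a good fit here because it responds to memory pressure automatically. Unlike a plain dictionary, NSCache may evict entries when the system runs low on memory or when countLimit is exceeded. The eviction order is an implementation detail, so do not rely on strict least-recently-used behavior. NSCache is also thread-safe. You get caching behavior without the risk of runaway memory growth that could get the app terminated for excessive memory use.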

// GeminiResponseCache.swift
import CryptoKit
 
final class GeminiResponseCache {
    private let cache = NSCache<NSString, CacheEntry>()
    private let ttl: TimeInterval = 3600 // 1-hour TTL
 
    init(countLimit: Int = 100) {
        cache.countLimit = countLimit
    }
 
    /// SHA-256 the prompt to produce a fixed-length cache key
    private func cacheKey(for prompt: String) -> NSString {
        let hash = SHA256.hash(data: Data(prompt.utf8))
        return hash.compactMap { String(format: "%02x", $0) }.joined() as NSString
    }
 
    func get(prompt: String) -> String? {
        let key = cacheKey(for: prompt)
        guard let entry = cache.object(forKey: key),
              Date().timeIntervalSince(entry.timestamp) < ttl else { return nil }
        return entry.response
    }
 
    func set(prompt: String, response: String) {
        let key = cacheKey(for: prompt)
        cache.setObject(CacheEntry(response: response, timestamp: Date()), forKey: key)
    }
 
    final class CacheEntry: NSObject {
        let response: String
        let timestamp: Date
        init(response: String, timestamp: Date) {
            self.response = response; self.timestamp = timestamp
        }
    }
}
 
// CachedGeminiClient.swift — wraps streaming client with transparent caching
actor CachedGeminiClient {
    private let client: GeminiStreamingClient
    private let cache = GeminiResponseCache()
 
    init(client: GeminiStreamingClient) { self.client = client }
 
    func generate(prompt: String) -> AsyncStream<String> {
        if let cached = cache.get(prompt: prompt) {
            // Cache hit: return the stored response as a single-chunk stream
            return AsyncStream { continuation in
                continuation.yield(cached)
                continuation.finish()
            }
        }
 
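        // Cache miss: stream from the API and accumulate for storage
        return AsyncStream { continuation in
            Task {
                var accumulated = ""
                for await chunk in await client.stream(prompt: prompt) {
                    accumulated += chunk
                    continuation.yield(chunk)
                }
                // Skip empty results: the underlying stream also ends early on
                // failure, and caching nothing would pin that failure for the full TTL
                if !accumulated.isEmpty {
                    await cache.set(prompt: prompt, response: accumulated)
                }
                continuation.finish()
            }
        }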
    }
}

The SHA-256 hashing is worth a brief explanation. Prompt text can be arbitrarily long — a user might paste in a thousand-word document. Using the raw text as a dictionary key would consume significant memory. A SHA-256 digest is always 64 hex characters regardless of input length, making it a compact, deterministic key. Collision probability at the scale of a typical app's cache is negligible.
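One limitation of hashing the prompt alone is that the same prompt sent with a different model or temperature maps to the same entry. If your app varies those settings, a sketch like the following folds them into the key. The function name and the choice of `model` and `temperature` as the varying inputs are assumptions for illustration; include whichever request settings actually differ in your app.

```swift
import Foundation
import CryptoKit

/// Derives a cache key from everything that influences the response.
func cacheKey(prompt: String, model: String, temperature: Double) -> String {
    // A NUL separator keeps adjacent fields from running together,
    // so ("ab", "c") and ("a", "bc") cannot produce the same input.
    let material = "\(model)\u{0}\(temperature)\u{0}\(prompt)"
    return SHA256.hash(data: Data(material.utf8))
        .map { String(format: "%02x", $0) }
        .joined()
}
```

The key is still a fixed 64 hex characters, so the memory argument above is unchanged.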

Testing Strategy: MockURLProtocol for AI Features

AI applications present a testing challenge: the actual model responses are non-deterministic and expensive to generate. But the networking layer that communicates with the model is entirely deterministic and testable. MockURLProtocol is the standard iOS technique for intercepting URLSession requests in tests and substituting controlled responses.

The mechanism works by registering a custom URLProtocol subclass with a URLSessionConfiguration. Any URLSession built from that configuration will route all requests through your mock protocol instead of the real network stack. Tests run at full speed, offline, with no API costs.

// MockURLProtocol.swift — add to your test target only
final class MockURLProtocol: URLProtocol {
    static var mockData: Data?
    static var mockError: Error?
    static var statusCode = 200
    static var responseDelay: TimeInterval = 0 // simulate latency if needed
 
    override class func canInit(with request: URLRequest) -> Bool { true }
    override class func canonicalRequest(for request: URLRequest) -> URLRequest { request }
 
    override func startLoading() {
        if let error = MockURLProtocol.mockError {
            client?.urlProtocol(self, didFailWithError: error)
            return
        }
        let response = HTTPURLResponse(
            url: request.url!,
            statusCode: MockURLProtocol.statusCode,
            httpVersion: nil,
            headerFields: ["Content-Type": "text/event-stream"]
        )!
        client?.urlProtocol(self, didReceive: response, cacheStoragePolicy: .notAllowed)
        if let data = MockURLProtocol.mockData {
            client?.urlProtocol(self, didLoad: data)
        }
        client?.urlProtocolDidFinishLoading(self)
    }
 
    override func stopLoading() {}
}
 
// ChatViewModelTests.swift
import XCTest
 
@MainActor
final class ChatViewModelTests: XCTestCase {
    var sut: ChatViewModel!
 
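    override func setUp() {
        super.setUp()
        let config = URLSessionConfiguration.ephemeral
        config.protocolClasses = [MockURLProtocol.self]
        // Assumes ChatViewModel (and the client it owns) accept an injected URLSession;
        // the versions shown earlier use URLSession.shared and need that initializer added
        sut = ChatViewModel(session: URLSession(configuration: config))
        // Reset all static mock state so one test cannot leak into the next
        MockURLProtocol.mockData = nil
        MockURLProtocol.mockError = nil
        MockURLProtocol.statusCode = 200
    }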
 
    func testStreamingResponseAccumulates() async throws {
        // Two SSE chunks followed by the [DONE] marker
        MockURLProtocol.mockData = """
        data: {"candidates":[{"content":{"parts":[{"text":"Hello"}]}}]}
 
        data: {"candidates":[{"content":{"parts":[{"text":" world!"}]}}]}
 
        data: [DONE]
        """.data(using: .utf8)
 
        await sut.send(prompt: "Say hello")
 
        XCTAssertEqual(sut.messages.last?.text, "Hello world!")
        XCTAssertFalse(sut.isStreaming)
        XCTAssertNil(sut.errorMessage)
    }
 
    func testRateLimitExposesUserFacingError() async throws {
        MockURLProtocol.statusCode = 429
 
        await sut.send(prompt: "Test rate limit")
 
        XCTAssertNotNil(sut.errorMessage)
        XCTAssertTrue(sut.messages.isEmpty || sut.messages.last?.role == .user)
    }
 
    func testNetworkFailureHandledGracefully() async throws {
        MockURLProtocol.mockError = URLError(.notConnectedToInternet)
 
        await sut.send(prompt: "Test network failure")
 
        XCTAssertNotNil(sut.errorMessage)
        XCTAssertFalse(sut.isStreaming)
    }
 
    func testEmptyResponseDoesNotAddAssistantMessage() async throws {
        MockURLProtocol.mockData = "data: [DONE]\n".data(using: .utf8)
 
        let initialCount = sut.messages.count
        await sut.send(prompt: "Empty response")
 
        // Only the user message should be added, not an empty assistant message
        XCTAssertEqual(sut.messages.filter { $0.role == .assistant }.count,
                       initialCount)
    }
}

The four test cases above form a minimal but meaningful test suite: the happy path, the rate limit path, network failure, and the edge case of an empty response. Together, they will catch the majority of real-world regressions if you run them on every PR. Pair them with UI tests that exercise the full SwiftUI stack for maximum confidence before releasing.

App Store Submission: Privacy Manifest and AI Disclosure

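Apple introduced the Privacy Manifest (PrivacyInfo.xcprivacy) alongside iOS 17, and since May 1, 2024 App Store Connect has enforced it for apps that use required-reason APIs or bundle SDKs on Apple's list. Even if your app falls outside those cases, declaring the data you collect in a manifest keeps it consistent with your App Store privacy labels. Missing this file is a common reason for App Store rejection in 2026, particularly for newly submitted apps. Creating it takes about ten minutes and eliminates this rejection vector.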

Creating PrivacyInfo.xcprivacy:

In Xcode, select File → New → File, search for "Privacy," and choose the App Privacy template (listed under Resource). This creates a PrivacyInfo.xcprivacy file you can edit in Xcode's property list editor. For a Gemini API integration that processes user text and images, the minimum required declaration covers user-generated content.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>NSPrivacyCollectedDataTypes</key>
    <array>
        <dict>
            <key>NSPrivacyCollectedDataType</key>
            <string>NSPrivacyCollectedDataTypeOtherUserContent</string>
            <key>NSPrivacyCollectedDataTypeLinked</key>
            <false/>
            <key>NSPrivacyCollectedDataTypeTracking</key>
            <false/>
            <key>NSPrivacyCollectedDataTypePurposes</key>
            <array>
                <string>NSPrivacyCollectedDataTypePurposeAppFunctionality</string>
            </array>
        </dict>
    </array>
    <key>NSPrivacyAccessedAPITypes</key>
    <array/>
</dict>
</plist>

Beyond the Privacy Manifest, App Store review of AI-powered apps has grown more thorough over the past year. Reviewers specifically look for the following, and missing any of them can delay approval by a review cycle or more.

What reviewers check:

A visible disclaimer that AI-generated responses may contain inaccuracies is now effectively required for chat or information-providing features. This does not need to be prominent — a single sentence in a "Help" or "About" screen satisfies the requirement. What you cannot do is implicitly or explicitly guarantee factual accuracy of model outputs.

Content filtering is mandatory for apps targeting minors. If your app is listed in the Kids category, you must filter AI outputs through an additional safety layer. Google's safety settings API parameters make this straightforward to implement, but the filtering must be demonstrably active, not just configured.

Your privacy policy must explicitly state that user input (text, images) is transmitted to Google's servers for AI processing. Vague language about "third-party services" is no longer sufficient — reviewers will check.

The API key hardcoding issue remains one of the most common rejection causes. Even developers who know better sometimes leave a key in a commented-out code block or a debug configuration. Run grep -r "AIzaSy" . in your project directory before every submission.

For apps distributed publicly at scale, consider a backend proxy architecture where the API key lives exclusively on your server. See our guide on Production Security Design for the Gemini API for full implementation guidance.
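For the content-filtering requirement mentioned above, the REST API accepts a safetySettings array alongside contents. The sketch below builds such a body; the `makeRequestBody(prompt:threshold:)` helper is a hypothetical name, and the choice of BLOCK_LOW_AND_ABOVE for every category is an assumption suited to a strict audience rather than a recommendation from Apple or Google.

```swift
import Foundation

/// Builds a generateContent request body with explicit safety thresholds.
func makeRequestBody(prompt: String, threshold: String = "BLOCK_LOW_AND_ABOVE") throws -> Data {
    let categories = [
        "HARM_CATEGORY_HARASSMENT",
        "HARM_CATEGORY_HATE_SPEECH",
        "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "HARM_CATEGORY_DANGEROUS_CONTENT"
    ]
    let body: [String: Any] = [
        "contents": [["parts": [["text": prompt]]]],
        // One entry per category; the same threshold is applied to all of them here.
        "safetySettings": categories.map { ["category": $0, "threshold": threshold] }
    ]
    return try JSONSerialization.data(withJSONObject: body)
}
```

To show reviewers that filtering is active rather than merely configured, the app also needs to handle blocked responses, for example by checking the finishReason of each candidate and showing a neutral message when a response was stopped for safety reasons.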

Performance Optimization and Token Management

Gemini API pricing is token-based. Understanding how tokens map to your content — and building your app to minimize unnecessary token consumption — can meaningfully reduce your operating costs as your user base grows.

The Gemini API does expose a countTokens method that returns exact counts, but each call costs a network round trip, which is too slow for live UI feedback. For budget display and context management, a client-side estimate is accurate enough:

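// TokenEstimator.swift: approximate token counting for UI feedback
import CoreGraphics

struct TokenEstimator {
    /// Heuristic: CJK characters (Japanese, Chinese, Korean) cost roughly 2 tokens each.
    /// Latin text averages roughly 4 characters per token.
    static func estimate(text: String) -> Int {
        // Count Unicode scalars on both sides. Mixing scalars with text.count
        // (grapheme clusters) can drive `other` negative on emoji-heavy input.
        let scalars = text.unicodeScalars
        let cjk = scalars.filter { $0.value > 0x3000 }.count
        let other = scalars.count - cjk
        return cjk * 2 + (other + 3) / 4  // round up so short strings are not zero
    }
 
    /// Image cost, approximated from the documented tiling rule: 258 tokens when
    /// both sides are at most 384 px, otherwise 258 tokens per 768×768 tile.
    /// The service crops and scales before tiling, so real counts can differ.
    static func estimate(imageSize: CGSize) -> Int {
        if imageSize.width <= 384 && imageSize.height <= 384 { return 258 }
        let tilesW = (imageSize.width / 768).rounded(.up)
        let tilesH = (imageSize.height / 768).rounded(.up)
        return Int(tilesW * tilesH) * 258
    }
}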
 
// ContextManager.swift: compress long conversations to stay under the token budget
actor ContextManager {
    private var messages: [Message] = []
    private let maxTokens = 30_000
    private let client: GeminiStreamingClient
    private var isCompressing = false  // guards against overlapping compressions
 
    init(client: GeminiStreamingClient) { self.client = client }
 
    func addMessage(_ message: Message) async {
        messages.append(message)
        let estimated = messages.reduce(0) { $0 + TokenEstimator.estimate(text: $1.text) }
        if estimated > maxTokens { await compress() }
    }
 
    private func compress() async {
        guard !isCompressing, messages.count > 4 else { return }
        isCompressing = true
        defer { isCompressing = false }
 
        let toSummarize = Array(messages.dropLast(4))
        let summaryPrompt = "Summarize this conversation in 3 concise sentences:\n" +
            toSummarize.map { "\($0.role): \($0.text)" }.joined(separator: "\n")
 
        // Actors are reentrant: messages can be appended while this call is suspended.
        // Drop only the prefix that was summarized so nothing added meanwhile is lost.
        if let summary = try? await client.generate(prompt: summaryPrompt) {
            messages = [Message(role: .system,
                                text: "Prior conversation summary:\n\(summary)")] +
                       messages.dropFirst(toSummarize.count)
        }
    }
}

Beyond code-level optimizations, a few architectural choices have outsized impact on token costs. Defaulting to gemini-2.5-flash instead of gemini-2.5-pro delivers most of the intelligence at roughly one-fifth the cost — reserve Pro for use cases where the quality difference genuinely matters. Setting maxOutputTokens to a value appropriate for your feature (128 for a quick classification, 512 for a short summary, 2048 for a detailed explanation) prevents the model from generating unnecessarily verbose responses. And the caching layer described in the previous section remains the single highest-leverage optimization for apps with repetitive query patterns.
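The per-feature output budgets above can be made hard to forget by deriving `generationConfig` from the feature rather than setting it ad hoc at each call site. A minimal sketch, assuming the REST body's `generationConfig.maxOutputTokens` field; `AIFeature` and `generationConfigJSON(for:)` are hypothetical names introduced for illustration, and the budgets simply mirror the examples in the paragraph above.

```swift
import Foundation

// Hypothetical feature enum; the budgets mirror the examples in the text.
enum AIFeature {
    case classification, shortSummary, detailedExplanation

    var maxOutputTokens: Int {
        switch self {
        case .classification: return 128
        case .shortSummary: return 512
        case .detailedExplanation: return 2048
        }
    }
}

// Mirrors the `generationConfig` object of a generateContent request body.
struct GenerationConfig: Encodable {
    let maxOutputTokens: Int
}

// Builds the JSON fragment to merge into the request body for a feature.
func generationConfigJSON(for feature: AIFeature) throws -> String {
    let config = GenerationConfig(maxOutputTokens: feature.maxOutputTokens)
    let data = try JSONEncoder().encode(["generationConfig": config])
    return String(decoding: data, as: UTF8.self)
}
```

Keeping the budget next to the feature definition means a new feature cannot ship without someone choosing a cap for it.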

What the Docs Don't Tell You: Lessons from Production

The code so far runs cleanly on the simulator and over stable Wi-Fi. Ship it to real users, though, and a few traps surface that you can't learn from the documentation alone. Here are three I picked up while wiring Gemini into apps with 50 million cumulative downloads — drawn from Crashlytics reports and App Store Connect review notes.

1. Background transitions make the stream "freeze silently"

This was the first wall I hit. When a user presses the home button mid-stream, or switches to another app, the OS suspends the URLSession data task. The catch is that AsyncStream then goes silent — no error, no completion. The returning user is left staring at a spinner that keeps rotating over a response that stopped halfway.

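The docs say "use URLSessionConfiguration.background for long-running requests," but that's effectively unusable for streaming (SSE), because background sessions are built around upload and download tasks rather than incremental reads. I switched to watching scenePhase and explicitly tearing down the stream when the app moves to the background.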

// StreamingChatView.swift — safely fold the stream on scenePhase change
struct StreamingChatView: View {
    @Environment(\.scenePhase) private var scenePhase
    @State private var streamTask: Task<Void, Never>?
    @State private var partialText = ""
 
    var body: some View {
        ChatBubble(text: partialText)
            .onChange(of: scenePhase) { _, newPhase in
                if newPhase == .background {
                    // Cancel the in-flight stream and persist what we have
                    streamTask?.cancel()
                    if !partialText.isEmpty {
                        persistDraft(partialText)  // resume from here on return
                    }
                }
            }
    }
 
    private func startStream(prompt: String) {
        streamTask = Task {
            do {
                for try await chunk in geminiClient.stream(prompt: prompt) {
                    if Task.isCancelled { break }  // without this, cancel() doesn't stop the loop
                    partialText += chunk
                }
            } catch is CancellationError {
                // Expected. Do nothing.
            } catch {
                showError(error)
            }
        }
    }
}

The key is the Task.isCancelled check inside the for try await loop. Without it, calling cancel() still drains chunks buffered in the network layer and burns CPU updating a view nobody is looking at. After adding that one line, the share of "background memory-warning" crashes in Crashlytics dropped noticeably.

2. Buffer SSE lines yourself

Gemini's streaming is Server-Sent Events: data: {...} lines arrive a little at a time. On stable connections one chunk equals one line, but on mobile networks (especially underground or with weak signal) chunks get split mid-JSON-line. You receive only { "text": "hel and the rest comes in the next chunk.

Many official samples hand each chunk straight to JSONDecoder, which fails on the incomplete JSON and ends with not a single character displayed. The fix is to process only lines completed by a newline and keep the unfinished tail in a buffer.

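// SSELineBuffer.swift: absorb incomplete lines that straddle chunk boundaries
struct SSELineBuffer {
    private var buffer = ""
 
    /// Take a chunk, return the payloads of the data: lines completed by a newline
    mutating func append(_ chunk: String) -> [String] {
        buffer += chunk
        var completeLines: [String] = []
        // Swift treats "\r\n" as a single Character, so comparing against "\n"
        // alone never matches when the server sends CRLF line endings.
        while let newlineIndex = buffer.firstIndex(where: { $0 == "\n" || $0 == "\r\n" }) {
            let line = String(buffer[..<newlineIndex])
            buffer.removeSubrange(...newlineIndex)
            if line.hasPrefix("data:") {
                // The SSE format makes the space after the colon optional
                var payload = line.dropFirst(5)
                if payload.first == " " { payload = payload.dropFirst() }
                completeLines.append(String(payload))
            }
        }
        return completeLines  // unfinished tail stays in buffer for next time
    }
}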

Dropping this SSELineBuffer in front of the decoder visibly cut streaming failures on cellular. I didn't know this at first and lost two full days to a low-reproducibility bug: "works on Wi-Fi, silently stalls on 4G." I hope sharing it spares you the same.
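Once the buffer hands back complete payloads, each one still has to be decoded defensively. The sketch below assumes the usual streamed response shape (`candidates[].content.parts[].text`) and models only those fields. `StreamChunk` and `extractText(fromPayload:)` are illustrative names, not part of the client shown earlier.

```swift
import Foundation

// Minimal Decodable mirror of one streamed chunk. Only the fields needed to
// pull out text are modeled; everything else in the payload is ignored.
struct StreamChunk: Decodable {
    struct Candidate: Decodable {
        struct Content: Decodable {
            struct Part: Decodable { let text: String? }
            let parts: [Part]?
        }
        let content: Content?
    }
    let candidates: [Candidate]?
}

// Returns the text carried by one complete `data:` payload, or nil when the
// payload is not valid JSON (for example a truncated or malformed line).
func extractText(fromPayload payload: String) -> String? {
    guard let chunk = try? JSONDecoder().decode(StreamChunk.self, from: Data(payload.utf8)) else {
        return nil
    }
    let parts = chunk.candidates?.first?.content?.parts ?? []
    return parts.compactMap(\.text).joined()
}
```

Returning nil instead of throwing lets the caller skip a bad line and keep the stream alive, which matters more on a flaky connection than strict validation does.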

3. The privacy disclosure the App Store actually flagged

Even with the privacy manifest in place (covered earlier), one of my apps was rejected once. The reason was a gap on the App Store Connect "privacy label" side. When you send user-entered text or photos to the Gemini API (a third party), you must declare User Content under "Data Collected" as "shared with third parties." The manifest (PrivacyInfo.xcprivacy) and the label are separate; you need both to pass review.

In my reply to the review team, I added one sentence of operational fact: "We only send text and images at the moment the user explicitly invokes an AI feature, and never use them for ad tracking." Being specific about what we send and what we don't made the re-review go more smoothly than vague wording ever did.

Cost and Performance: What the Numbers Actually Say

"Fast and cheap" in the abstract doesn't help you make design decisions. Here are real measurements from an app I run (an AI chat feature, roughly 8,000 monthly active users), offered as a reference point. The figures are from spring 2026, routed through the Tokyo region, baseline gemini-2.5-flash — treat them as one example, not a benchmark.

For latency, time-to-first-chunk (the moment the user feels things "start moving") was a median of 0.8s and a 95th percentile of 1.9s on Flash. Switching the same prompt to gemini-2.5-pro stretched that to a 2.4s median and 4.1s at the 95th percentile. Perceived snappiness is almost entirely decided by that first chunk, so I default to Flash for chat and reserve Pro for summarization and long-form generation where quality pays off.

On cost, a single chat round-trip (about 600 input tokens plus 400 output) runs roughly ¥0.05 on Flash. Assuming a user does three round-trips a day, monthly API cost at 8,000 MAU landed around ¥3,000–4,000. At that scale, AdMob rewarded-ad revenue comfortably covers it, so adding AI features hasn't turned the unit economics negative. For image-heavy features, resizing to 1024px before sending cut tokens per request by about 40% — a difference of several hundred to a thousand yen a month. Getting in the habit of estimating with the TokenEstimator shown earlier helps you avoid surprise bills.

For stability, introducing the Task.isCancelled check and SSELineBuffer from the previous section moved my Crashlytics crash-free rate from 99.1% to 99.7%. Only 0.6 points, but the active users who lean on AI features were the ones hitting crashes, so the quiet payoff was fewer "it stops halfway" one-star reviews.

To distill the guidance:

Default interactive features (chat, dialogue) to gemini-2.5-flash
Use gemini-2.5-pro only where quality drives UX — summarization, classification, structured output
Always resize images before sending and pre-estimate tokens with TokenEstimator
Once MAU reaches the tens of thousands and API cost starts pressing on ad revenue, consider a free-tier usage cap plus a paid plan

Looking back

Integrating the Gemini API into a SwiftUI app at production quality comes down to five pillars, each requiring deliberate design choices. Safe key management via the Keychain (or a backend proxy) protects both your account and your users. Reliable streaming with AsyncStream and Swift's actor model delivers the real-time AI experience users expect without race conditions. Memory-efficient multimodal input through mandatory image resizing keeps the app stable across the full range of iOS devices. Exponential backoff with jitter makes transient failures invisible to users while protecting the API from thundering-herd retries. And a Privacy Manifest plus thoughtful disclosure language ensures your app clears App Store review without friction.

Each of these pillars is independently valuable. Together, they form a production architecture you can be confident shipping to a global audience — the kind of craft that turns a promising idea into an app users trust and return to.

Why Robust Error Handling Matters at Scale

Gemini API is composed of multiple backend services, each prone to transient failures: network latency, temporary overload, key rotation, and sudden quota changes. Without proper error handling, a single 429 error can cascade into system-wide failures if retry logic is absent or incorrect.

Common failure scenario: five hundred concurrent requests hit the Gemini API and 10% receive a 429 (rate limit). Without retries, those 50 requests fail permanently: users see errors and trust erodes. With proper exponential backoff, the rate-limited requests are retried and typically succeed within seconds.

Complete Error Code Reference

Gemini API errors are expressed as HTTP status codes + Google error types. Here's the complete taxonomy of production-relevant errors:

HTTP Code	Error Type	Root Cause	Action
400	INVALID_ARGUMENT	Malformed request payload	Validate params before sending
401	UNAUTHENTICATED	Invalid/expired API key	Regenerate key, check access
403	PERMISSION_DENIED	IAM permissions missing	Grant required roles in IAM
429	RESOURCE_EXHAUSTED	Quota/capacity exceeded	Increase quota or reduce load
429	RESOURCE_EXHAUSTED	Rate limit exceeded (RPM/QPM)	Exponential backoff + retry
500	INTERNAL	Gemini backend error	Retry with backoff
502	UNAVAILABLE	Service temporarily down	Retry with backoff
503	UNAVAILABLE	Maintenance or overload	Retry + queue for later

Critical distinction: 429, 500, 502, and 503 are retryable, although a 429 caused by an exhausted quota only clears once the quota resets or is raised. 400 INVALID_ARGUMENT, 401, and 403 are NOT retryable: fix the request for a 400, and treat 401/403 as grounds for immediate alerting and manual intervention.
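On the iOS side, the Action column of the table can be folded into a single switch so every call site makes the same decision. This is a minimal sketch. `FailureAction` and `action(forHTTPStatus:)` are hypothetical names introduced here, and mapping unlisted codes to `.fail` is an assumption rather than documented behavior.

```swift
// Hypothetical helper: what the client should do with a failed request,
// following the Action column of the table above.
enum FailureAction: Equatable {
    case retryWithBackoff
    case alertAndStop   // credentials or permissions: retrying cannot help
    case fixRequest     // malformed input: retrying resends the same bad payload
    case fail           // anything not covered by the table
}

func action(forHTTPStatus status: Int) -> FailureAction {
    switch status {
    case 429, 500, 502, 503: return .retryWithBackoff
    case 401, 403:           return .alertAndStop
    case 400:                return .fixRequest
    default:                 return .fail
    }
}
```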

Deep Dive: 400-Series Errors

INVALID_ARGUMENT

This error signals that your request violates API contract. Common causes:

max_output_tokens exceeds model limit (Gemini 2.0 Flash: max 8,192)
temperature outside [0.0, 2.0] range
tools schema malformed JSON
contents[].role not in ["user", "model"]
system_instruction exceeds length limit (typically 25,000 tokens)

Production best practice: implement pre-flight request validation:

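def validate_request(payload):
    """Validate generation parameters before sending to Gemini."""
    # Raise ValueError rather than assert: asserts are stripped under `python -O`
    if payload.get("max_output_tokens", 0) > 8192:
        raise ValueError("max_output_tokens exceeds 8,192")
    if not 0 <= payload.get("temperature", 1.0) <= 2.0:
        raise ValueError("temperature must be in [0.0, 2.0]")
    if not 0 < payload.get("top_p", 1.0) <= 1.0:
        raise ValueError("top_p must be in (0.0, 1.0]")
    return True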

RESOURCE_EXHAUSTED (Quota)

This error means your project has exhausted a quota, such as a requests-per-minute, tokens-per-minute, or requests-per-day limit. Quota is enforced at three levels:

Project quota — RPM (Requests Per Minute) limit for your whole project
API quota — Google-wide Gemini API limit
User quota — Optional per-user or per-key limits

In production, set Project quota explicitly and alert at 80% consumption:

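def monitor_quota(current_rpm, limit_rpm):
    """Alert once usage crosses 80% of the RPM limit."""
    # alert() is a placeholder for your own paging or notification hook
    if current_rpm > limit_rpm * 0.8:
        alert(f"Quota at {(current_rpm/limit_rpm)*100:.0f}%")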

Authentication & Authorization (401, 403)

UNAUTHENTICATED (401)

Your API key is invalid, expired, or has insufficient scope. Production safeguards:

Store key in environment variable, never hardcode
Rotate keys every 90 days
Keep backup keys ready for emergency switchover
Use Google Cloud Secret Manager for centralized key management

import os
from google.cloud import secretmanager
 
def get_api_key():
    client = secretmanager.SecretManagerServiceClient()
    secret = f"projects/{os.getenv('GCP_PROJECT_ID')}/secrets/gemini-api-key/versions/latest"
    response = client.access_secret_version(request={"name": secret})
    return response.payload.data.decode("UTF-8")

PERMISSION_DENIED (403)

Your service account lacks IAM roles to call Gemini API. Fix by assigning:

roles/aiplatform.user (enough for inference; roles/aiplatform.admin also works but grants far more than a client needs) or
Custom role with aiplatform.serviceAccounts.actAsUser

Audit IAM permissions monthly in production.

Handling 429: Exponential Backoff Strategy

A 429 (rate limit) is the error that retrying solves most reliably; the 5xx errors in the table above use the same approach. The key is exponential backoff, where the wait time doubles with each retry:

Attempt 1 → fail → wait 1s
Attempt 2 → fail → wait 2s
Attempt 3 → fail → wait 4s
Attempt 4 → fail → wait 8s
Attempt 5 → fail → give up (with a higher retry limit the wait keeps doubling, capped at 60s)

Python implementation:

import time
import random
from google.api_core import exceptions as gexc
from google.generativeai import GenerativeModel
 
# 429 plus transient 5xx errors; anything else propagates immediately
RETRYABLE = (gexc.ResourceExhausted, gexc.InternalServerError, gexc.ServiceUnavailable)
 
def call_with_backoff(prompt, model="gemini-2.0-flash", max_retries=5):
    instance = GenerativeModel(model)
    
    for attempt in range(max_retries):
        try:
            return instance.generate_content(prompt)
        except RETRYABLE:
            if attempt == max_retries - 1:
                raise  # out of retries
            
            # 1s, 2s, 4s, ... plus up to 1s of jitter, capped at 60s
            wait = min(2 ** attempt + random.random(), 60)
            print(f"Retryable error: waiting {wait:.1f}s...")
            time.sleep(wait)

TypeScript equivalent:

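import { GoogleGenerativeAI } from "@google/generative-ai";

async function callWithBackoff(
  prompt: string,
  model = "gemini-2.0-flash",
  maxRetries = 5
): Promise<any> {
  // Fail fast if the key is missing instead of sending unauthenticated requests
  const apiKey = process.env.GEMINI_API_KEY;
  if (!apiKey) throw new Error("GEMINI_API_KEY is not set");
  const client = new GoogleGenerativeAI(apiKey);
  const modelInstance = client.getGenerativeModel({ model });
 
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await modelInstance.generateContent(prompt);
    } catch (error: any) {
      if (!error.message?.includes("429")) throw error; // non-retryable error
      if (i === maxRetries - 1) throw error; // out of retries
      
      // 1s, 2s, 4s, ... plus up to 1s of jitter, capped at 60s
      const wait = Math.min(Math.pow(2, i) + Math.random(), 60) * 1000;
      await new Promise(r => setTimeout(r, wait));
    }
  }
}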
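Swift equivalent, for the iOS client this article is about. This is a sketch under stated assumptions rather than the client's actual retry layer: `HTTPStatusError` is a hypothetical stand-in for whatever error type your networking code throws, and the `sleep` and `jitter` parameters exist only so the schedule can be tested without real waiting.

```swift
import Foundation

// Delay before the next attempt: 1s, 2s, 4s, ... plus jitter, capped at `cap`.
// `attempt` is zero-based; `jitter` is injectable so the schedule is testable.
func backoffDelay(attempt: Int, cap: Double = 60,
                  jitter: Double = Double.random(in: 0..<1)) -> Double {
    min(pow(2, Double(attempt)) + jitter, cap)
}

// Hypothetical error type standing in for whatever the client throws.
struct HTTPStatusError: Error { let status: Int }

// Retries `operation` on 429 only, mirroring the Python and TypeScript versions.
// Any other error, or a 429 on the final attempt, propagates to the caller.
func callWithBackoff<T>(
    maxRetries: Int = 5,
    sleep: (Double) async throws -> Void = {
        try await Task.sleep(nanoseconds: UInt64($0 * 1_000_000_000))
    },
    operation: () async throws -> T
) async throws -> T {
    var attempt = 0
    while true {
        do {
            return try await operation()
        } catch let error as HTTPStatusError
                    where error.status == 429 && attempt < maxRetries - 1 {
            try await sleep(backoffDelay(attempt: attempt))
            attempt += 1
        }
    }
}
```

Because the default sleep goes through `Task.sleep`, cancelling the surrounding task (for example when the app moves to the background) also ends the retry loop.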

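Usage Tiers: Rate Limits and Pricing

Gemini API usage tiers are tied to your project's billing status and cumulative spend. Moving up a tier raises your rate limits; per-token prices are set per model and do not drop as spend grows:

Tier	Qualification	What changes
Free	No billing account linked	Lowest rate limits; no charge on eligible models
Tier 1	Billing account linked	Higher rate limits; pay-as-you-go pricing
Tier 2	Higher cumulative spend	Higher rate limits; same per-token prices
Tier 3	Highest cumulative spend	Highest rate limits; same per-token prices

Smart teams plan around this: estimate your monthly volume, select the right model for the price, and move up a tier before rate limits become the bottleneck.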

Queueing + Batch Processing for Throughput

Simple retry logic alone leaves throughput on the table. Instead, use queueing to submit requests at a sustainable rate.
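One simple way to submit at a sustainable rate is to space requests evenly and have each caller wait for its slot. The sketch below is an assumption about how such a queue could be paced, not a documented Gemini pattern. `RequestPacer` and `reserveSlot(now:)` are hypothetical names, and the timestamp is passed in so the logic stays deterministic.

```swift
import Foundation

// Hypothetical pacer: spaces requests evenly so the client never exceeds
// `requestsPerMinute`. Time is passed in, which keeps the logic testable.
struct RequestPacer {
    private let interval: TimeInterval
    private var nextSlot: TimeInterval = 0

    init(requestsPerMinute: Int) {
        precondition(requestsPerMinute > 0, "requestsPerMinute must be positive")
        interval = 60.0 / Double(requestsPerMinute)
    }

    /// Reserves the next send slot and returns how long the caller should
    /// wait (in seconds) before sending. `now` is any monotonic timestamp.
    mutating func reserveSlot(now: TimeInterval) -> TimeInterval {
        let slot = max(now, nextSlot)
        nextSlot = slot + interval
        return slot - now
    }
}
```

In an app, the pacer would typically live inside an actor, with each request sleeping for the returned delay before it is sent. Bursts then turn into a short queue instead of a wall of 429 responses.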

Context Caching: 90% Input Cost Reduction

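Context caching stores system prompts and large documents (50KB+) on the server so that repeated queries reuse them. Cached input tokens are billed at a steep discount, up to roughly 90% depending on the model, with a separate storage charge for explicit caches.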

Production Readiness Checklist

Before deploying to production:

[ ] API key in environment variable (never hardcoded)
[ ] Exponential Backoff with max 5 retries, max 60s wait
[ ] 401/403 errors trigger immediate alerts
[ ] Project quota set explicitly (2x expected peak)
[ ] Cloud Monitoring dashboard active
[ ] Monthly cost estimated and within budget
[ ] IAM permissions verified
[ ] Load tested at 1.5x expected QPS

Thank You for Reading

Gemini Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.