Building Production Full-Stack AI Apps with Gemini API & Supabase
A practical guide to building production-grade full-stack AI apps with Gemini API and Supabase—covering auth, pgvector, Edge Functions, RLS, and cost control, plus the tuning lessons (IVFFlat to HNSW recall recovery, the service_role RLS bypass) you only learn in production.
A RAG chat endpoint I had running happily on a Supabase Edge Function suddenly started returning visibly worse matches the moment my document set grew from 10,000 to 120,000 rows — without a single line of code changing. The culprit was the pgvector index configuration, the kind of "only shows up at scale" trap that quickstart docs never mention.
I've been building iOS and Android apps solo since 2014, and the Gemini API + Supabase combination is, in my experience, one of the few stacks an independent developer can actually run in production alone. This guide walks through wiring up auth, pgvector, Edge Functions, RLS, and cost control end to end — and then goes into the tuning decisions you only discover once real traffic hits.
Setup and context
Supabase supplies PostgreSQL, authentication, real-time subscriptions, and Edge Functions in one place, while the Gemini API handles text generation, multimodal processing, and embeddings. Together they cover what you need to build AI chatbots, RAG systems, and semantic search platforms without stitching together separate services.
This guide walks you through building production-grade full-stack AI applications using this combination. You'll learn proven architecture patterns, authentication flows, database design with pgvector, security implementation, and performance optimization techniques that real applications rely on.
Supabase & Gemini Architecture Patterns
A well-designed Supabase + Gemini architecture consists of several interconnected layers:
Frontend Layer
React, Next.js, or similar client application
Real-time UI updates via Supabase Realtime client
Streaming response handling from Gemini API
API & Edge Functions Layer
Supabase Edge Functions (TypeScript/Deno runtime)
Authenticated requests to Gemini API
Request validation and rate limiting
Caching strategies
Data Layer
PostgreSQL (Supabase-hosted)
pgvector extension for semantic vector storage
User data, conversation history, document metadata
Row Level Security (RLS) for multi-tenant isolation
External Services
Gemini API (text generation, embeddings)
Storage (Supabase Storage or S3)
Optional: Redis or Vercel KV for caching
Why This Architecture Works
PostgreSQL with pgvector eliminates the need for a separate vector database: semantic search runs natively in your primary database. Edge Functions keep the Gemini API key on the server, so it never reaches the client. RLS enforces data isolation inside the database itself, without additional middleware, as long as queries run through a user-scoped client rather than the service_role key.
This architecture scales gracefully from prototype to millions of users while keeping operational costs reasonable. You get native transaction support, complex queries, and relational integrity that pure vector databases can't match.
WHAT YOU'LL LEARN
✦The exact pgvector parameters and trade-offs for moving from IVFFlat to HNSW to recover search recall from 0.78 to 0.93
✦The trap where a service_role key silently bypasses RLS, and how to scope permissions correctly with a user-scoped client
✦Avoiding 429s in embedding batches (concurrency cap + exponential backoff) and the real monthly cost at 8,000 MAU
Setting Up Supabase and Authentication
Creating Your Supabase Project
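Start by creating a new project at supabase.com. Once initialized, capture the project URL and anon key in your .env.local, along with your app's base URL. The code below reads them as NEXT_PUBLIC_SUPABASE_URL, NEXT_PUBLIC_SUPABASE_ANON_KEY, and NEXT_PUBLIC_APP_URL.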
Implement email-based sign-up with email confirmation:
// lib/auth.ts
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!
)

export async function signUpWithEmail(email: string, password: string) {
  const { data, error } = await supabase.auth.signUp({
    email,
    password,
    options: {
      emailRedirectTo: `${process.env.NEXT_PUBLIC_APP_URL}/auth/callback`,
    },
  })
  if (error) throw new Error(error.message)
  return data
}

export async function signInWithEmail(email: string, password: string) {
  const { data, error } = await supabase.auth.signInWithPassword({
    email,
    password,
  })
  if (error) throw new Error(error.message)
  return data.session
}

export async function signOut() {
  const { error } = await supabase.auth.signOut()
  if (error) throw new Error(error.message)
}
For production, always enable email confirmation or OAuth to prevent unauthorized account creation.
User Profiles Table
Store additional user information beyond Supabase auth:
CREATE TABLE profiles (
  id UUID REFERENCES auth.users(id) ON DELETE CASCADE PRIMARY KEY,
  display_name TEXT,
  avatar_url TEXT,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

ALTER TABLE profiles ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Users read own profile" ON profiles
  FOR SELECT USING (auth.uid() = id);

CREATE POLICY "Users update own profile" ON profiles
  FOR UPDATE USING (auth.uid() = id);

-- Auto-create profile on signup
CREATE FUNCTION handle_new_user()
RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO public.profiles (id, display_name)
  VALUES (new.id, new.email);
  RETURN new;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

CREATE TRIGGER on_auth_user_created
  AFTER INSERT ON auth.users
  FOR EACH ROW EXECUTE FUNCTION handle_new_user();
Building Semantic Search with pgvector
Enabling pgvector
Supabase includes pgvector by default. Enable it via the SQL editor:
CREATE EXTENSION IF NOT EXISTS vector;
Documents and Embeddings Schema
Design tables for storing documents and their vector embeddings:
CREATE TABLE documents (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  user_id UUID REFERENCES auth.users(id) ON DELETE CASCADE NOT NULL,
  title TEXT NOT NULL,
  content TEXT NOT NULL,
  source_url TEXT,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE TABLE document_chunks (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE NOT NULL,
  chunk_index INT NOT NULL,
  content TEXT NOT NULL,
  -- Gemini embedding-001 produces 768-dimensional vectors
  embedding vector(768),
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Create HNSW index for fast semantic search
CREATE INDEX ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200);

-- Enable RLS
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Users access own documents" ON documents
  FOR SELECT USING (auth.uid() = user_id);

CREATE POLICY "Users access own chunks" ON document_chunks
  FOR SELECT USING (
    document_id IN (SELECT id FROM documents WHERE user_id = auth.uid())
  );
Semantic Search Query
Query similar chunks using cosine distance:
SELECT
  dc.id,
  dc.content,
  1 - (dc.embedding <=> query_embedding) AS similarity
FROM document_chunks dc
WHERE dc.document_id IN (
  SELECT id FROM documents WHERE user_id = auth.uid()
)
ORDER BY dc.embedding <=> query_embedding
LIMIT 5;
Wrap this as an RPC function for easier access from Edge Functions:
CREATE FUNCTION search_documents(
  query_embedding vector,
  user_id uuid,
  match_limit int DEFAULT 5
)
RETURNS TABLE (id uuid, content text, similarity float8) AS $$
BEGIN
  RETURN QUERY
  SELECT
    dc.id,
    dc.content,
    1 - (dc.embedding <=> query_embedding)::float8
  FROM document_chunks dc
  WHERE dc.document_id IN (
    -- Qualify the column as d.id: a bare "id" is ambiguous with the
    -- "id" output column declared in RETURNS TABLE
    SELECT d.id FROM documents d
    WHERE d.user_id = search_documents.user_id
  )
  ORDER BY dc.embedding <=> query_embedding
  LIMIT match_limit;
END;
$$ LANGUAGE plpgsql;
Gemini Embeddings Pipeline
Document Upload → Embedding Workflow
When users upload documents, automatically chunk and embed them:
// supabase/functions/embed-document/index.ts
import { serve } from 'https://deno.land/std@0.168.0/http/server.ts'
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'

const supabaseUrl = Deno.env.get('SUPABASE_URL')!
const supabaseKey = Deno.env.get('SUPABASE_SERVICE_ROLE_KEY')!
const geminiApiKey = Deno.env.get('GEMINI_API_KEY')!

const supabase = createClient(supabaseUrl, supabaseKey)

// Split text into chunks (~500 tokens each).
// Whitespace-separated words are counted as a rough proxy for tokens.
function chunkText(text: string, maxTokens: number = 500): string[] {
  const words = text.split(/\s+/)
  const chunks: string[] = []
  let current = ''
  for (const word of words) {
    if ((current + ' ' + word).split(' ').length > maxTokens) {
      chunks.push(current)
      current = word
    } else {
      current += (current ? ' ' : '') + word
    }
  }
  if (current) chunks.push(current)
  return chunks
}

// Call Gemini Embedding API
async function generateEmbedding(text: string): Promise<number[]> {
  const response = await fetch(
    'https://generativelanguage.googleapis.com/v1beta/models/embedding-001:embedContent',
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'x-goog-api-key': geminiApiKey,
      },
      body: JSON.stringify({
        model: 'models/embedding-001',
        content: { parts: [{ text }] },
      }),
    }
  )
  if (!response.ok) {
    // Include the status code: statusText is often empty over HTTP/2
    throw new Error(`Embedding API error: ${response.status} ${response.statusText}`)
  }
  const data = await response.json()
  return data.embedding.values
}

serve(async (req) => {
  const { documentId, content, userId } = await req.json()
  const chunks = chunkText(content, 500)

  // Embed and store chunks one at a time, stopping at the first insert error
  for (let i = 0; i < chunks.length; i++) {
    const embedding = await generateEmbedding(chunks[i])
    const { error } = await supabase.from('document_chunks').insert({
      document_id: documentId,
      chunk_index: i,
      content: chunks[i],
      embedding,
    })
    if (error) {
      console.error('Insert error:', error)
      return new Response(`Error: ${error.message}`, { status: 500 })
    }
  }

  return new Response(
    JSON.stringify({ success: true, chunkCount: chunks.length }),
    { headers: { 'Content-Type': 'application/json' } }
  )
})
Batch Processing & Rate Limiting
Implement batching to stay within Gemini API rate limits:
serve(async (req) => {
  // Process pending chunks (max 10 per invocation)
  const { data: pending } = await supabase
    .from('document_chunks')
    .select('id, content')
    .is('embedding', null)
    .limit(10)

  for (const chunk of pending || []) {
    const embedding = await generateEmbedding(chunk.content)
    await supabase
      .from('document_chunks')
      .update({ embedding })
      .eq('id', chunk.id)
    // Add small delay between requests
    await new Promise((r) => setTimeout(r, 100))
  }

  return new Response('OK')
})
Building RAG with Edge Functions
RAG Chat Endpoint
Create an Edge Function that retrieves relevant documents and uses Gemini to generate answers. Start with the tables that hold chat history:
CREATE TABLE conversations (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  user_id UUID REFERENCES auth.users(id) ON DELETE CASCADE NOT NULL,
  title TEXT,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

ALTER TABLE conversations ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Users access own conversations" ON conversations
  FOR SELECT USING (auth.uid() = user_id);
CREATE POLICY "Users create conversations" ON conversations
  FOR INSERT WITH CHECK (auth.uid() = user_id);
CREATE POLICY "Users update own conversations" ON conversations
  FOR UPDATE USING (auth.uid() = user_id);
CREATE POLICY "Users delete own conversations" ON conversations
  FOR DELETE USING (auth.uid() = user_id);

-- Messages table
CREATE TABLE messages (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  conversation_id UUID REFERENCES conversations(id) ON DELETE CASCADE NOT NULL,
  user_id UUID REFERENCES auth.users(id) ON DELETE CASCADE NOT NULL,
  role TEXT CHECK (role IN ('user', 'assistant')),
  content TEXT NOT NULL,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

ALTER TABLE messages ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Users access own conversation messages" ON messages
  FOR SELECT USING (
    conversation_id IN (
      SELECT id FROM conversations WHERE user_id = auth.uid()
    )
  );
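The Edge Function body itself is not shown here, so the following is only a minimal sketch of one step it would need: turning retrieved chunks into a grounded prompt. It assumes the chunks have the shape returned by search_documents (id, content, similarity). The helper name buildRagPrompt, the prompt wording, and the 0.5 similarity floor are all illustrative assumptions, not part of the original setup.

```typescript
// Shape of one row returned by the search_documents RPC defined earlier.
interface RetrievedChunk {
  id: string
  content: string
  similarity: number
}

// Hypothetical helper: combines retrieved chunks and the user's question
// into a single prompt. The 0.5 floor is an illustrative default only.
function buildRagPrompt(
  question: string,
  chunks: RetrievedChunk[],
  minSimilarity = 0.5
): string {
  const context = chunks
    .filter((c) => c.similarity >= minSimilarity) // drop weak matches
    .sort((a, b) => b.similarity - a.similarity) // best match first
    .map((c, i) => `[${i + 1}] ${c.content.trim()}`)
    .join('\n\n')

  if (!context) {
    return `No relevant documents were found. Say so, then answer briefly if you can.\n\nQuestion: ${question}`
  }

  return [
    'Answer using only the numbered context below. Cite sources as [n].',
    '',
    'Context:',
    context,
    '',
    `Question: ${question}`,
  ].join('\n')
}
```

The returned string would be passed as the contents of a Gemini generation call. Keeping this step a pure function makes it easy to unit-test separately from the network calls.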
Token Verification in Edge Functions
Verify JWT tokens before processing requests:
import { serve } from 'https://deno.land/std@0.168.0/http/server.ts'
import * as jose from 'https://deno.land/x/jose@v4.14.1/index.ts'

async function verifyToken(token: string) {
  try {
    const secret = new TextEncoder().encode(Deno.env.get('SUPABASE_JWT_SECRET')!)
    const verified = await jose.jwtVerify(token, secret)
    return verified.payload.sub // User ID
  } catch {
    throw new Error('Invalid token')
  }
}

serve(async (req) => {
  const authHeader = req.headers.get('Authorization')
  if (!authHeader) return new Response('Unauthorized', { status: 401 })

  const token = authHeader.replace('Bearer ', '')
  let userId: string | undefined
  try {
    userId = await verifyToken(token)
  } catch {
    // An invalid or expired token is a client error, so answer 401 instead of crashing
    return new Response('Unauthorized', { status: 401 })
  }

  // Now safely process request with verified userId
})
-- Drop old IVFFLAT and create HNSW
DROP INDEX IF EXISTS document_chunks_embedding_idx;
CREATE INDEX ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200);

-- Update statistics
ANALYZE document_chunks;
Monitoring & Observability
Log API usage and track performance:
CREATE TABLE api_usage (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  user_id UUID REFERENCES auth.users(id),
  api_type TEXT CHECK (api_type IN ('embedding', 'generation')),
  tokens_used INT,
  cost DECIMAL(10, 6),
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Monthly aggregation
CREATE VIEW monthly_usage AS
SELECT
  user_id,
  DATE_TRUNC('month', created_at) AS month,
  api_type,
  SUM(tokens_used) AS total_tokens,
  SUM(cost) AS total_cost
FROM api_usage
GROUP BY user_id, DATE_TRUNC('month', created_at), api_type;
Cost Management & Monitoring
Reducing Gemini API Expenses
Batch embeddings: Process multiple texts in one request
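As a rough illustration of batching, the sketch below groups texts and sends each group to the REST batchEmbedContents endpoint instead of issuing one embedContent call per text. The helper names (toBatches, buildBatchEmbedRequest, embedMany) are hypothetical, and the batch size of 50 is an assumption; check the current API limits before relying on it.

```typescript
const API_BASE = 'https://generativelanguage.googleapis.com/v1beta'

interface EmbedRequest {
  model: string
  content: { parts: { text: string }[] }
}

// Split an array into consecutive groups of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be >= 1')
  const out: T[][] = []
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size))
  return out
}

// Build the URL and JSON body for one batchEmbedContents call.
function buildBatchEmbedRequest(texts: string[], model = 'embedding-001') {
  const requests: EmbedRequest[] = texts.map((text) => ({
    model: `models/${model}`,
    content: { parts: [{ text }] },
  }))
  return { url: `${API_BASE}/models/${model}:batchEmbedContents`, body: { requests } }
}

// Embed many texts with one HTTP request per batch.
// batchSize = 50 is an assumption, not a documented limit.
async function embedMany(
  texts: string[],
  apiKey: string,
  batchSize = 50
): Promise<number[][]> {
  const vectors: number[][] = []
  for (const batch of toBatches(texts, batchSize)) {
    const { url, body } = buildBatchEmbedRequest(batch)
    const res = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'x-goog-api-key': apiKey },
      body: JSON.stringify(body),
    })
    if (!res.ok) throw new Error(`Batch embedding failed: ${res.status}`)
    const data = (await res.json()) as { embeddings: { values: number[] }[] }
    for (const e of data.embeddings) vectors.push(e.values)
  }
  return vectors
}
```

Fewer round trips mostly cuts latency and request-count pressure; token-based billing is unchanged, so treat this as a throughput optimization first.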
What the docs don't tell you: lessons from production
Everything above runs as-is, but once your data grows or real users show up, you'll need adjustments the quickstart never covers. Here they are, in the order I actually hit them.
1. Revisit your pgvector index once row counts grow
IVFFlat with lists = 100 was fine at first. But somewhere around 120,000 documents (up from 10,000), the same query started surfacing noticeably worse results. Measuring recall by hand (how many of my known top-5 came back), it had dropped from ~0.95 to ~0.78.
IVFFlat requires you to scale lists with your data (a rough rule is rows / 1000 for tables up to about a million rows, which works out to roughly 120 lists at 120,000 rows); leave it alone and the search clusters get too coarse and miss hits. For an indie app where row counts are unpredictable, switching to HNSW, which needs no count-dependent tuning, made operations much simpler.
-- Move from IVFFlat (needs lists re-tuning as rows grow)
-- to HNSW (count-independent, stable recall)
-- The embedding column lives on document_chunks, so the index goes there
DROP INDEX IF EXISTS document_chunks_embedding_idx;
CREATE INDEX document_chunks_embedding_idx ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Tune precision/speed per session at query time
SET hnsw.ef_search = 40; -- default 40; raise to 80-100 only when you need more recall
After the switch, recall recovered to ~0.93 on the same data. The trade-off: roughly 2x index build time and ~30% more storage. For a read-heavy RAG workload, that's an easy trade.
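The recall figures quoted here come from a hand check of how many known top-5 results came back. A minimal sketch of that measurement follows; the function names recallAtK and meanRecall are illustrative, and the ids are assumed to be chunk ids from a small hand-labelled query set.

```typescript
// Fraction of the known-relevant ids that appear in the top-k results.
function recallAtK(expected: string[], returned: string[], k = 5): number {
  if (expected.length === 0) return 1 // nothing to find counts as full recall
  const topK = new Set(returned.slice(0, k))
  const hits = expected.filter((id) => topK.has(id)).length
  return hits / expected.length
}

// Average recall over a set of labelled queries.
function meanRecall(
  cases: { expected: string[]; returned: string[] }[],
  k = 5
): number {
  if (cases.length === 0) return 0
  const total = cases.reduce(
    (sum, c) => sum + recallAtK(c.expected, c.returned, k),
    0
  )
  return total / cases.length
}
```

Running the same labelled queries before and after an index change gives a comparable number, which is more reliable than judging result quality by eye.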
2. The service_role key silently bypasses RLS
This one gave me a scare. If you use a service_role Supabase client inside an Edge Function, the Row Level Security you carefully set up is completely ignored. One day, reading logs, I realized the setup could return one user's conversations mixed into another's.
The fix: verify the request JWT and touch the database through a user-scoped client that carries that JWT. Reserve service_role for admin-only work like embedding generation.
// This bypasses RLS: a multi-tenant accident waiting to happen
const admin = createClient(SUPABASE_URL, SERVICE_ROLE_KEY)

// Forward the request's Authorization header
// -> RLS automatically scopes queries to that user's rows
const userClient = createClient(SUPABASE_URL, ANON_KEY, {
  global: {
    headers: { Authorization: req.headers.get('Authorization')! },
  },
})

const { data: { user } } = await userClient.auth.getUser()
if (!user) {
  return new Response('Unauthorized', { status: 401 })
}
// All reads/writes via userClient are now confined to this user by RLS
To stay "deny by default," separate your admin and user clients clearly and keep the number of functions that touch service_role small enough to count.
3. Design embedding batches assuming 429s
During the initial bulk load, text-embedding-004 returned 429 (rate limited) almost immediately. Running without a concurrency cap, I plateaued around 1,500 requests per minute. Capping concurrency at 5 and adding exponential backoff let the job run to completion without stalling.
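// Absorb 429s with exponential backoff (pair this with a concurrency cap)
// Assumes the @google/genai SDK, loaded here through Deno's npm: specifier
import { GoogleGenAI } from 'npm:@google/genai'

const ai = new GoogleGenAI({ apiKey: Deno.env.get('GEMINI_API_KEY')! })

async function embedWithRetry(text: string, attempt = 0): Promise<number[]> {
  try {
    const res = await ai.models.embedContent({
      model: 'text-embedding-004',
      contents: text,
    })
    const values = res.embeddings?.[0]?.values
    if (!values) throw new Error('Embedding response contained no values')
    return values
  } catch (e: any) {
    if (e.status === 429 && attempt < 5) {
      // 500ms, 1s, 2s, ... capped at 16s, plus up to 300ms of random jitter
      const waitMs = Math.min(2 ** attempt * 500, 16000) + Math.random() * 300
      await new Promise((r) => setTimeout(r, waitMs))
      return embedWithRetry(text, attempt + 1)
    }
    throw e
  }
}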
The key detail is the small random jitter on the retry delay. If several workers back off on the same cycle, they'll all hit 429 again together.
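The retry helper handles backoff but does not by itself limit how many requests are in flight. One way to add the cap of 5 described in this section is a small worker-pool helper like the sketch below; mapWithConcurrency is a hypothetical name, not a library function.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length)
  let next = 0

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++ // safe: no await between the check and the increment
      results[i] = await worker(items[i], i)
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: runnerCount }, () => run()))
  return results
}
```

With the retry function from this section, a bulk load might then look like `await mapWithConcurrency(chunks, 5, (text) => embedWithRetry(text))`. If any worker rejects, the whole call rejects, so a long job may want to catch per-item errors inside the worker instead.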
4. Stream from Edge Functions — return early
Supabase Edge Functions (Deno) have an execution-time ceiling, and waiting for Gemini's entire response before returning can get truncated on long generations. Piping tokens out through a ReadableStream as they arrive lowers perceived latency and leaves headroom under the limit. In my RAG chat, time-to-first-token ran around 0.9s median and ~2s at p95 with Flash.
Pre-launch checklist I always run
Is the pgvector index HNSW — or, if IVFFlat, is lists sized to the row count?
Does every DB access in an Edge Function use a user-scoped client (no service_role abuse)?
Do embedding batches have a concurrency cap and backoff?
Are Gemini responses streamed and returned early?
Do errors land in something like function_logs and feed an alert?
What it actually costs
For reference, a small app with around 8,000 monthly active users runs me roughly ¥4,000-6,000 / month combined for Gemini (embeddings + Flash for RAG) and Supabase (Pro plan). I deliberately design the in-app AdMob revenue to cover this infrastructure, so each month I check this number to confirm per-user inference cost stays under ad ARPU.
Embeddings are a one-time cost you reuse, so most of the spend is on the RAG response side. Caching frequent questions, using Flash for summaries, and routing only genuinely hard reasoning to Pro kept costs down 30-40% with no noticeable quality drop.
Wrapping up — your next step
Wire up one RAG chat endpoint on a small dataset, then deliberately grow your documents 10x and measure how recall and response time shift. The issues that only appear when scale changes are exactly the ones that matter in production.
I hope this helps anyone trying to take an indie app all the way to production on their own. Thanks for reading.
Setup and context — Why Nuxt 3 × Gemini API?
Nuxt 3, the flagship full-stack framework for the Vue.js ecosystem, continues to attract a wide range of developers in 2026. Its file-system-based routing, automatic code splitting, and powerful SSR capabilities allow you to manage frontend and backend in a unified codebase — making it a compelling choice for AI-powered applications.
When it comes to integrating AI features, Nuxt 3 has some distinct advantages. The "server routes" feature — where you drop a TypeScript file under server/api/ and it instantly becomes a secure API endpoint — provides an ideal foundation for calling Gemini API without exposing API keys to the client side. Additionally, the Composable pattern (useXxx) abstracts state management in a way that dramatically reduces the complexity of implementing multi-turn chat and real-time streaming UIs.
Nuxt 3's Nitro server engine is another key differentiator. Nitro compiles your server code to run on a wide range of targets — Node.js, Cloudflare Workers, Vercel Edge Functions, AWS Lambda, and more — without changing a single line of application code. This makes Nuxt 3 exceptionally well-suited for global-scale AI applications where latency and deployment flexibility are paramount.
This guide sets out to help you build a Nuxt 3 × Gemini API full-stack AI app that runs stably in production — not just a working demo. We'll walk through the full engineering lifecycle: project setup, secure server route design, SSE streaming, multi-turn conversation management, rate limiting, authentication middleware, deployment to three different hosting targets, error handling with exponential backoff, and cost optimization strategies. By the end, you'll have a blueprint you can adapt to build your own AI products.
Project Setup and Environment Configuration
Creating a Nuxt 3 Project
Start by scaffolding a fresh Nuxt 3 project using the official nuxi CLI. The Gemini JavaScript/TypeScript SDK (@google/genai) is the recommended way to interact with the Gemini API.
# Create a new Nuxt 3 project
npx nuxi@latest init gemini-nuxt-app
cd gemini-nuxt-app

# Install the Gemini API SDK
npm install @google/genai

# Install Pinia for state management (optional but recommended)
npm install pinia @pinia/nuxt

# Start the dev server
npm run dev
Your project now has a working Nuxt 3 setup. Open http://localhost:3000 to confirm everything is running.
Project Structure Overview
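A well-organized Nuxt 3 AI project typically looks like this (only the files this guide touches are listed):
gemini-nuxt-app/
├── server/
│   ├── api/
│   │   ├── chat.post.ts
│   │   ├── chat-stream.post.ts
│   │   └── chat-multi.post.ts
│   └── middleware/
│       ├── rate-limit.ts
│       └── auth.ts
├── stores/
│   └── chat.ts
├── nuxt.config.ts
├── wrangler.toml
└── .env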
Never hardcode your Gemini API key. Store it in a .env file and access it exclusively from the server side using Nuxt's runtimeConfig.
# .env (add this to .gitignore immediately)
GEMINI_API_KEY=YOUR_GEMINI_API_KEY
// nuxt.config.ts
export default defineNuxtConfig({
  modules: ['@pinia/nuxt'],
  runtimeConfig: {
    // Server-side only: never exposed to the client bundle
    geminiApiKey: process.env.GEMINI_API_KEY,
    rateLimitMax: 10, // Requests per window
    rateLimitWindowMs: 60000, // Window size in ms
    // Public config: accessible on both client and server
    public: {
      appName: 'Gemini Nuxt App',
      streamingEnabled: true
    }
  },
  nitro: {
    // Target Cloudflare Pages for production (change per deploy target)
    preset: 'cloudflare-pages'
  }
})
With this configuration, runtimeConfig.geminiApiKey is only accessible from within server routes. Attempting to read it in a Vue component returns undefined, so the key can never leak into the client bundle — a critical security property for any production AI application.
Secure Gemini API Integration via Server Routes
Basic Chat Endpoint
The foundation of any Gemini-powered Nuxt app is a secure server route. Here's a production-ready implementation with proper input validation, error mapping, and response shaping.
// server/api/chat.post.ts
import { GoogleGenAI } from '@google/genai'

export default defineEventHandler(async (event) => {
  const config = useRuntimeConfig()
  const body = await readBody(event)

  // --- Input Validation ---
  if (!body.message || typeof body.message !== 'string') {
    throw createError({
      statusCode: 400,
      statusMessage: 'Bad Request',
      message: 'message field is required and must be a string'
    })
  }
  if (body.message.trim().length === 0) {
    throw createError({ statusCode: 400, message: 'message cannot be empty' })
  }
  if (body.message.length > 8000) {
    throw createError({
      statusCode: 400,
      message: 'message exceeds maximum length of 8000 characters'
    })
  }

  // --- Gemini API Call ---
  const genai = new GoogleGenAI({ apiKey: config.geminiApiKey })
  try {
    const response = await genai.models.generateContent({
      model: 'gemini-2.5-flash',
      contents: body.message,
      config: {
        systemInstruction: body.systemInstruction || 'You are a helpful, knowledgeable assistant.',
        maxOutputTokens: 2048,
        temperature: body.temperature ?? 0.7,
        candidateCount: 1
      }
    })
    return {
      text: response.text,
      usageMetadata: {
        promptTokens: response.usageMetadata?.promptTokenCount,
        outputTokens: response.usageMetadata?.candidatesTokenCount,
        totalTokens: response.usageMetadata?.totalTokenCount
      },
      model: 'gemini-2.5-flash',
      timestamp: new Date().toISOString()
    }
  } catch (error: any) {
    // Map Gemini API errors to appropriate HTTP status codes
    const statusMap: Record<number, number> = {
      400: 400, // Invalid request
      401: 401, // Bad API key
      403: 403, // Permission denied
      404: 404, // Model not found
      429: 429, // Quota exceeded
      500: 502, // Gemini server error → 502 Bad Gateway
      503: 503 // Service unavailable
    }
    throw createError({
      statusCode: statusMap[error.status] ?? 500,
      message: error.status === 429
        ? 'Gemini API quota exceeded. Please wait before retrying.'
        : `AI generation failed: ${error.message}`
    })
  }
})
Implementing Streaming Responses with Server-Sent Events
Streaming is one of the most impactful factors for AI app UX. Instead of waiting for the full response before rendering, displaying tokens as they're generated dramatically improves perceived speed and keeps users engaged. With Nuxt 3's sendStream helper and the browser's native ReadableStream API, implementing SSE streaming is surprisingly clean.
Streaming-Enabled Server Route
// server/api/chat-stream.post.ts
import { GoogleGenAI } from '@google/genai'

export default defineEventHandler(async (event) => {
  const config = useRuntimeConfig()
  const body = await readBody(event)
  if (!body?.message) {
    throw createError({ statusCode: 400, message: 'message is required' })
  }

  // Configure SSE headers
  setHeader(event, 'Content-Type', 'text/event-stream')
  setHeader(event, 'Cache-Control', 'no-cache, no-transform')
  setHeader(event, 'Connection', 'keep-alive')
  setHeader(event, 'X-Accel-Buffering', 'no') // Critical: disables nginx/proxy buffering

  const genai = new GoogleGenAI({ apiKey: config.geminiApiKey })
  const encoder = new TextEncoder()

  // sendStream expects a stream object rather than a callback,
  // so wrap the Gemini chunk iterator in a web ReadableStream
  const readable = new ReadableStream<Uint8Array>({
    async start(controller) {
      const send = (payload: Record<string, unknown>) =>
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(payload)}\n\n`))
      let totalTokens = 0
      try {
        const result = await genai.models.generateContentStream({
          model: 'gemini-2.5-flash',
          contents: body.message,
          config: {
            systemInstruction: body.systemInstruction,
            maxOutputTokens: body.maxTokens ?? 2048,
            temperature: body.temperature ?? 0.7
          }
        })
        for await (const chunk of result) {
          const text = chunk.text
          if (text) send({ type: 'text', text })
          // Accumulate usage from the final chunk
          if (chunk.usageMetadata?.totalTokenCount) {
            totalTokens = chunk.usageMetadata.totalTokenCount
          }
        }
        // Send completion event with metadata
        send({ type: 'done', totalTokens })
      } catch (error: any) {
        send({ type: 'error', message: error.message })
      } finally {
        controller.close()
      }
    }
  })

  return sendStream(event, readable)
})
Consuming the Stream with a Composable
Encapsulating stream consumption in a composable makes it reusable across different UI components:
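The composable itself is not reproduced here. As a minimal sketch of its core, the code below parses the `data: {...}` events emitted by the streaming route and reports text deltas through a callback. The names createSseParser and streamChat are hypothetical, and the event shapes are taken from the server route above; a real composable would wrap this in Vue refs for the message text and loading state.

```typescript
type StreamEvent =
  | { type: 'text'; text: string }
  | { type: 'done'; totalTokens: number }
  | { type: 'error'; message: string }

// Incremental SSE parser: feed it decoded text as it arrives. It returns the
// complete events and keeps any partial event buffered for the next call.
function createSseParser() {
  let buffer = ''
  return function feed(text: string): StreamEvent[] {
    buffer += text
    const frames = buffer.split('\n\n')
    buffer = frames.pop() ?? '' // the last piece may be an incomplete event
    const events: StreamEvent[] = []
    for (const frame of frames) {
      for (const line of frame.split('\n')) {
        if (line.startsWith('data: ')) {
          events.push(JSON.parse(line.slice(6)) as StreamEvent)
        }
      }
    }
    return events
  }
}

// Hypothetical consumer: POSTs to the streaming route and forwards text deltas.
async function streamChat(
  message: string,
  onText: (delta: string) => void
): Promise<void> {
  const res = await fetch('/api/chat-stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
  })
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`)

  const reader = res.body.getReader()
  const decoder = new TextDecoder()
  const feed = createSseParser()
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    for (const ev of feed(decoder.decode(value, { stream: true }))) {
      if (ev.type === 'text') onText(ev.text)
      else if (ev.type === 'error') throw new Error(ev.message)
    }
  }
}
```

Buffering matters because network chunks do not line up with event boundaries: a single JSON payload can arrive split across two reads.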
Managing conversation history correctly is essential for a coherent multi-turn experience. Here's a production-ready Pinia store that handles message persistence and token budgeting.
// stores/chat.ts
import { defineStore } from 'pinia'

interface Message {
  id: string
  role: 'user' | 'model'
  content: string
  timestamp: number
  tokenCount?: number
}

const MAX_HISTORY_MESSAGES = 20 // Limit to control token costs

export const useChatStore = defineStore('chat', () => {
  const messages = ref<Message[]>([])
  const sessionId = ref(crypto.randomUUID())
  const totalTokensUsed = ref(0)

  function addMessage(role: Message['role'], content: string, tokenCount?: number) {
    messages.value.push({
      id: crypto.randomUUID(),
      role,
      content,
      timestamp: Date.now(),
      tokenCount
    })
    // Trim history to limit token usage
    if (messages.value.length > MAX_HISTORY_MESSAGES) {
      messages.value = messages.value.slice(-MAX_HISTORY_MESSAGES)
    }
    if (tokenCount) totalTokensUsed.value += tokenCount
  }

  function clearChat() {
    messages.value = []
    sessionId.value = crypto.randomUUID()
    totalTokensUsed.value = 0
  }

  // Shape history for the Gemini API `contents` field
  const apiHistory = computed(() =>
    messages.value.map(m => ({
      role: m.role,
      parts: [{ text: m.content }]
    }))
  )

  return { messages, sessionId, totalTokensUsed, addMessage, clearChat, apiHistory }
})
// server/api/chat-multi.post.ts
import { GoogleGenAI } from '@google/genai'

export default defineEventHandler(async (event) => {
  const config = useRuntimeConfig()
  const { message, history = [], systemInstruction } = await readBody(event)

  if (!message || typeof message !== 'string') {
    throw createError({ statusCode: 400, message: 'message is required' })
  }

  const genai = new GoogleGenAI({ apiKey: config.geminiApiKey })

  // Build the contents array from history + current message
  const contents = [
    ...history.slice(-20), // Safety cap on history length
    { role: 'user', parts: [{ text: message }] }
  ]

  const response = await genai.models.generateContent({
    model: 'gemini-2.5-flash',
    contents,
    config: {
      systemInstruction: systemInstruction || 'You are a helpful, concise assistant.',
      maxOutputTokens: 2048
    }
  })

  return {
    text: response.text,
    tokenCount: response.usageMetadata?.totalTokenCount
  }
})
Rate Limiting and Authentication Middleware
Layered Rate Limiting
For a real production deployment, rate limiting should be applied in two layers: per-IP (to prevent abuse) and per-user (to enforce subscription tiers).
// server/middleware/rate-limit.ts
interface RateLimitRecord {
  count: number
  resetAt: number
}

// In production: replace with Cloudflare KV or Redis
const ipCounters = new Map<string, RateLimitRecord>()

function getRateLimitConfig(path: string) {
  // Stricter limits for the streaming endpoint
  if (path.includes('stream')) return { max: 5, windowMs: 60_000 }
  return { max: 15, windowMs: 60_000 }
}

export default defineEventHandler((event) => {
  if (!event.path.startsWith('/api/chat')) return

  const { max, windowMs } = getRateLimitConfig(event.path)
  const ip =
    getHeader(event, 'cf-connecting-ip') || // Cloudflare real IP
    getHeader(event, 'x-forwarded-for')?.split(',')[0].trim() ||
    event.node.req.socket.remoteAddress ||
    'unknown'

  const now = Date.now()
  const record = ipCounters.get(ip)

  if (!record || now > record.resetAt) {
    ipCounters.set(ip, { count: 1, resetAt: now + windowMs })
    return
  }

  if (record.count >= max) {
    const retryAfter = Math.ceil((record.resetAt - now) / 1000)
    setHeader(event, 'Retry-After', String(retryAfter))
    setHeader(event, 'X-RateLimit-Limit', String(max))
    setHeader(event, 'X-RateLimit-Remaining', '0')
    setHeader(event, 'X-RateLimit-Reset', String(Math.ceil(record.resetAt / 1000)))
    throw createError({
      statusCode: 429,
      message: `Rate limit exceeded. Please retry after ${retryAfter} seconds.`
    })
  }

  record.count++
  setHeader(event, 'X-RateLimit-Remaining', String(max - record.count))
})
API Key Authentication for Machine Clients
// server/middleware/auth.ts
const PUBLIC_PATHS = ['/api/health', '/api/status']

export default defineEventHandler(async (event) => {
  if (!event.path.startsWith('/api/chat')) return
  if (PUBLIC_PATHS.includes(event.path)) return

  const apiKey = getHeader(event, 'x-api-key')
  const authHeader = getHeader(event, 'authorization')
  const bearerToken = authHeader?.startsWith('Bearer ') ? authHeader.slice(7) : null

  if (!apiKey && !bearerToken) {
    throw createError({
      statusCode: 401,
      message: 'Authentication required. Provide X-API-Key or Authorization: Bearer header.'
    })
  }

  // Validate against your user database or KV store
  // const isValid = await validateApiKey(apiKey ?? bearerToken!)
  // if (!isValid) throw createError({ statusCode: 403, message: 'Invalid credentials' })
})
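With the validation lines commented out, the middleware above accepts any non-empty credential. One possible shape for `validateApiKey` is sketched below, assuming a Node-compatible runtime with `node:crypto`. The in-memory `STORED_HASHES` list and the demo key are hypothetical stand-ins for a database or KV lookup. The sketch stores only SHA-256 hashes of issued keys and compares them in constant time.

```typescript
import { createHash, timingSafeEqual } from 'node:crypto'

// Hash a key so the raw value never needs to be stored.
function sha256Hex(value: string): string {
  return createHash('sha256').update(value).digest('hex')
}

// Hypothetical store of hashed keys; replace with a database or KV lookup.
const STORED_HASHES: string[] = [sha256Hex('demo-key-123')]

function validateApiKey(candidate: string, storedHashes: string[] = STORED_HASHES): boolean {
  const candidateHash = Buffer.from(sha256Hex(candidate), 'hex')
  let match = false
  for (const stored of storedHashes) {
    const storedBuf = Buffer.from(stored, 'hex')
    // timingSafeEqual throws on length mismatch, so check lengths first.
    if (storedBuf.length === candidateHash.length && timingSafeEqual(storedBuf, candidateHash)) {
      match = true
    }
  }
  return match
}
```

This version is synchronous; `await validateApiKey(...)` in the middleware still works, and the function can become `async` once it reads from a real store.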
Production Deployment Strategies
Cloudflare Pages + Workers (Recommended)
Cloudflare Pages with Workers offers edge-distributed execution, meaning your Nuxt server routes run in data centers close to your users worldwide — ideal for latency-sensitive AI streaming applications.
# Build for Cloudflare Pages
NITRO_PRESET=cloudflare-pages npm run build

# Deploy using Wrangler
npm install -D wrangler
npx wrangler pages deploy .output/public --project-name=gemini-nuxt-app

# wrangler.toml
name = "gemini-nuxt-app"
compatibility_date = "2026-04-01"
compatibility_flags = ["nodejs_compat"]

[vars]
APP_ENV = "production"
NODE_ENV = "production"
# Never put secrets here; use `wrangler secret put` instead

# Store the API key securely as a Workers Secret
npx wrangler secret put GEMINI_API_KEY
# Paste your key when prompted (it's encrypted at rest)
For the rate limiter in a Workers environment, replace the in-memory Map with Cloudflare KV:
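One way this swap could look, as a minimal sketch: the `KVLike` interface below is a hand-written subset of the KV binding's `get` and `put` methods, and `checkRateLimitKV` and the `rl:` key prefix are hypothetical names. KV is eventually consistent and the read-modify-write here is not atomic, so the counts are approximate.

```typescript
// Minimal subset of the Workers KV binding used by this sketch.
interface KVLike {
  get(key: string): Promise<string | null>
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>
}

interface RateLimitRecord {
  count: number
  resetAt: number
}

interface RateLimitResult {
  allowed: boolean
  remaining: number
  retryAfterSec: number
}

async function checkRateLimitKV(
  kv: KVLike,
  ip: string,
  max: number,
  windowMs: number,
  now: number = Date.now()
): Promise<RateLimitResult> {
  const key = `rl:${ip}`
  const raw = await kv.get(key)
  let record: RateLimitRecord | null = raw ? JSON.parse(raw) : null

  // Missing or expired record: start a fresh window.
  if (!record || now > record.resetAt) {
    record = { count: 0, resetAt: now + windowMs }
  }

  if (record.count >= max) {
    return {
      allowed: false,
      remaining: 0,
      retryAfterSec: Math.ceil((record.resetAt - now) / 1000)
    }
  }

  record.count++
  // KV rejects expirationTtl values below 60 seconds, so clamp to that floor.
  const ttl = Math.max(60, Math.ceil((record.resetAt - now) / 1000))
  await kv.put(key, JSON.stringify(record), { expirationTtl: ttl })
  return { allowed: true, remaining: max - record.count, retryAfterSec: 0 }
}
```

If strict per-key accuracy matters, a strongly consistent store such as Redis (already mentioned in the middleware comment) is the safer choice; KV is a reasonable fit when a small overshoot is acceptable.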
Set the GEMINI_API_KEY environment variable in the Vercel dashboard under Project Settings → Environment Variables. Vercel automatically injects it at build and runtime.
Transient errors from the Gemini API (HTTP 429 and 503) should be handled with exponential backoff and jitter. For a detailed treatment of retry patterns, see the Gemini API Error Handling and Retry Patterns Complete Guide.
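The backoff-with-jitter idea can be sketched as a small wrapper. This is an illustration under assumptions, not the linked guide's code: `withRetry`, `isTransient`, and the option names are hypothetical, and the sketch assumes the thrown error exposes a numeric `status` property.

```typescript
interface RetryOptions {
  maxAttempts?: number
  baseDelayMs?: number
  maxDelayMs?: number
  // Injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>
}

// Assumption: the error carries an HTTP status code in `status`.
function isTransient(err: unknown): boolean {
  const status = (err as { status?: number } | null)?.status
  return status === 429 || status === 503
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const {
    maxAttempts = 4,
    baseDelayMs = 500,
    maxDelayMs = 8_000,
    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))
  } = opts

  for (let attempt = 1; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // Give up on non-transient errors or once attempts are exhausted.
      if (attempt >= maxAttempts || !isTransient(err)) throw err
      // Exponential ceiling: base, 2x base, 4x base, ... capped at maxDelayMs.
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1))
      // Full jitter: wait a random time between 0 and the ceiling.
      await sleep(Math.random() * ceiling)
    }
  }
}
```

In a server route this would wrap the model call, for example `withRetry(() => genai.models.generateContent({ ... }))`. Randomizing the delay keeps many clients that failed at the same moment from retrying in lockstep.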
Unoptimized Gemini API usage can lead to surprisingly high bills. Here are the most impactful techniques. For a comprehensive cost breakdown, see the Gemini API Cost Optimization Complete Guide.
1. Always cap maxOutputTokens
Without an explicit cap, the model can generate up to its maximum output length, which may run to thousands of tokens even for simple requests. Set a reasonable cap:
config: {
  // Pick one cap to match the use case:
  //   512  for Q&A bots
  //   1024 for content generation
  //   2048 for long-form tasks
  maxOutputTokens: 512,
  temperature: 0.5 // Lower for factual tasks
}
2. Use Gemini 2.5 Flash as the default
gemini-2.5-flash costs several times less per token than gemini-2.5-pro (check the current pricing page for the exact ratio) and handles the majority of use cases well. Escalate to Pro only when you need complex multi-step reasoning.
3. Cache repeated responses
// In your server route: check the cache before calling the API
const cacheKey = `chat:${JSON.stringify({ message: body.message, model: 'gemini-2.5-flash' })}`
const cached = getCached(cacheKey)
if (cached) return { text: cached, cached: true }

const response = await genai.models.generateContent(/* ... */)
setCached(cacheKey, response.text)
return { text: response.text, cached: false }
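The snippet above calls `getCached` and `setCached` without defining them. A minimal in-memory version might look like the following. The TTL, the entry cap, and the eviction policy are illustrative assumptions; on a multi-instance or edge deployment the Map would need to be replaced by a shared store.

```typescript
interface CacheEntry {
  value: string
  expiresAt: number
}

const DEFAULT_TTL_MS = 5 * 60_000 // assumed 5-minute freshness window
const MAX_ENTRIES = 500           // assumed cap to bound memory use

const responseCache = new Map<string, CacheEntry>()

function getCached(key: string, now: number = Date.now()): string | undefined {
  const entry = responseCache.get(key)
  if (!entry) return undefined
  // Expired entries are removed lazily on read.
  if (now > entry.expiresAt) {
    responseCache.delete(key)
    return undefined
  }
  return entry.value
}

function setCached(
  key: string,
  value: string,
  ttlMs: number = DEFAULT_TTL_MS,
  now: number = Date.now()
): void {
  // When full, evict the oldest inserted key (Map iterates in insertion order).
  if (responseCache.size >= MAX_ENTRIES && !responseCache.has(key)) {
    const oldest = responseCache.keys().next().value
    if (oldest !== undefined) responseCache.delete(oldest)
  }
  responseCache.set(key, { value, expiresAt: now + ttlMs })
}
```

Exact-match caching like this only pays off for prompts that repeat verbatim, such as FAQ-style questions; free-form chat with history rarely produces identical keys.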
4. Leverage Context Caching for fixed content
When your app uses a long, fixed system prompt or knowledge base, cache it at the API level:
const cachedContent = await genai.caches.create({
  model: 'gemini-2.5-flash',
  // In @google/genai, cache settings are passed under `config`
  config: {
    systemInstruction: 'You are a customer support agent for Acme Corp...',
    contents: [/* your knowledge base */],
    ttl: '3600s' // 1-hour cache
  }
})

// Use the cached content in subsequent requests
const response = await genai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: userMessage,
  config: { cachedContent: cachedContent.name }
})
For end-to-end testing of the streaming UI, Playwright handles SSE connections gracefully:
// tests/e2e/streaming.spec.ts
import { test, expect } from '@playwright/test'

test('streaming chat renders tokens progressively', async ({ page }) => {
  await page.goto('/')

  const textarea = page.locator('textarea')
  const sendButton = page.locator('button:has-text("Send")')
  const responseArea = page.locator('.response-area')

  await textarea.fill('Tell me a short joke')
  await sendButton.click()

  // Verify streaming indicator appears
  await expect(page.locator('.cursor')).toBeVisible()

  // Wait for streaming to complete
  await expect(page.locator('.cursor')).toBeHidden({ timeout: 30_000 })

  // Verify response content is non-empty
  const text = await responseArea.textContent()
  expect(text?.length).toBeGreaterThan(10)
})
Mocking the API in Development
Use a Nitro plugin under server/plugins/ to short-circuit requests to the chat endpoint during development or testing, so you can work on the UI without consuming real API quota:
// server/plugins/mock-gemini.ts
// Only active when MOCK_AI=true in .env
export default defineNitroPlugin((nitroApp) => {
  if (process.env.MOCK_AI !== 'true') return

  nitroApp.hooks.hook('request', (event) => {
    if (event.path === '/api/chat') {
      // Answer directly and end the response so the real handler never runs
      event.node.res.writeHead(200, { 'Content-Type': 'application/json' })
      event.node.res.end(JSON.stringify({
        text: '[MOCK] This is a simulated AI response for development.',
        usageMetadata: { totalTokens: 0 }
      }))
    }
  })
})