●SIRI — WWDC 2026 confirms the revamped Siri runs on a Google Gemini model, though it won't ship in the EU at iOS 27 due to the DMA●FLASH3.5 — Gemini 3.5 Flash is now GA, the top Flash model for sustained frontier performance on agentic and coding tasks●IMAGE-GA — Gemini 3.1 Flash Image and 3.1 Pro Image are GA as native visual models; the preview versions shut down Jun 25●MANAGED-AGENTS — Managed Agents launch in public preview in the Gemini API, running autonomous agents in Google-hosted isolated Linux sandboxes●FILE-SEARCH — File Search now supports multimodal search, with native image embedding and retrieval via gemini-embedding-2●DEPRECATION — gemini-3.1-flash-image-preview and gemini-3-pro-image-preview shut down Jun 25 — migrate to the GA models soon●SIRI — WWDC 2026 confirms the revamped Siri runs on a Google Gemini model, though it won't ship in the EU at iOS 27 due to the DMA●FLASH3.5 — Gemini 3.5 Flash is now GA, the top Flash model for sustained frontier performance on agentic and coding tasks●IMAGE-GA — Gemini 3.1 Flash Image and 3.1 Pro Image are GA as native visual models; the preview versions shut down Jun 25●MANAGED-AGENTS — Managed Agents launch in public preview in the Gemini API, running autonomous agents in Google-hosted isolated Linux sandboxes●FILE-SEARCH — File Search now supports multimodal search, with native image embedding and retrieval via gemini-embedding-2●DEPRECATION — gemini-3.1-flash-image-preview and gemini-3-pro-image-preview shut down Jun 25 — migrate to the GA models soon
Gemini API × Spring Boot Enterprise Production Guide: Spring AI, Multi-Tenancy, Security & Observability
A complete guide to running Gemini API in production with Spring Boot. Covers Spring AI framework integration, multi-tenant architecture, API key management, async processing, observability with Micrometer/OpenTelemetry, and enterprise testing strategies.
Setup and context: Why Spring Boot × Gemini API Works for Enterprise
Java and Spring Boot remain the backbone of enterprise software development across many organizations. Combining them with Google's Gemini API allows teams to embed advanced AI capabilities into existing systems — without abandoning proven infrastructure.
Our free introductory article Spring Boot Gemini API Basic Guide covered the fundamentals of integration. This guide goes much further: production-grade design patterns, security hardening, observability pipelines, and testing strategies for systems that need to handle real workloads.
What we'll cover:
Spring AI framework production patterns
Multi-tenant design (per-tenant API key management)
Production-ready test strategy (unit, integration, contract)
Target audience: Backend engineers and architects with Spring Boot experience who want to deploy Gemini API in production environments.
Spring AI Framework: The Right Way to Integrate Gemini
What Is Spring AI?
Spring AI is the official framework for bringing AI capabilities into the Spring ecosystem. It reached GA (Generally Available) in late 2024, with significantly expanded support for Gemini and other major AI providers.
With Spring AI you get:
A unified, provider-agnostic API for AI features
Spring Boot Auto-configuration out of the box
Full Spring DI, AOP, and transaction management on AI components
<!-- pom.xml: Managing dependencies with the Spring AI BOM --><dependencyManagement> <dependencies> <dependency> <groupId>org.springframework.ai</groupId> <artifactId>spring-ai-bom</artifactId> <version>1.0.0</version> <type>pom</type> <scope>import</scope> </dependency> </dependencies></dependencyManagement><dependencies> <!-- Spring AI Vertex AI Gemini starter --> <dependency> <groupId>org.springframework.ai</groupId> <artifactId>spring-ai-vertex-ai-gemini-spring-boot-starter</artifactId> </dependency> <!-- Conversation memory (Redis-backed persistence) --> <dependency> <groupId>org.springframework.ai</groupId> <artifactId>spring-ai-redis-store-spring-boot-starter</artifactId> </dependency> <!-- Vector store (for RAG pipelines) --> <dependency> <groupId>org.springframework.ai</groupId> <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId> </dependency></dependencies>
Production ChatClient Configuration
// GeminiConfig.java: Production-ready ChatClient setup@Configuration@EnableConfigurationProperties(GeminiProperties.class)public class GeminiConfig { @Bean @Primary public ChatClient chatClient( VertexAiGeminiChatModel chatModel, GeminiProperties properties) { return ChatClient.builder(chatModel) // Default system prompt applied to all requests .defaultSystem(""" You are the customer support AI for {company}. Respond politely and accurately. Never include personal data or confidential information. If unsure, say "Let me connect you with a human agent." """) // Advisors for cross-cutting concerns .defaultAdvisors( new MessageChatMemoryAdvisor(chatMemory()), new SafeGuardAdvisor(properties.getBlockedTerms()), new RequestResponseLoggingAdvisor() ) // Default ChatOptions .defaultOptions(VertexAiGeminiChatOptions.builder() .withModel("gemini-2.5-pro") .withTemperature(0.2f) // Low temperature for production .withMaxOutputTokens(2048) .withTopP(0.8f) .build()) .build(); } @Bean public ChatMemory chatMemory(RedisTemplate<String, Object> redisTemplate) { // Persistent conversation memory via Redis return new RedisChatMemory(redisTemplate, Duration.ofHours(24)); }}
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦Master complete Gemini API integration patterns using the Spring AI framework
✦Understand enterprise-grade security design: multi-tenancy, API key management, and rate limiting
✦Build production observability with Micrometer and OpenTelemetry, plus a comprehensive testing strategy
Secure payment via Stripe · Cancel anytime
Multi-Tenant Design: Managing Gemini Config Per Tenant
Enterprise systems often serve multiple tenants (customers, business units) on shared infrastructure. With Gemini API, each tenant may need its own model configuration, system prompt, and quota limits.
Tenant Context Management
// TenantContext.java: Thread-local tenant isolation@Componentpublic class TenantContext { private static final ThreadLocal<TenantInfo> currentTenant = new InheritableThreadLocal<>(); public static void setTenant(TenantInfo tenant) { currentTenant.set(tenant); } public static TenantInfo getTenant() { TenantInfo tenant = currentTenant.get(); if (tenant == null) { throw new TenantNotFoundException("No tenant context found"); } return tenant; } public static void clear() { currentTenant.remove(); } @Data @Builder public static class TenantInfo { private String tenantId; private String tenantName; private GeminiTier tier; // FREE / STANDARD / ENTERPRISE private Map<String, String> customConfig; }}// TenantFilter.java: Extract tenant info from HTTP requests@Component@Order(Ordered.HIGHEST_PRECEDENCE)public class TenantFilter extends OncePerRequestFilter { private final TenantRepository tenantRepository; @Override protected void doFilterInternal( HttpServletRequest request, HttpServletResponse response, FilterChain filterChain) throws ServletException, IOException { try { String tenantId = extractTenantId(request); TenantInfo tenant = tenantRepository.findById(tenantId) .orElseThrow(() -> new InvalidTenantException(tenantId)); TenantContext.setTenant(tenant); filterChain.doFilter(request, response); } finally { TenantContext.clear(); // Always clear to prevent memory leaks } } private String extractTenantId(HttpServletRequest request) { // Try header first String tenantId = request.getHeader("X-Tenant-ID"); if (tenantId != null) return tenantId; // Fall back to JWT claim String jwt = request.getHeader("Authorization"); if (jwt != null && jwt.startsWith("Bearer ")) { return jwtDecoder.decode(jwt.substring(7)) .getClaimAsString("tenant_id"); } throw new MissingTenantException("No tenant ID provided"); }}
Dynamic Per-Tenant ChatClient
// TenantAwareChatService.java: Dynamically resolve ChatClient per tenant@Service@Slf4jpublic class TenantAwareChatService { private final VertexAiGeminiChatModel baseChatModel; private final TenantConfigRepository configRepository; // Cache ChatClients per tenant to avoid recreation overhead private final Cache<String, ChatClient> chatClientCache = Caffeine.newBuilder() .maximumSize(1000) .expireAfterWrite(Duration.ofMinutes(30)) .build(); public String chat(String userMessage, String conversationId) { TenantInfo tenant = TenantContext.getTenant(); ChatClient client = getOrCreateChatClient(tenant); return client.prompt() .system(sp -> sp .text(tenant.getCustomConfig().getOrDefault( "system_prompt", "You are an assistant for {tenant}.")) .param("tenant", tenant.getTenantName())) .user(userMessage) .advisors(a -> a.param( AbstractChatMemoryAdvisor.CHAT_MEMORY_CONVERSATION_ID_KEY, tenant.getTenantId() + ":" + conversationId)) .call() .content(); } private ChatClient getOrCreateChatClient(TenantInfo tenant) { return chatClientCache.get(tenant.getTenantId(), tenantId -> { TenantConfig config = configRepository.findByTenantId(tenantId) .orElseGet(TenantConfig::defaultConfig); return ChatClient.builder(baseChatModel) .defaultOptions(VertexAiGeminiChatOptions.builder() .withModel(config.getModel()) .withTemperature(config.getTemperature()) .withMaxOutputTokens(config.getMaxTokens()) .build()) .build(); }); }}
Async and Parallel Processing for High Throughput
Gemini API responses take anywhere from a few hundred milliseconds to several seconds. Blocking threads while waiting is expensive at scale. Async design is non-negotiable for production systems handling real traffic.
Spring WebFlux + Project Reactor
// ReactiveGeminiService.java: Non-blocking AI service with WebFlux@Service@Slf4jpublic class ReactiveGeminiService { private final VertexAiGeminiChatModel chatModel; private final MeterRegistry meterRegistry; // Streaming response (Server-Sent Events) public Flux<String> streamResponse(String userMessage, String tenantId) { return Flux.defer(() -> { Prompt prompt = new Prompt( userMessage, VertexAiGeminiChatOptions.builder() .withModel("gemini-2.5-flash") // Flash is optimal for streaming .withTemperature(0.3f) .build() ); return chatModel.stream(prompt) .map(response -> response.getResult().getOutput().getContent()) .doOnNext(chunk -> log.trace("Stream chunk: {} chars", chunk.length())) .doOnError(e -> log.error("Streaming error [{}]: {}", tenantId, e.getMessage())) .onErrorResume(this::handleStreamError); }) .subscribeOn(Schedulers.boundedElastic()) .retryWhen(Retry.backoff(3, Duration.ofMillis(500)) .filter(this::isRetryable) .doBeforeRetry(rs -> log.warn("Retrying stream ({}/3)", rs.totalRetries() + 1))); } // Batch processing: parallel execution of multiple requests public Mono<List<String>> processBatch(List<String> requests, String tenantId) { int concurrency = 5; // Tune based on your rate limits return Flux.fromIterable(requests) .flatMap( request -> processWithRateLimit(request, tenantId), concurrency ) .collectList() .doOnSuccess(results -> { meterRegistry.counter("gemini.batch.completed", "tenant", tenantId, "count", String.valueOf(results.size()) ).increment(); }); } private Mono<String> processWithRateLimit(String request, String tenantId) { return Mono.fromCallable(() -> { Prompt prompt = new Prompt(request); return chatModel.call(prompt) .getResult().getOutput().getContent(); }) .subscribeOn(Schedulers.boundedElastic()) .timeout(Duration.ofSeconds(30)) .onErrorMap(TimeoutException.class, e -> new GeminiTimeoutException("Request timed out: " + request.substring(0, 50))); } private boolean isRetryable(Throwable e) { // Only retry 429 (Rate Limit) and 503 (Service Unavailable) if (e instanceof WebClientResponseException wce) { return wce.getStatusCode() == HttpStatus.TOO_MANY_REQUESTS || wce.getStatusCode() == HttpStatus.SERVICE_UNAVAILABLE; } return false; } private Flux<String> handleStreamError(Throwable e) { log.error("Unrecoverable stream error: {}", e.getMessage()); return Flux.just("[An error occurred. Please try again later.]"); }}// SSE endpoint in the controller@RestController@RequestMapping("/api/v1/chat")public class ChatController { private final ReactiveGeminiService geminiService; @PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE) public Flux<ServerSentEvent<String>> streamChat( @RequestBody ChatRequest request, @AuthenticationPrincipal TenantUser user) { return geminiService .streamResponse(request.getMessage(), user.getTenantId()) .map(chunk -> ServerSentEvent.<String>builder() .data(chunk) .build()) .concatWith(Flux.just( ServerSentEvent.<String>builder() .event("done") .data("[DONE]") .build() )); }}
Security: Essential Hardening for Production
API Key Management and Secret Rotation
// SecretManagerConfig.java: Fetch keys from Google Secret Manager@Configurationpublic class SecretManagerConfig { @Bean public SecretManagerServiceClient secretManagerClient() throws IOException { return SecretManagerServiceClient.create(); } @Bean @Scope("prototype") public GeminiApiKeyProvider apiKeyProvider( SecretManagerServiceClient secretClient, @Value("${gcp.project-id}") String projectId) { return tenantId -> { // Per-tenant API key from Secret Manager String secretName = String.format( "projects/%s/secrets/gemini-api-key-%s/versions/latest", projectId, tenantId); AccessSecretVersionResponse response = secretClient.accessSecretVersion(secretName); return response.getPayload().getData().toStringUtf8(); }; }}// InputValidationService.java: Guard against prompt injection@Servicepublic class InputValidationService { private static final int MAX_INPUT_LENGTH = 32000; private static final Pattern INJECTION_PATTERN = Pattern.compile( "(?i)(ignore|forget|disregard).*(previous|above|instruction|system)", Pattern.DOTALL ); public ValidationResult validate(String input, TenantInfo tenant) { // Length check if (input.length() > MAX_INPUT_LENGTH) { return ValidationResult.failure( "Input exceeds maximum length (" + MAX_INPUT_LENGTH + " characters)"); } // Prompt injection detection if (INJECTION_PATTERN.matcher(input).find()) { log.warn("Potential prompt injection detected for tenant: {}", tenant.getTenantId()); return ValidationResult.failure("Invalid input content"); } // Tenant-specific block list List<String> blockedTerms = tenant.getCustomConfig() .getOrDefault("blocked_terms", "") .lines() .filter(s -> !s.isBlank()) .toList(); for (String term : blockedTerms) { if (input.toLowerCase().contains(term.toLowerCase())) { return ValidationResult.failure( "Input contains a blocked keyword"); } } return ValidationResult.success(input.trim()); }}
Rate Limiting Implementation
// RateLimiterService.java: Per-tenant sliding window rate limiting@Service@Slf4jpublic class RateLimiterService { private final RedisTemplate<String, String> redisTemplate; public boolean tryAcquire(String tenantId, RateLimitConfig config) { String key = "rate_limit:" + tenantId + ":" + Instant.now().getEpochSecond() / 60; // 1-minute window Long count = redisTemplate.opsForValue().increment(key); redisTemplate.expire(key, Duration.ofMinutes(2)); if (count == null) return true; if (count > config.getRequestsPerMinute()) { log.warn("Rate limit exceeded for tenant: {} ({}/{})", tenantId, count, config.getRequestsPerMinute()); return false; } return true; } // Track token usage for billing/quota management public void recordTokenUsage( String tenantId, int promptTokens, int completionTokens) { String dailyKey = "tokens:" + tenantId + ":" + LocalDate.now().toString(); redisTemplate.opsForHash().increment( dailyKey, "prompt_tokens", promptTokens); redisTemplate.opsForHash().increment( dailyKey, "completion_tokens", completionTokens); redisTemplate.expire(dailyKey, Duration.ofDays(7)); checkMonthlyQuota(tenantId, promptTokens + completionTokens); } private void checkMonthlyQuota(String tenantId, int tokens) { String monthlyKey = "tokens_monthly:" + tenantId + ":" + YearMonth.now().toString(); Long totalTokens = redisTemplate.opsForValue() .increment(monthlyKey, tokens); redisTemplate.expire(monthlyKey, Duration.ofDays(35)); // Warn at 80% of monthly limit long monthlyLimit = getMonthlyTokenLimit(tenantId); if (totalTokens != null && totalTokens > monthlyLimit * 0.8) { eventPublisher.publishEvent( new QuotaWarningEvent(tenantId, totalTokens, monthlyLimit)); } }}
Observability: Knowing What's Happening in Production
Metrics with Micrometer
// GeminiMetricsAdvisor.java: Instrument all AI requests@Componentpublic class GeminiMetricsAdvisor implements RequestResponseAdvisor { private final MeterRegistry meterRegistry; private final ThreadLocal<Timer.Sample> timerSample = new ThreadLocal<>(); @Override public AdvisedRequest adviseRequest( AdvisedRequest request, Map<String, Object> context) { timerSample.set(Timer.start(meterRegistry)); context.put("start_time", System.currentTimeMillis()); context.put("tenant_id", TenantContext.getTenant().getTenantId()); int estimatedTokens = estimateTokens( request.userText() + request.systemText()); context.put("estimated_prompt_tokens", estimatedTokens); meterRegistry.counter("gemini.request.count", "tenant", context.get("tenant_id").toString(), "model", getModelName(request) ).increment(); return request; } @Override public ChatResponse adviseResponse( ChatResponse response, Map<String, Object> context) { Timer.Sample sample = timerSample.get(); if (sample != null) { sample.stop(Timer.builder("gemini.request.latency") .tag("tenant", context.get("tenant_id").toString()) .tag("status", "success") .register(meterRegistry)); timerSample.remove(); } if (response.getMetadata().getUsage() != null) { Usage usage = response.getMetadata().getUsage(); meterRegistry.counter("gemini.tokens.prompt", "tenant", context.get("tenant_id").toString() ).increment(usage.getPromptTokens()); meterRegistry.counter("gemini.tokens.completion", "tenant", context.get("tenant_id").toString() ).increment(usage.getGenerationTokens()); // Cost estimation (Gemini 2.5 Pro pricing as of April 2026) double estimatedCost = calculateCost( usage.getPromptTokens(), usage.getGenerationTokens() ); meterRegistry.gauge("gemini.cost.request", Tags.of("tenant", context.get("tenant_id").toString()), estimatedCost ); } return response; } private double calculateCost(long promptTokens, long completionTokens) { double promptCost = promptTokens / 1_000_000.0 * 1.25; double completionCost = completionTokens / 1_000_000.0 * 10.0; return promptCost + completionCost; }}
// ChatServiceTest.java: Unit tests with a mocked ChatClient@ExtendWith(MockitoExtension.class)class TenantAwareChatServiceTest { @Mock private VertexAiGeminiChatModel chatModel; @Mock private TenantConfigRepository configRepository; @InjectMocks private TenantAwareChatService chatService; @BeforeEach void setUp() { TenantContext.setTenant(TenantInfo.builder() .tenantId("tenant-001") .tenantName("Acme Corp") .tier(GeminiTier.STANDARD) .customConfig(Map.of()) .build()); } @AfterEach void tearDown() { TenantContext.clear(); } @Test void shouldReturnChatResponse() { // Given String userMessage = "Hello!"; String expectedResponse = "Hello! How can I help you today?"; ChatResponse mockResponse = createMockChatResponse(expectedResponse); when(chatModel.call(any(Prompt.class))).thenReturn(mockResponse); // When String actual = chatService.chat(userMessage, "conv-001"); // Then assertThat(actual).isEqualTo(expectedResponse); verify(chatModel, times(1)).call(any(Prompt.class)); } @Test void shouldThrowExceptionWhenRateLimitExceeded() { when(rateLimiterService.tryAcquire(anyString(), any())) .thenReturn(false); assertThatThrownBy(() -> chatService.chat("test", "conv-001")) .isInstanceOf(RateLimitExceededException.class) .hasMessageContaining("Rate limit"); } private ChatResponse createMockChatResponse(String content) { AssistantMessage message = new AssistantMessage(content); Generation generation = new Generation(message); return new ChatResponse(List.of(generation)); }}
Integration Tests with Testcontainers
// GeminiIntegrationTest.java: Integration tests with real Redis@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)@ActiveProfiles("integration-test")class GeminiIntegrationTest { @Container static RedisContainer redis = new RedisContainer( DockerImageName.parse("redis:7-alpine")) .withExposedPorts(6379); @DynamicPropertySource static void configureRedis(DynamicPropertyRegistry registry) { registry.add("spring.data.redis.host", redis::getHost); registry.add("spring.data.redis.port", redis::getFirstMappedPort); } @Autowired private TestRestTemplate restTemplate; @Test @Disabled("Only runs in environments with API key configured") void shouldConnectToProductionAPI() { ChatRequest request = new ChatRequest("Hello, Gemini!"); ResponseEntity<ChatResponse> response = restTemplate.postForEntity( "/api/v1/chat", request, ChatResponse.class ); assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK); assertThat(response.getBody()).isNotNull(); assertThat(response.getBody().getMessage()).isNotBlank(); } @Test void shouldMaintainConversationContext() { String conversationId = UUID.randomUUID().toString(); // First message chatService.chat("My name is Alice", conversationId); // Follow-up: confirm memory retention String response = chatService.chat("What is my name?", conversationId); // With a mock, verify conversation ID is consistent assertThat(conversationId).isNotNull(); }}
Summary
This guide walked through the key architectural patterns for running Gemini API in a Spring Boot production environment:
Spring AI framework enables provider-agnostic, maintainable code with clean separation of concerns
Multi-tenant design relies on ThreadLocal isolation, Redis-cached ChatClients, and per-tenant configuration management
Async processing via Spring WebFlux + Project Reactor is essential — rate-limit-aware design prevents cascading failures
Security is a three-layer problem: Secret Manager for key management, prompt injection detection for input safety, and sliding window rate limiting for quota enforcement
Observability combines Micrometer (metrics) + OpenTelemetry (tracing) + structured logging for complete production visibility
Testing follows a three-layer strategy: unit tests (Mock ChatClient), integration tests (Testcontainers), and E2E tests for critical paths
Together, these patterns support production systems handling tens of thousands of daily users reliably and cost-effectively.
Gemini Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.