API / SDK · 2026-04-02 · Advanced

Gemini API × Spring Boot Enterprise Production Guide: Spring AI, Multi-Tenancy, Security & Observability

A complete guide to running Gemini API in production with Spring Boot. Covers Spring AI framework integration, multi-tenant architecture, API key management, async processing, observability with Micrometer/OpenTelemetry, and enterprise testing strategies.

Tags: gemini-api, spring-boot, spring-ai, java, enterprise, multi-tenant, production, observability

Setup and context: Why Spring Boot × Gemini API Works for Enterprise

Java and Spring Boot remain the backbone of enterprise software development across many organizations. Combining them with Google's Gemini API allows teams to embed advanced AI capabilities into existing systems — without abandoning proven infrastructure.

Our free introductory article, "Spring Boot Gemini API Basic Guide", covered the fundamentals of integration. This guide goes further: production-grade design patterns, security hardening, observability pipelines, and testing strategies for systems that need to handle real workloads.

What we'll cover:

Spring AI framework production patterns
Multi-tenant design (per-tenant API key management)
Persistent conversation memory management
Async and parallel processing for high throughput
Security implementation (API key management, rate limiting, input validation)
Observability with Micrometer and OpenTelemetry
Production-ready test strategy (unit, integration, contract)

Target audience: Backend engineers and architects with Spring Boot experience who want to deploy Gemini API in production environments.

Spring AI Framework: The Right Way to Integrate Gemini

What Is Spring AI?

Spring AI is the Spring project for bringing AI capabilities into the Spring ecosystem. Version 1.0 reached general availability (GA) in May 2025 and supports Gemini alongside other major AI providers.

With Spring AI you get:

A unified, provider-agnostic API for AI features
Spring Boot Auto-configuration out of the box
Full Spring DI, AOP, and transaction management on AI components

<!-- pom.xml: Managing dependencies with the Spring AI BOM -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.springframework.ai</groupId>
      <artifactId>spring-ai-bom</artifactId>
      <version>1.0.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
 
<dependencies>
  <!-- Spring AI Vertex AI Gemini starter (artifact ID as of Spring AI 1.0) -->
  <dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-vertex-ai-gemini</artifactId>
  </dependency>
 
  <!-- Redis client for conversation memory; version managed by Spring Boot
       (assumes the Spring Boot parent or BOM is in use) -->
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
  </dependency>
 
  <!-- Vector store (for RAG pipelines) -->
  <dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
  </dependency>
</dependencies>

Production ChatClient Configuration

// GeminiConfig.java: Production-ready ChatClient setup
@Configuration
@EnableConfigurationProperties(GeminiProperties.class)
public class GeminiConfig {
 
    @Bean
    @Primary
    public ChatClient chatClient(
            VertexAiGeminiChatModel chatModel,
            ChatMemory chatMemory,
            GeminiProperties properties) {
 
        return ChatClient.builder(chatModel)
            // Default system prompt applied to all requests.
            // {company} is a template variable and is not filled in automatically:
            // bind it per request with .system(s -> s.param("company", ...)).
            .defaultSystem("""
                You are the customer support AI for {company}.
                Respond politely and accurately.
                Never include personal data or confidential information.
                If unsure, say "Let me connect you with a human agent."
                """)
            // Advisors for cross-cutting concerns
            .defaultAdvisors(
                MessageChatMemoryAdvisor.builder(chatMemory).build(),
                new SafeGuardAdvisor(properties.getBlockedTerms()),
                new RequestResponseLoggingAdvisor()
            )
            // Default ChatOptions (builder method names as of Spring AI 1.0)
            .defaultOptions(VertexAiGeminiChatOptions.builder()
                .model("gemini-2.5-pro")
                .temperature(0.2)   // Low temperature for production
                .maxOutputTokens(2048)
                .topP(0.8)
                .build())
            .build();
    }
 
    @Bean
    public ChatMemory chatMemory(RedisTemplate<String, Object> redisTemplate) {
        // Persistent conversation memory via Redis.
        // RedisChatMemory is assumed to be an application class implementing ChatMemory.
        return new RedisChatMemory(redisTemplate, Duration.ofHours(24));
    }
}

# application-production.yml
spring:
  ai:
    vertex:
      ai:
        gemini:
          project-id: ${GCP_PROJECT_ID}
          location: us-central1
          # Production auth uses a service account (Application Default Credentials), not an API key
          transport: grpc    # gRPC or REST; gRPC itself runs over HTTP/2
 
  # Redis conversation memory
  data:
    redis:
      host: ${REDIS_HOST}
      port: 6379
      password: ${REDIS_PASSWORD}
      ssl:
        enabled: true
 
gemini:
  blocked-terms:
    - "password"
    - "credit card"
  rate-limit:
    requests-per-minute: 60
    tokens-per-minute: 100000

Multi-Tenant Design: Managing Gemini Config Per Tenant

Enterprise systems often serve multiple tenants (customers, business units) on shared infrastructure. With Gemini API, each tenant may need its own model configuration, system prompt, and quota limits.
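
One way to picture "per-tenant configuration" is a set of tier defaults that individual tenants can override. The sketch below uses only the JDK; `Tier`, `TenantSettings`, the override keys, and every number in it are hypothetical placeholders, not values taken from the Gemini API or from this guide's later code.

```java
import java.util.Map;

// Hypothetical tier enum mirroring the FREE / STANDARD / ENTERPRISE tiers used in this guide.
enum Tier { FREE, STANDARD, ENTERPRISE }

// Hypothetical value object: the settings that can differ from tenant to tenant.
record TenantSettings(String model, int requestsPerMinute, String systemPrompt) {

    // Placeholder defaults per tier; the numbers are illustrative, not recommendations.
    static TenantSettings defaultsFor(Tier tier) {
        return switch (tier) {
            case FREE -> new TenantSettings("gemini-2.5-flash", 10, "You are a helpful assistant.");
            case STANDARD -> new TenantSettings("gemini-2.5-flash", 60, "You are a helpful assistant.");
            case ENTERPRISE -> new TenantSettings("gemini-2.5-pro", 600, "You are a helpful assistant.");
        };
    }

    // Tenant-specific overrides win over the tier defaults.
    TenantSettings withOverrides(Map<String, String> custom) {
        return new TenantSettings(
            custom.getOrDefault("model", model),
            Integer.parseInt(custom.getOrDefault(
                "requests_per_minute", String.valueOf(requestsPerMinute))),
            custom.getOrDefault("system_prompt", systemPrompt));
    }
}
```

Resolving settings this way keeps the tier as the single source of defaults, so a tenant record only has to store what differs from its tier.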

Tenant Context Management

// TenantContext.java: Thread-local tenant isolation
@Component
public class TenantContext {
 
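    // Plain ThreadLocal on purpose: InheritableThreadLocal copies the tenant into any
    // thread created during a request, and pooled threads would then keep a stale tenant.
    private static final ThreadLocal<TenantInfo> currentTenant =
        new ThreadLocal<>();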
 
    public static void setTenant(TenantInfo tenant) {
        currentTenant.set(tenant);
    }
 
    public static TenantInfo getTenant() {
        TenantInfo tenant = currentTenant.get();
        if (tenant == null) {
            throw new TenantNotFoundException("No tenant context found");
        }
        return tenant;
    }
 
    public static void clear() {
        currentTenant.remove();
    }
 
    @Data
    @Builder
    public static class TenantInfo {
        private String tenantId;
        private String tenantName;
        private GeminiTier tier;   // FREE / STANDARD / ENTERPRISE
        private Map<String, String> customConfig;
    }
}
 
// TenantFilter.java: Extract tenant info from HTTP requests
@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
@RequiredArgsConstructor
public class TenantFilter extends OncePerRequestFilter {
 
    private final TenantRepository tenantRepository;
    private final JwtDecoder jwtDecoder;   // used by extractTenantId below
 
    @Override
    protected void doFilterInternal(
            HttpServletRequest request,
            HttpServletResponse response,
            FilterChain filterChain) throws ServletException, IOException {
 
        try {
            String tenantId = extractTenantId(request);
            TenantInfo tenant = tenantRepository.findById(tenantId)
                .orElseThrow(() -> new InvalidTenantException(tenantId));
 
            TenantContext.setTenant(tenant);
            filterChain.doFilter(request, response);
        } finally {
            TenantContext.clear(); // Always clear to prevent memory leaks
        }
    }
 
    private String extractTenantId(HttpServletRequest request) {
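        // Try header first. Only trust X-Tenant-ID when a gateway you control sets it;
        // otherwise a client could select another tenant simply by sending this header.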
        String tenantId = request.getHeader("X-Tenant-ID");
        if (tenantId != null) return tenantId;
 
        // Fall back to JWT claim
        String jwt = request.getHeader("Authorization");
        if (jwt != null && jwt.startsWith("Bearer ")) {
            return jwtDecoder.decode(jwt.substring(7))
                .getClaimAsString("tenant_id");
        }
 
        throw new MissingTenantException("No tenant ID provided");
    }
}
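
Because the tenant is bound to the request thread, work handed to a pooled executor thread does not reliably see it unless it is copied across explicitly. A minimal JDK-only sketch of that hand-off follows; `TenantHolder` and `TenantPropagation` are hypothetical stand-ins for `TenantContext`, not classes from Spring or from this guide.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for TenantContext, reduced to a tenant ID string.
final class TenantHolder {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void set(String tenantId) { CURRENT.set(tenantId); }
    static String get() { return CURRENT.get(); }   // null when no tenant is bound
    static void clear() { CURRENT.remove(); }
}

// Hypothetical helper: captures the caller's tenant when the task is wrapped
// and binds it on whichever thread later runs the task.
final class TenantPropagation {
    static <T> Supplier<T> wrap(Supplier<T> task) {
        String captured = TenantHolder.get();          // read on the submitting thread
        return () -> {
            String previous = TenantHolder.get();      // state of the worker thread
            TenantHolder.set(captured);
            try {
                return task.get();
            } finally {
                // Put the worker back as it was so tenants never leak between tasks
                if (previous == null) {
                    TenantHolder.clear();
                } else {
                    TenantHolder.set(previous);
                }
            }
        };
    }
}
```

A caller would wrap the task at submission time, for example `CompletableFuture.supplyAsync(TenantPropagation.wrap(task), executor)`, so the capture happens while the request's tenant is still bound.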

Dynamic Per-Tenant ChatClient

// TenantAwareChatService.java: Dynamically resolve ChatClient per tenant
@Service
@Slf4j
@RequiredArgsConstructor
public class TenantAwareChatService {
 
    private final VertexAiGeminiChatModel baseChatModel;
    private final TenantConfigRepository configRepository;
    private final ChatMemory chatMemory;
 
    // Cache ChatClients per tenant to avoid recreation overhead.
    // Config changes take up to 30 minutes to reach a cached client.
    private final Cache<String, ChatClient> chatClientCache =
        Caffeine.newBuilder()
            .maximumSize(1000)
            .expireAfterWrite(Duration.ofMinutes(30))
            .build();
 
    public String chat(String userMessage, String conversationId) {
        TenantInfo tenant = TenantContext.getTenant();
        ChatClient client = getOrCreateChatClient(tenant);
 
        return client.prompt()
            .system(sp -> sp
                .text(tenant.getCustomConfig().getOrDefault(
                    "system_prompt",
                    "You are an assistant for {tenant}."))
                .param("tenant", tenant.getTenantName()))
            .user(userMessage)
            // Prefix the conversation ID with the tenant ID so tenants never share history
            .advisors(a -> a.param(
                ChatMemory.CONVERSATION_ID,
                tenant.getTenantId() + ":" + conversationId))
            .call()
            .content();
    }
 
    private ChatClient getOrCreateChatClient(TenantInfo tenant) {
        return chatClientCache.get(tenant.getTenantId(), tenantId -> {
            TenantConfig config = configRepository.findByTenantId(tenantId)
                .orElseGet(TenantConfig::defaultConfig);
 
            return ChatClient.builder(baseChatModel)
                // Without a memory advisor the conversation ID set in chat() has no effect
                .defaultAdvisors(MessageChatMemoryAdvisor.builder(chatMemory).build())
                .defaultOptions(VertexAiGeminiChatOptions.builder()
                    .model(config.getModel())
                    .temperature(config.getTemperature())
                    .maxOutputTokens(config.getMaxTokens())
                    .build())
                .build();
        });
    }
}

Async and Parallel Processing for High Throughput

Gemini API responses take anywhere from a few hundred milliseconds to several seconds. Blocking threads while waiting is expensive at scale. Async design is non-negotiable for production systems handling real traffic.

Spring WebFlux + Project Reactor

// ReactiveGeminiService.java: Non-blocking AI service with WebFlux
@Service
@Slf4j
@RequiredArgsConstructor
public class ReactiveGeminiService {
 
    private final VertexAiGeminiChatModel chatModel;
    private final MeterRegistry meterRegistry;
 
    // Streaming response (Server-Sent Events)
    public Flux<String> streamResponse(String userMessage, String tenantId) {
        return Flux.defer(() -> {
            Prompt prompt = new Prompt(
                userMessage,
                VertexAiGeminiChatOptions.builder()
                    .model("gemini-2.5-flash")  // Flash: lower latency, suits streaming
                    .temperature(0.3)
                    .build()
            );
 
            return chatModel.stream(prompt)
                // Skip chunks that carry no text (for example a metadata-only chunk)
                .mapNotNull(response -> response.getResult() == null
                    ? null
                    : response.getResult().getOutput().getText())
                .doOnNext(chunk -> log.trace("Stream chunk: {} chars", chunk.length()));
        })
        .subscribeOn(Schedulers.boundedElastic())
        // Retry first. The fallback must come last, or it would swallow the error
        // before retryWhen ever sees it. A retry restarts the stream from the
        // beginning, so chunks emitted before the failure are sent again.
        .retryWhen(Retry.backoff(3, Duration.ofMillis(500))
            .filter(this::isRetryable)
            .doBeforeRetry(rs -> log.warn("Retrying stream ({}/3)", rs.totalRetries() + 1)))
        .doOnError(e -> log.error("Streaming error [{}]: {}", tenantId, e.getMessage()))
        .onErrorResume(this::handleStreamError);
    }
 
    // Batch processing: parallel execution of multiple requests
    public Mono<List<String>> processBatch(List<String> requests, String tenantId) {
        int concurrency = 5; // Tune based on your rate limits
 
        return Flux.fromIterable(requests)
            // flatMapSequential keeps results in request order;
            // plain flatMap would emit them in completion order
            .flatMapSequential(
                request -> processWithRateLimit(request, tenantId),
                concurrency
            )
            .collectList()
            .doOnSuccess(results -> {
                // Record the batch size as the counter value, not as a tag:
                // a "count" tag creates a new time series per distinct size
                meterRegistry.counter("gemini.batch.requests.completed",
                    "tenant", tenantId
                ).increment(results.size());
            });
    }
 
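    // Despite its name, this method only applies a timeout; call
    // RateLimiterService.tryAcquire here if each request should count against the limit.
    private Mono<String> processWithRateLimit(String request, String tenantId) {
        return Mono.fromCallable(() -> {
                Prompt prompt = new Prompt(request);
                return chatModel.call(prompt)
                    .getResult().getOutput().getText();
            })
            .subscribeOn(Schedulers.boundedElastic())
            .timeout(Duration.ofSeconds(30))
            .onErrorMap(TimeoutException.class,
                // Guard the preview: substring(0, 50) throws on requests shorter than 50 chars
                e -> new GeminiTimeoutException("Request timed out: "
                    + request.substring(0, Math.min(50, request.length()))));
    }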
 
    private boolean isRetryable(Throwable e) {
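        // Only retry 429 (Rate Limit) and 503 (Service Unavailable).
        // With the gRPC transport, failures may surface as other exception types
        // (for example Google API client exceptions); match what you actually observe.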
        if (e instanceof WebClientResponseException wce) {
            return wce.getStatusCode() == HttpStatus.TOO_MANY_REQUESTS ||
                   wce.getStatusCode() == HttpStatus.SERVICE_UNAVAILABLE;
        }
        return false;
    }
 
    private Flux<String> handleStreamError(Throwable e) {
        log.error("Unrecoverable stream error: {}", e.getMessage());
        return Flux.just("[An error occurred. Please try again later.]");
    }
}
 
// SSE endpoint in the controller
@RestController
@RequestMapping("/api/v1/chat")
public class ChatController {
 
    private final ReactiveGeminiService geminiService;
 
    @PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> streamChat(
            @RequestBody ChatRequest request,
            @AuthenticationPrincipal TenantUser user) {
 
        return geminiService
            .streamResponse(request.getMessage(), user.getTenantId())
            .map(chunk -> ServerSentEvent.<String>builder()
                .data(chunk)
                .build())
            .concatWith(Flux.just(
                ServerSentEvent.<String>builder()
                    .event("done")
                    .data("[DONE]")
                    .build()
            ));
    }
}
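
To see what `Retry.backoff(3, Duration.ofMillis(500))` means in wall-clock terms: Reactor doubles the delay after each failed attempt, starting from the minimum, and then applies random jitter (a factor of 0.5 by default). The JDK-only sketch below lists the nominal, pre-jitter delays; `BackoffSchedule` is a hypothetical helper for illustration, not part of Reactor, and the cap parameter is an assumption added for the example.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Reactor: lists the nominal (pre-jitter) delays
// of an exponential backoff that doubles after every failed attempt.
final class BackoffSchedule {
    static List<Duration> nominalDelays(int maxRetries, Duration minBackoff, Duration maxBackoff) {
        List<Duration> delays = new ArrayList<>();
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            // minBackoff * 2^attempt; fine for the small retry counts used here
            Duration delay = minBackoff.multipliedBy(1L << attempt);
            delays.add(delay.compareTo(maxBackoff) > 0 ? maxBackoff : delay);
        }
        return delays;
    }
}
```

For three retries starting at 500 ms the nominal delays are 500 ms, 1 s, and 2 s, so a request that fails every time waits roughly 3.5 s in backoff alone, on top of the duration of each attempt, before the fallback message is sent.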

Security: Essential Hardening for Production

API Key Management and Secret Rotation

// SecretManagerConfig.java: Fetch keys from Google Secret Manager
@Configuration
public class SecretManagerConfig {
 
    @Bean
    public SecretManagerServiceClient secretManagerClient()
            throws IOException {
        return SecretManagerServiceClient.create();
    }
 
    @Bean
    public GeminiApiKeyProvider apiKeyProvider(
            SecretManagerServiceClient secretClient,
            @Value("${gcp.project-id}") String projectId) {
 
        return tenantId -> {
            // Per-tenant API key from Secret Manager. Each call is a network round trip,
            // so cache the value with a short TTL rather than fetching on every request.
            String secretName = String.format(
                "projects/%s/secrets/gemini-api-key-%s/versions/latest",
                projectId, tenantId);
 
            AccessSecretVersionResponse response =
                secretClient.accessSecretVersion(secretName);
 
            return response.getPayload().getData().toStringUtf8();
        };
    }
}
 
// InputValidationService.java: Guard against prompt injection
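@Service
@Slf4j
public class InputValidationService {
 
    private static final int MAX_INPUT_LENGTH = 32000;
    // Keyword heuristic only: it can flag harmless text and miss rephrased attacks,
    // so treat it as one layer of defense rather than the only one.
    private static final Pattern INJECTION_PATTERN =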
        Pattern.compile(
            "(?i)(ignore|forget|disregard).*(previous|above|instruction|system)",
            Pattern.DOTALL
        );
 
    public ValidationResult validate(String input, TenantInfo tenant) {
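        // Null/blank and length checks
        if (input == null || input.isBlank()) {
            return ValidationResult.failure("Input is empty");
        }
        if (input.length() > MAX_INPUT_LENGTH) {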
            return ValidationResult.failure(
                "Input exceeds maximum length (" + MAX_INPUT_LENGTH + " characters)");
        }
 
        // Prompt injection detection
        if (INJECTION_PATTERN.matcher(input).find()) {
            log.warn("Potential prompt injection detected for tenant: {}",
                tenant.getTenantId());
            return ValidationResult.failure("Invalid input content");
        }
 
        // Tenant-specific block list
        List<String> blockedTerms = tenant.getCustomConfig()
            .getOrDefault("blocked_terms", "")
            .lines()
            .filter(s -> !s.isBlank())
            .toList();
 
        for (String term : blockedTerms) {
            if (input.toLowerCase().contains(term.toLowerCase())) {
                return ValidationResult.failure(
                    "Input contains a blocked keyword");
            }
        }
 
        return ValidationResult.success(input.trim());
    }
}
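
The injection pattern above is worth probing before relying on it. The JDK-only sketch below runs the same regular expression against three inputs; `InjectionHeuristicDemo` is a hypothetical class name, and the sample sentences are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: applies the same expression as INJECTION_PATTERN above.
final class InjectionHeuristicDemo {
    static final Pattern PATTERN = Pattern.compile(
        "(?i)(ignore|forget|disregard).*(previous|above|instruction|system)",
        Pattern.DOTALL);

    // True when a trigger verb is followed anywhere later by a target keyword.
    static boolean flagged(String input) {
        return PATTERN.matcher(input).find();
    }
}
```

`flagged("Ignore all previous instructions")` is true, as intended. `flagged("Please don't forget to restart the system tonight")` is also true, which is a false positive, while `flagged("Pay no attention to earlier rules")` is false, which is a miss. Both failure modes are inherent to keyword matching.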

Rate Limiting Implementation

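// RateLimiterService.java: Per-tenant fixed-window rate limiting (one counter per minute)
@Service
@Slf4j
@RequiredArgsConstructor
public class RateLimiterService {
 
    private final RedisTemplate<String, String> redisTemplate;
    private final ApplicationEventPublisher eventPublisher;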
 
    public boolean tryAcquire(String tenantId, RateLimitConfig config) {
        String key = "rate_limit:" + tenantId + ":" +
            Instant.now().getEpochSecond() / 60; // 1-minute window
 
        Long count = redisTemplate.opsForValue().increment(key);
        redisTemplate.expire(key, Duration.ofMinutes(2));
 
        if (count == null) return true;
 
        if (count > config.getRequestsPerMinute()) {
            log.warn("Rate limit exceeded for tenant: {} ({}/{})",
                tenantId, count, config.getRequestsPerMinute());
            return false;
        }
 
        return true;
    }
 
    // Track token usage for billing/quota management
    public void recordTokenUsage(
            String tenantId, int promptTokens, int completionTokens) {
 
        String dailyKey = "tokens:" + tenantId + ":" +
            LocalDate.now().toString();
 
        redisTemplate.opsForHash().increment(
            dailyKey, "prompt_tokens", promptTokens);
        redisTemplate.opsForHash().increment(
            dailyKey, "completion_tokens", completionTokens);
        redisTemplate.expire(dailyKey, Duration.ofDays(7));
 
        checkMonthlyQuota(tenantId, promptTokens + completionTokens);
    }
 
    private void checkMonthlyQuota(String tenantId, int tokens) {
        String monthlyKey = "tokens_monthly:" + tenantId + ":" +
            YearMonth.now().toString();
 
        Long totalTokens = redisTemplate.opsForValue()
            .increment(monthlyKey, tokens);
        redisTemplate.expire(monthlyKey, Duration.ofDays(35));
 
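        // Warn once, when usage first crosses 80% of the monthly limit
        // (checking only "above 80%" would publish an event on every later request)
        long monthlyLimit = getMonthlyTokenLimit(tenantId);
        double threshold = monthlyLimit * 0.8;
        if (totalTokens != null
                && totalTokens > threshold
                && totalTokens - tokens <= threshold) {
            eventPublisher.publishEvent(
                new QuotaWarningEvent(tenantId, totalTokens, monthlyLimit));
        }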
    }
}
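
The service above keeps one counter per calendar minute, which is a fixed window: a tenant can send a full quota at the end of one minute and another full quota at the start of the next. Where that burst matters, a sliding window closes the gap. The sketch below is a single-JVM, in-memory illustration of the sliding-window-log idea using only the JDK; `SlidingWindowLimiter` is a hypothetical class, and a distributed version would need shared storage such as Redis.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory sliding-window-log limiter (single JVM only).
final class SlidingWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Map<String, Deque<Long>> hits = new ConcurrentHashMap<>();

    SlidingWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    // nowMillis is passed in so the behaviour can be tested without sleeping.
    boolean tryAcquire(String tenantId, long nowMillis) {
        Deque<Long> log = hits.computeIfAbsent(tenantId, id -> new ArrayDeque<>());
        synchronized (log) {
            // Drop timestamps that have slid out of the window
            while (!log.isEmpty() && log.peekFirst() <= nowMillis - windowMillis) {
                log.pollFirst();
            }
            if (log.size() >= limit) {
                return false;               // window is full, reject without recording
            }
            log.addLast(nowMillis);
            return true;
        }
    }
}
```

The trade-off is memory: the log stores one timestamp per accepted request, whereas the fixed-window counter stores a single number per tenant per minute.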

Observability: Knowing What's Happening in Production

Metrics with Micrometer

// GeminiMetricsAdvisor.java: Instrument all AI requests
// Note: RequestResponseAdvisor and Usage.getGenerationTokens() belong to Spring AI's
// pre-1.0 milestone releases; on 1.0 the same logic goes into a CallAdvisor /
// StreamAdvisor and reads Usage.getCompletionTokens().
@Component
@RequiredArgsConstructor
public class GeminiMetricsAdvisor implements RequestResponseAdvisor {
 
    private final MeterRegistry meterRegistry;
 
    @Override
    public AdvisedRequest adviseRequest(
            AdvisedRequest request, Map<String, Object> context) {
 
        // Keep the timer sample in the advisor context rather than a ThreadLocal:
        // the response may be handled on a different thread (streaming, reactive calls)
        context.put("timer_sample", Timer.start(meterRegistry));
        context.put("start_time", System.currentTimeMillis());
        context.put("tenant_id", TenantContext.getTenant().getTenantId());
 
        int estimatedTokens = estimateTokens(
            request.userText() + request.systemText());
        context.put("estimated_prompt_tokens", estimatedTokens);
 
        meterRegistry.counter("gemini.request.count",
            "tenant", context.get("tenant_id").toString(),
            "model", getModelName(request)
        ).increment();
 
        return request;
    }
 
    @Override
    public ChatResponse adviseResponse(
            ChatResponse response, Map<String, Object> context) {
 
        Timer.Sample sample = (Timer.Sample) context.get("timer_sample");
        if (sample != null) {
            sample.stop(Timer.builder("gemini.request.latency")
                .tag("tenant", context.get("tenant_id").toString())
                .tag("status", "success")
                .register(meterRegistry));
        }
 
        if (response.getMetadata().getUsage() != null) {
            Usage usage = response.getMetadata().getUsage();
 
            meterRegistry.counter("gemini.tokens.prompt",
                "tenant", context.get("tenant_id").toString()
            ).increment(usage.getPromptTokens());
 
            meterRegistry.counter("gemini.tokens.completion",
                "tenant", context.get("tenant_id").toString()
            ).increment(usage.getGenerationTokens());
 
            // Cost estimation (Gemini 2.5 Pro pricing as of April 2026)
            double estimatedCost = calculateCost(
                usage.getPromptTokens(),
                usage.getGenerationTokens()
            );
 
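            // Accumulate cost in a counter. A gauge only holds a weak reference to a
            // temporary value, so it would not retain it, and it cannot be summed per tenant.
            meterRegistry.counter("gemini.cost.usd",
                "tenant", context.get("tenant_id").toString()
            ).increment(estimatedCost);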
        }
 
        return response;
    }
 
    private double calculateCost(long promptTokens, long completionTokens) {
        double promptCost = promptTokens / 1_000_000.0 * 1.25;
        double completionCost = completionTokens / 1_000_000.0 * 10.0;
        return promptCost + completionCost;
    }
}
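
A worked example makes the cost formula easier to sanity-check. The sketch below repeats the arithmetic of `calculateCost` in a standalone class; `CostEstimate` is a hypothetical name, and the two prices are simply the constants from the code above, which should be treated as assumptions and checked against current pricing.

```java
// Hypothetical standalone version of calculateCost for checking the arithmetic.
final class CostEstimate {
    // Assumed prices in USD per one million tokens, copied from calculateCost above
    static final double PROMPT_USD_PER_MTOK = 1.25;
    static final double COMPLETION_USD_PER_MTOK = 10.0;

    static double usd(long promptTokens, long completionTokens) {
        return promptTokens / 1_000_000.0 * PROMPT_USD_PER_MTOK
             + completionTokens / 1_000_000.0 * COMPLETION_USD_PER_MTOK;
    }
}
```

A request with 1,200 prompt tokens and 350 completion tokens costs 1,200 / 1,000,000 × 1.25 + 350 / 1,000,000 × 10 = 0.0015 + 0.0035 = 0.005 USD under these assumed prices. Note that the completion side dominates even though it has far fewer tokens.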

Distributed Tracing with OpenTelemetry

// TracingConfig.java: OpenTelemetry setup for OTLP export
@Configuration
public class TracingConfig {
 
    @Bean
    public OpenTelemetrySdk openTelemetry(
            @Value("${otel.exporter.otlp.endpoint}") String endpoint) {
 
        return OpenTelemetrySdk.builder()
            .setTracerProvider(SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(
                    OtlpGrpcSpanExporter.builder()
                        .setEndpoint(endpoint)
                        .build())
                    .build())
                .setResource(Resource.getDefault().merge(
                    Resource.create(Attributes.of(
                        ResourceAttributes.SERVICE_NAME, "gemini-api-service",
                        ResourceAttributes.SERVICE_VERSION, "2.0.0"
                    ))))
                .build())
            .build();
    }
}
 
// GeminiTracingService.java: Wrap AI calls in trace spans
@Service
public class GeminiTracingService {
 
    private final Tracer tracer;
    private final ChatClient chatClient;
 
    public String chatWithTracing(
            String userMessage, String tenantId, String conversationId) {
 
        Span span = tracer.spanBuilder("gemini.chat")
            .setAttribute("tenant.id", tenantId)
            .setAttribute("conversation.id", conversationId)
            .setAttribute("user.message.length", userMessage.length())
            .startSpan();
 
        try (Scope scope = span.makeCurrent()) {
            String response = chatClient.prompt()
                .user(userMessage)
                .advisors(a -> a.param(
                    ChatMemory.CONVERSATION_ID,
                    tenantId + ":" + conversationId))
                .call()
                .content();
 
            span.setAttribute("response.length", response.length());
            span.setStatus(StatusCode.OK);
 
            return response;
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();
        }
    }
}

Testing Strategy: Ensuring Production Quality

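Unit Tests with a Mocked Chat Model

// TenantAwareChatServiceTest.java: Unit tests with a mocked chat model
@ExtendWith(MockitoExtension.class)
class TenantAwareChatServiceTest {
 
    @Mock
    private VertexAiGeminiChatModel chatModel;
 
    @Mock
    private TenantConfigRepository configRepository;
 
    // Backing store for conversation memory
    @Mock
    private ChatMemory chatMemory;
 
    // Assumes chat() calls RateLimiterService.tryAcquire before the model and
    // throws RateLimitExceededException when it returns false.
    @Mock
    private RateLimiterService rateLimiterService;
 
    @InjectMocks
    private TenantAwareChatService chatService;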
 
    @BeforeEach
    void setUp() {
        TenantContext.setTenant(TenantInfo.builder()
            .tenantId("tenant-001")
            .tenantName("Acme Corp")
            .tier(GeminiTier.STANDARD)
            .customConfig(Map.of())
            .build());
    }
 
    @AfterEach
    void tearDown() {
        TenantContext.clear();
    }
 
    @Test
    void shouldReturnChatResponse() {
        // Given
        String userMessage = "Hello!";
        String expectedResponse = "Hello! How can I help you today?";
 
        ChatResponse mockResponse = createMockChatResponse(expectedResponse);
        when(chatModel.call(any(Prompt.class))).thenReturn(mockResponse);
 
        // When
        String actual = chatService.chat(userMessage, "conv-001");
 
        // Then
        assertThat(actual).isEqualTo(expectedResponse);
        verify(chatModel, times(1)).call(any(Prompt.class));
    }
 
    @Test
    void shouldThrowExceptionWhenRateLimitExceeded() {
        when(rateLimiterService.tryAcquire(anyString(), any()))
            .thenReturn(false);
 
        assertThatThrownBy(() -> chatService.chat("test", "conv-001"))
            .isInstanceOf(RateLimitExceededException.class)
            .hasMessageContaining("Rate limit");
    }
 
    private ChatResponse createMockChatResponse(String content) {
        AssistantMessage message = new AssistantMessage(content);
        Generation generation = new Generation(message);
        return new ChatResponse(List.of(generation));
    }
}

Integration Tests with Testcontainers

// GeminiIntegrationTest.java: Integration tests with real Redis
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@ActiveProfiles("integration-test")
@Testcontainers   // required, otherwise the @Container field is never started
class GeminiIntegrationTest {
 
    @Container
    static RedisContainer redis = new RedisContainer(
        DockerImageName.parse("redis:7-alpine"))
        .withExposedPorts(6379);
 
    @DynamicPropertySource
    static void configureRedis(DynamicPropertyRegistry registry) {
        registry.add("spring.data.redis.host", redis::getHost);
        registry.add("spring.data.redis.port", redis::getFirstMappedPort);
    }
 
    @Autowired
    private TestRestTemplate restTemplate;
 
    @Autowired
    private TenantAwareChatService chatService;
 
    @Test
    @Disabled("Only runs in environments with API key configured")
    void shouldConnectToProductionAPI() {
        ChatRequest request = new ChatRequest("Hello, Gemini!");
 
        ResponseEntity<ChatResponse> response = restTemplate.postForEntity(
            "/api/v1/chat",
            request,
            ChatResponse.class
        );
 
        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
        assertThat(response.getBody()).isNotNull();
        assertThat(response.getBody().getMessage()).isNotBlank();
    }
 
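    @Test
    void shouldMaintainConversationContext() {
        // Requires a model (real or stubbed) that answers from conversation history
        String conversationId = UUID.randomUUID().toString();
        TenantContext.setTenant(TenantInfo.builder()
            .tenantId("tenant-001")
            .tenantName("Acme Corp")
            .customConfig(Map.of())
            .build());
        try {
            // First message
            chatService.chat("My name is Alice", conversationId);
 
            // Follow-up: the answer should draw on the stored first message
            String response = chatService.chat("What is my name?", conversationId);
 
            assertThat(response).containsIgnoringCase("Alice");
        } finally {
            TenantContext.clear();
        }
    }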
}

Summary

This guide walked through the key architectural patterns for running Gemini API in a Spring Boot production environment:

Spring AI framework enables provider-agnostic, maintainable code with clean separation of concerns
Multi-tenant design relies on ThreadLocal isolation, Caffeine-cached ChatClients, and per-tenant configuration management
Async processing via Spring WebFlux + Project Reactor is essential — rate-limit-aware design prevents cascading failures
Security is a three-layer problem: Secret Manager for key management, prompt injection detection for input safety, and fixed-window rate limiting for quota enforcement
Observability combines Micrometer (metrics) + OpenTelemetry (tracing) + structured logging for complete production visibility
Testing follows a three-layer strategy: unit tests (mocked chat model), integration tests (Testcontainers), and E2E tests for critical paths

Together, these patterns are meant to give production systems a reliable, cost-aware foundation; validate throughput and cost against your own workload before relying on them.

For the latest Spring AI features and Gemini 2.5 Pro updates, monitor the official Spring AI documentation and Google AI for Developers regularly.
