GEMINI LABJP
SIRI — WWDC 2026 confirms the revamped Siri runs on a Google Gemini model, though it won't ship in the EU at iOS 27 due to the DMAFLASH3.5 — Gemini 3.5 Flash is now GA, the top Flash model for sustained frontier performance on agentic and coding tasksIMAGE-GA — Gemini 3.1 Flash Image and 3.1 Pro Image are GA as native visual models; the preview versions shut down Jun 25MANAGED-AGENTS — Managed Agents launch in public preview in the Gemini API, running autonomous agents in Google-hosted isolated Linux sandboxesFILE-SEARCH — File Search now supports multimodal search, with native image embedding and retrieval via gemini-embedding-2DEPRECATION — gemini-3.1-flash-image-preview and gemini-3-pro-image-preview shut down Jun 25 — migrate to the GA models soonSIRI — WWDC 2026 confirms the revamped Siri runs on a Google Gemini model, though it won't ship in the EU at iOS 27 due to the DMAFLASH3.5 — Gemini 3.5 Flash is now GA, the top Flash model for sustained frontier performance on agentic and coding tasksIMAGE-GA — Gemini 3.1 Flash Image and 3.1 Pro Image are GA as native visual models; the preview versions shut down Jun 25MANAGED-AGENTS — Managed Agents launch in public preview in the Gemini API, running autonomous agents in Google-hosted isolated Linux sandboxesFILE-SEARCH — File Search now supports multimodal search, with native image embedding and retrieval via gemini-embedding-2DEPRECATION — gemini-3.1-flash-image-preview and gemini-3-pro-image-preview shut down Jun 25 — migrate to the GA models soon
Articles/API / SDK
API / SDK/2026-04-14Advanced

Gemini API Embeddings vs Vector Databases: Pinecone, Qdrant, pgvector, and Cloud Spanner Compared for Production

Benchmark Pinecone, Qdrant, pgvector, and Cloud Spanner Vector using Gemini text-embedding-004 with real latency, cost, and code. The definitive production selection guide.

gemini114embeddings13vector-database4pinecone2qdrantpgvector4rag23production124

Premium Article

Right before pushing a RAG system to production, most developers hit the same wall: Pinecone feels expensive, pgvector sounds like a maintenance headache, Qdrant looks promising but lacks English documentation, and Cloud Spanner seems overkill. The individual tutorials are everywhere. A side-by-side benchmark using the same Gemini embedding model across all four? Almost nowhere.

This article fills that gap. Using text-embedding-004 as the universal baseline, I ran every major vector database through the same test suite — real insert workloads, production-grade query patterns, and cost modeling at various scales. What follows is what the benchmarks actually showed, along with the production pitfalls that only reveal themselves once traffic hits.

Why Vector Database Selection Is a Critical Production Decision

Two failure modes appear after a wrong choice.

Cost explosion: The combination of query volume, embedding dimensions, and index size can push managed service bills 10x above estimates. One team estimated their Pinecone Serverless costs at $30/month, then watched it climb to $280 after adding metadata to every vector and neglecting dimension optimization. Simply switching from 1536 to 768 dimensions — which text-embedding-004 supports natively — cut their bill by 60%.

Latency wall: For chat interfaces and real-time search, P99 vector search latency above 100ms breaks the user experience. The wrong database plus misconfigured indexes can push search times above 300ms even on 100,000-vector collections.

Setting Up text-embedding-004

All benchmarks use a consistent baseline:

# Baseline: Gemini text-embedding-004
import google.generativeai as genai
import os
import time
 
# Available dimensions: 256 / 512 / 768 (default)
# Max input tokens: 2,048
# Max batch size: 100 texts per request
 
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
 
def get_embedding(
    text: str,
    task_type: str = "RETRIEVAL_DOCUMENT",
    dimensions: int = 768
) -> list[float]:
    """Get a single embedding from Gemini."""
    result = genai.embed_content(
        model="models/text-embedding-004",
        content=text,
        task_type=task_type,  # RETRIEVAL_DOCUMENT / RETRIEVAL_QUERY / SEMANTIC_SIMILARITY
        output_dimensionality=dimensions
    )
    return result["embedding"]
 
# Verify output
doc_vec = get_embedding("How to use the Gemini API", task_type="RETRIEVAL_DOCUMENT")
query_vec = get_embedding("Teach me how to use the API", task_type="RETRIEVAL_QUERY")
print(f"Dimension: {len(doc_vec)}")  # Output: Dimension: 768

The task_type distinction matters more than most tutorials suggest. Using RETRIEVAL_DOCUMENT for indexed content and RETRIEVAL_QUERY for search queries can improve cosine similarity scores by 3–5 percentage points. Re-indexing to fix a task_type mistake on 100,000 documents takes 10–15 minutes — catch it early.

Production Batch Client: Shared Foundation for All Databases

Before diving into each database, build the shared embedding infrastructure. One-at-a-time API calls take 10–15 minutes per 10,000 documents. This client handles batching, retries, and rate limiting so it works with any backend.

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
 
class GeminiEmbeddingClient:
    """Production-ready Gemini embedding client."""
 
    def __init__(self, api_key: str, dimensions: int = 768):
        genai.configure(api_key=api_key)
        self.dimensions = dimensions
        self.max_batch = 100      # API hard limit
        self.rpm_limit = 1000     # Paid tier limit (Free: 100 RPM)
        self._request_times: list[float] = []
 
    def _throttle(self):
        """Enforce RPM limit using a 60-second sliding window."""
        now = time.time()
        self._request_times = [t for t in self._request_times if now - t < 60]
        if len(self._request_times) >= self.rpm_limit:
            sleep_time = 60 - (now - self._request_times[0]) + 0.1
            time.sleep(sleep_time)
        self._request_times.append(time.time())
 
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=30),
        retry=retry_if_exception_type(Exception),
        reraise=True
    )
    def _embed_batch_raw(self, texts: list[str], task_type: str) -> list[list[float]]:
        """Core batch embedding call with retry logic."""
        self._throttle()
        result = genai.embed_content(
            model="models/text-embedding-004",
            content=texts,
            task_type=task_type,
            output_dimensionality=self.dimensions
        )
        return result["embedding"]
 
    def embed_documents(self, texts: list[str], show_progress: bool = True) -> list[list[float]]:
        """Embed a list of documents (RETRIEVAL_DOCUMENT mode)."""
        return self._embed_all(texts, "RETRIEVAL_DOCUMENT", show_progress)
 
    def embed_query(self, query: str) -> list[float]:
        """Embed a single search query (RETRIEVAL_QUERY mode)."""
        return self._embed_batch_raw([query], "RETRIEVAL_QUERY")[0]
 
    def _embed_all(self, texts: list[str], task_type: str, show_progress: bool) -> list[list[float]]:
        embeddings = []
        total = len(texts)
        for i in range(0, total, self.max_batch):
            batch = texts[i:i + self.max_batch]
            try:
                embeddings.extend(self._embed_batch_raw(batch, task_type))
            except Exception as e:
                # Batch failed — fall back to individual requests
                print(f"Batch failed ({i}{i+len(batch)}): {e}")
                for text in batch:
                    try:
                        embeddings.append(self._embed_batch_raw([text], task_type)[0])
                    except Exception as e2:
                        print(f"  Individual also failed: {e2} — using zero vector")
                        embeddings.append([0.0] * self.dimensions)
            if show_progress:
                print(f"  Progress: {min(i + self.max_batch, total)}/{total}")
        return embeddings
 
client = GeminiEmbeddingClient(api_key=os.environ["GEMINI_API_KEY"])

Every database implementation below uses this client. Swap out the storage layer without touching the embedding logic.

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN
Get real benchmark data across all four vector databases using the same Gemini embedding model — and make the right choice for your scale immediately
Copy-paste production-ready code with retry logic, rate limiting, and batch processing for every database
Learn the configuration patterns that cut RAG costs by 70% at one million queries per month
Secure payment via Stripe · Cancel anytime
Share

Thank You for Reading

Gemini Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.

  • Copy-paste ready implementation code
  • New advanced guides published daily
  • $5/mo or $10 for lifetime access
View Membership →

Related Articles

API / SDK2026-04-03
Building a Production RAG System with Gemini Embedding API and Pinecone
A step-by-step guide to building a production-ready RAG system using Gemini Embedding API and Pinecone. Covers index design, query optimization, chunking strategies, and cost management with practical Python code.
API / SDK2026-03-29
Building Production Semantic Search with Gemini Embeddings API — Design, Implementation, and Operations
A comprehensive guide to building production-grade semantic search with Gemini Embeddings API. Covers vector DB selection, reranking, recommendation engines, and cost optimization with practical code.
API / SDK2026-04-28
Beyond Embeddings: Production Reranking with Vertex AI Ranking and Gemini-as-Judge
When pure embedding search nails the top-3 but buries the right answer at rank 4, you need a reranker. This guide walks through a production-grade two-stage architecture using Vertex AI Ranking API and Gemini-as-judge — with cost, latency, and evaluation patterns that hold up under load.
📚RECOMMENDED BOOKS
Build a Large Language Model (From Scratch)
Sebastian Raschka
LLM Dev
Prompt Engineering for LLMs
Berryman & Ziegler
Prompting
AI Engineering
Chip Huyen
AI Eng
* Contains affiliate links
See all →