API / SDK · 2026-07-04 · Advanced

Catch Near-Duplicate Posts Before You Publish — a Topic-Cannibalization Gate with Gemini Embeddings

Once a blog passes a few hundred posts, new articles start cannibalizing old ones in search. This article walks through a pre-publish gate that embeds each post's meaning with gemini-embedding-2 and blocks drafts that sit too close to something you already wrote, with runnable code and guidance on picking the thresholds.

Have you ever reached for the publish button and frozen, thinking "wait, didn't I already write something like this"? As an indie developer running several technical blogs, I started hitting that déjà vu constantly once a corpus crossed a few hundred posts.

The tricky part is when the title and slug are completely different, yet to a search engine the two posts answer nearly the same question. In SEO terms this is keyword cannibalization, and it drags both posts into a mediocre middle of the rankings. No human reviewer can reliably check a new draft against 600 existing posts every single time.

So I built a pre-publish gate: before a draft goes live, its meaning vector is compared against the whole existing archive, and anything too close gets sent back. I covered the image side of this in the near-duplicate image gate article; this is the prose counterpart. It uses gemini-embedding-2, and I'll share the threshold tuning and false-positive handling in the order they actually mattered in production.

Why keyword matching and title comparison miss the real duplicates

The first thing I tried was string matching on titles and descriptions. It was almost useless. "Cutting Gemini API costs" and "Designing so Gemini billing doesn't balloon at month end" share almost no words, yet the reader's search intent is nearly identical. Meanwhile, "Cutting Gemini costs" and "Compressing images with Gemini" overlap on "Gemini", which is enough for a word-overlap score to pair them, yet they are entirely different articles.

Surface words can't measure semantic distance. That's where embeddings come in. An embedding turns text into a high-dimensional vector, placing semantically similar text closer together in space. Even with zero overlapping words, two posts that "say the same thing in different words" show up close, captured by a single number: cosine similarity.

For me this gap was starkest in my operations posts. Three articles describing similar operational know-how in varied wording never matched on string comparison, yet clustered around 0.9 similarity to each other under embeddings.
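To make the "single number" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are invented toy values for illustration only; real gemini-embedding-2 vectors have far more dimensions, and the function name is my own.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical, not real embeddings):
# post_a and post_b point roughly the same way, post_c does not.
post_a = [0.9, 0.1, 0.0]
post_b = [0.8, 0.2, 0.1]
post_c = [0.0, 0.1, 0.9]

if __name__ == "__main__":
    print(round(cosine_similarity(post_a, post_b), 3))  # high: similar topics
    print(round(cosine_similarity(post_a, post_c), 3))  # low: unrelated topics
```

Because the score depends only on direction, not vector length, two topic summaries of very different sizes can still land close together.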

What to embed — the full body is counterproductive

Naively turning the whole article body into one vector actually lowers accuracy. Long posts get diluted by code samples and asides, and the core topic drowns inside the vector. After some trial and error, I settled on this:

  1. The title
  2. The description (the article's thesis compressed into ~160 characters)
  3. Every H2 and H3 heading (the skeleton of the piece itself)

I embed a "topic summary text" that concatenates these three. Headings are the blueprint the author deliberately laid out, so they capture the semantic core far better than the surrounding prose. Code samples and quotes are intentionally excluded.

import re
from pathlib import Path

def build_topic_text(mdx_path: Path) -> str:
    """Assemble a topic-summary text (title + description + headings) from MDX."""
    text = mdx_path.read_text(encoding="utf-8")
    m = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    # a file without front matter is treated as body-only instead of crashing
    front, body = (m.group(1), m.group(2)) if m else ("", text)

    def fm(key: str) -> str:
        # read a single-line front-matter value, with or without double quotes
        mm = re.search(rf'^{re.escape(key)}:\s*"?(.*?)"?\s*$', front, re.MULTILINE)
        return mm.group(1) if mm else ""

    # strip code blocks before extracting headings
    body_no_code = re.sub(r"```.*?```", "", body, flags=re.DOTALL)
    # H2 and H3 only; the H1 duplicates the title
    headings = re.findall(r"^#{2,3}\s+(.*)$", body_no_code, re.MULTILINE)

    parts = [fm("title"), fm("description")] + headings
    return "\n".join(p for p in parts if p)

The output is a single text with "title + thesis + all headings" on separate lines. That gets vectorized in the next step.
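One detail worth seeing in isolation is why the code blocks are stripped before the heading regex runs. The sketch below uses the same two patterns on an invented sample body; the helper names and the sample text are hypothetical, chosen only to show the failure mode.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can sit inside a code block

# Invented sample body: one shell comment inside a fence looks like an H2.
SAMPLE_BODY = "\n".join([
    "## Setup",
    "Install the client first.",
    FENCE + "bash",
    "## install deps",
    "pip install example-package",
    FENCE,
    "### Tuning",
    "Pick thresholds from your own corpus.",
])

HEADING = re.compile(r"^#{2,3}\s+(.*)$", re.MULTILINE)
CODE_BLOCK = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)

def headings_naive(body: str) -> list[str]:
    """Extract headings without removing fenced code first."""
    return HEADING.findall(body)

def headings_stripped(body: str) -> list[str]:
    """Remove fenced code, then extract headings."""
    return HEADING.findall(CODE_BLOCK.sub("", body))

if __name__ == "__main__":
    print(headings_naive(SAMPLE_BODY))
    print(headings_stripped(SAMPLE_BODY))
```

Without the stripping step, the shell comment is picked up as a heading and ends up in the text that gets embedded.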

What you'll learn

  • You can catch "too semantically close" posts that title matching never sees, using cosine similarity over gemini-embedding-2 vectors, before you hit publish
  • You'll get the thresholds that actually held up on a 670-article corpus (0.88 to review, 0.92 to reject) plus a two-stage design that only asks Gemini to judge the borderline cases
  • You'll be able to run a gate for a continuously growing archive, with section-level chunking, incremental reindexing, and SQLite storage included
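As a rough sketch of how the two thresholds could turn a similarity score into a verdict: the 0.88 and 0.92 values come from the summary above, but the function names, the result type, the slug-keyed archive dict, and the choice to treat both boundaries as inclusive are my own assumptions, not the article's implementation. The vectors in any real run would come from gemini-embedding-2; this code does not call the API.

```python
import math
from typing import NamedTuple, Optional

REVIEW_THRESHOLD = 0.88  # from the article summary: send to review
REJECT_THRESHOLD = 0.92  # from the article summary: reject outright

class GateResult(NamedTuple):
    verdict: str                 # "pass", "review", or "reject"
    closest_slug: Optional[str]  # most similar existing post, if any
    similarity: float

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def check_draft(draft_vec: list[float],
                archive: dict[str, list[float]]) -> GateResult:
    """Compare a draft vector against every archived vector (hypothetical helper)."""
    best_slug, best = None, -1.0
    for slug, vec in archive.items():
        score = _cosine(draft_vec, vec)
        if score > best:
            best_slug, best = slug, score
    if best_slug is None:
        return GateResult("pass", None, 0.0)  # empty archive: nothing to clash with
    if best >= REJECT_THRESHOLD:   # inclusive boundary is an assumption
        verdict = "reject"
    elif best >= REVIEW_THRESHOLD:
        verdict = "review"
    else:
        verdict = "pass"
    return GateResult(verdict, best_slug, best)
```

In a two-stage design like the one described, the "review" band is where a second check by Gemini would take over; that step is not shown here.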