API / SDK · 2026-06-19 · Advanced

Catch Near-Duplicate Images Before You Publish with gemini-embedding-2

This is about removing near-duplicates, not image search. Use gemini-embedding-2 multimodal embeddings to vectorize images, cluster them, and build a pre-publish gate — with working code and threshold guidance.



When you run several sites, image assets bite you later not because you have too few, but because near-identical ones quietly pile up. As an indie developer I keep the OGP images for four blogs plus a set of wallpaper apps under Dolice, and for the last six months I've increasingly paused on "wait, haven't I already published this pale blue abstract background?" Checking one by one stops being realistic once the count crosses three digits.

In June 2026, File Search gained multimodal search with gemini-embedding-2, which adds a clean tool for this problem. But what we want here is not search. Instead of finding and pulling back similar images, we want to reject images that are too similar before they ship. These two goals differ in both intent and implementation, and conflating them lets the gate pass everything through.

Why image search can't reject near-duplicates

Retrieval returns "the top N closest to a query." It always returns something, and a loose threshold still works. Near-duplicate detection needs something else: a binary judgment of "are these two images close enough to be considered effectively the same?"

If you repurpose retrieval directly, the single closest item is always returned, so completely unrelated images still line up as "similar candidates." Conversely, if you leave the threshold tuned for search, you miss the recolors and crops you actually want to catch. A near-duplicate gate has to switch to a design where the score itself is the decision boundary.
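
The contrast can be sketched with toy 2-D vectors standing in for real embeddings. The function names `top_n` and `is_near_duplicate`, the file names, the vectors, and the 0.96 cutoff are all illustrative assumptions, not part of any API.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_n(query, index, n=1):
    """Retrieval: always returns the n closest items, however far away."""
    ranked = sorted(index, key=lambda k: cosine(query, index[k]), reverse=True)
    return ranked[:n]

def is_near_duplicate(query, index, threshold=0.96):
    """Gate: a yes/no answer; the score itself is the decision boundary."""
    return any(cosine(query, v) >= threshold for v in index.values())

# Toy stand-ins for embeddings (hypothetical values).
index = {"blue_gradient.png": [1.0, 0.0], "forest.jpg": [0.8, 0.6]}
unrelated = [0.0, 1.0]   # far from both stored vectors
recolor = [0.99, 0.05]   # almost parallel to blue_gradient.png

print(top_n(unrelated, index))              # retrieval still names a "closest" item
print(is_near_duplicate(unrelated, index))  # the gate says no
print(is_near_duplicate(recolor, index))    # the gate says yes
```

Retrieval hands back `forest.jpg` for the unrelated query even though the similarity is only 0.6, while the gate rejects only the recolor.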

I underestimated this difference at first and judged "no duplicates" just by glancing at the top File Search results. In reality, similarly composed gradient backgrounds had grown into three lineages; the search context simply treated them as separate hits.

Vectorize the images

First, convert each image into an embedding vector with gemini-embedding-2. Because it's multimodal, you pass an image part to the same endpoint you'd use for text.

import mimetypes
import os
from pathlib import Path

from google import genai
from google.genai import types

# Read the key from the environment instead of hard-coding it in source.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

EMBED_MODEL = "gemini-embedding-2"  # multimodal (GA, 2026-06)

def embed_image(path: Path) -> list[float]:
    """Return the embedding vector for one image file."""
    data = path.read_bytes()
    # Guess the MIME type from the extension; fall back to JPEG if unknown.
    mime = mimetypes.guess_type(path.name)[0] or "image/jpeg"
    resp = client.models.embed_content(
        model=EMBED_MODEL,
        contents=[types.Part.from_bytes(data=data, mime_type=mime)],
    )
    return resp.embeddings[0].values

If you L2-normalize the vectors up front, the dot product is the cosine similarity, which keeps the later math simple.

import math
 
def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]
 
def cosine(a, b):
    # a, b are normalized -> dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))
 
def build_index(paths):
    index = {}
    for p in paths:
        index[str(p)] = l2_normalize(embed_image(p))
    return index

Since embedding hits the API once per image, a naive implementation pays that cost every run as assets grow. In my own setup, I store vectors locally keyed by the file hash and skip re-embedding any image whose content hasn't changed.
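
A minimal sketch of such a cache, assuming SHA-256 as the content hash and a single JSON file as the store. Both choices, and the helper names `file_hash`, `embed_with_cache`, `load_cache`, and `save_cache`, are my assumptions for illustration. The embedding function is passed in as `embed_fn` so the cache logic can be tried without calling the API.

```python
import hashlib
import json
from pathlib import Path

def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, so a renamed file still hits the cache."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def embed_with_cache(path: Path, cache: dict, embed_fn) -> list:
    """Return the cached vector for this content, embedding only on a miss."""
    key = file_hash(path)
    if key not in cache:
        cache[key] = embed_fn(path)
    return cache[key]

def load_cache(cache_file: Path) -> dict:
    """Load the hash -> vector map, or start empty if no file exists yet."""
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    return {}

def save_cache(cache_file: Path, cache: dict) -> None:
    """Write the hash -> vector map back to disk as JSON."""
    cache_file.write_text(json.dumps(cache))
```

With the functions from this article, the call would look like `embed_with_cache(p, cache, lambda q: l2_normalize(embed_image(q)))`, followed by `save_cache` at the end of the run.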

Cluster the near-duplicates

Take the cosine similarity over all pairs and group images linked above the threshold. Because I want to fold "A is close to B, B is close to C" into one cluster, a union-find (disjoint-set) structure expresses this cleanly.

class UnionFind:
    def __init__(self, keys):
        self.parent = {k: k for k in keys}
 
    def find(self, k):
        while self.parent[k] != k:
            self.parent[k] = self.parent[self.parent[k]]  # path compression
            k = self.parent[k]
        return k
 
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
 
 
def cluster_near_duplicates(index, threshold=0.96):
    keys = list(index)
    uf = UnionFind(keys)
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if cosine(index[keys[i]], index[keys[j]]) >= threshold:
                uf.union(keys[i], keys[j])
    clusters = {}
    for k in keys:
        clusters.setdefault(uf.find(k), []).append(k)
    # only clusters with 2+ members are "near-duplicates"
    return [m for m in clusters.values() if len(m) > 1]

All-pairs comparison is O(n²), so beyond a few thousand images I'd move to approximate nearest neighbor (ANN). But at my scale of a few hundred per site, the naive brute force is fast enough, and I prefer getting it running first to confirm the behavior.
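
One consequence of transitive grouping is worth seeing on toy data: two images can land in the same cluster even when their direct similarity is below the threshold, as long as a third image bridges them. The sketch below is self-contained and uses unit vectors at chosen angles as stand-ins for embeddings. The helper names `unit` and `group_transitively` are hypothetical, and the union-find here is a simplified variant without path compression.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def unit(deg):
    """Unit vector at the given angle; a toy stand-in for an embedding."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]

def group_transitively(index, threshold):
    """Merge every pair at or above threshold, then return groups of 2+."""
    parent = {k: k for k in index}

    def find(k):
        while parent[k] != k:
            k = parent[k]
        return k

    keys = list(index)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if cosine(index[a], index[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for k in keys:
        groups.setdefault(find(k), []).append(k)
    return [sorted(g) for g in groups.values() if len(g) > 1]

# A-B and B-C are 15 degrees apart (cos ~ 0.966); A-C is 30 degrees (cos ~ 0.866).
index = {"A": unit(0), "B": unit(15), "C": unit(30), "D": unit(90)}
print(group_transitively(index, threshold=0.96))  # [['A', 'B', 'C']]
print(round(cosine(index["A"], index["C"]), 3))   # 0.866
```

A and C share a cluster only because B links them, so it is worth reviewing the members of a large cluster rather than assuming every pair inside it is above the threshold.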

Wire it into a pre-publish gate

What pays off in practice is less the "re-cluster everything" pass and more a gate that checks, right before publishing, whether a newly added candidate is too close to existing assets. The existing vector index is already saved, so you embed only the candidates and compare.

from pathlib import Path
 
def prepublish_dedup_gate(candidate_paths, existing_index, threshold=0.96):
    flagged = []
    for p in candidate_paths:
        vec = l2_normalize(embed_image(Path(p)))
        best_key, best_sim = None, -1.0  # cosine ranges over [-1, 1]
        for old_key, old_vec in existing_index.items():
            sim = cosine(vec, old_vec)
            if sim > best_sim:
                best_key, best_sim = old_key, sim
        if best_sim >= threshold:
            flagged.append((str(p), best_key, round(best_sim, 4)))
    return flagged
 
 
def pick_representative(members):
    # use file size as a proxy for resolution / information content
    return max(members, key=lambda p: Path(p).stat().st_size)

If flagged is non-empty, I stop the CI or pre-deploy script with exit 1. I deliberately don't auto-delete here; I only print a list of "which existing image, and how close." Which one to keep and which to drop is something I want a human to decide in the end. I keep the same line I draw elsewhere: I don't have AI produce the artwork itself; I use AI only to assist operational judgment.
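
A minimal sketch of that last step, assuming `flagged` has the `(candidate, closest existing image, similarity)` shape that `prepublish_dedup_gate` returns. The function name `gate_exit_code`, the message format, and the sample paths are hypothetical.

```python
import sys

def gate_exit_code(flagged, out=sys.stdout):
    """Print each flagged candidate and return the process exit code."""
    for candidate, existing, sim in flagged:
        print(f"NEAR-DUPLICATE {candidate} ~ {existing} (cosine {sim})", file=out)
    # Non-zero stops the pipeline; nothing is deleted automatically.
    return 1 if flagged else 0

# Hypothetical sample; in practice this would come from prepublish_dedup_gate.
sample = [("new/ogp_blue.png", "assets/ogp_sky.png", 0.9712)]
code = gate_exit_code(sample)
print("exit code:", code)
# In a real CI script the final line would be: sys.exit(code)
```

Returning the code instead of calling `sys.exit` inside the function keeps the reporting logic easy to test on its own.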

Choosing the threshold, and the traps

I start around 0.96, but that is not a fixed value. It is a number you tune to the character of your own assets. Here are rough guideposts I measured on my own wallpaper and OGP images.

Threshold	Relationships caught	Operational impression
0.99+	Near-exact matches (re-export, light recompression)	Misses a lot. Only obvious duplicates
0.96–0.98	Recolors, slight crops, same-lineage composition	This band feels practical for solo operation
0.93 and below	Down to "the mood is similar"	False positives rise; it bundles distinct works

Keep in mind that multimodal embeddings capture semantic closeness, so lowering the threshold can erase the subtle differences a human reads as "this is a separate work." For instance, two shots of the same subject taken at different times tend to score high when the composition is close. Whether to reject those as duplicates depends on the intent behind the work, so a threshold alone can't decide it.
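
One way to tune the number is to sweep a few candidate thresholds over the pairwise similarities of your existing assets and see how many pairs each would flag. A minimal sketch follows; the function name `flagged_counts` is hypothetical, and the similarity values are made up for illustration, not measurements.

```python
def flagged_counts(similarities, thresholds):
    """For each threshold, count the pairs that would be flagged (sim >= t)."""
    return {t: sum(1 for s in similarities if s >= t) for t in thresholds}

# Hypothetical pairwise similarities, for illustration only (not measured data).
sims = [0.995, 0.972, 0.961, 0.948, 0.931, 0.902, 0.85]

for t, n in flagged_counts(sims, [0.99, 0.96, 0.93]).items():
    print(f"threshold {t:.2f}: {n} pair(s) flagged")
```

Looking at the actual images behind the pairs that appear or disappear between two thresholds shows more than the counts alone.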

Another trap is extreme aspect-ratio differences. If you cut the same source material into a wide OGP and a square app icon, the visual subject is the same but the embedding distance widens. I don't want that kind of "same material, different use" caught by the gate, so I keep separate indexes per output-size lineage.
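
A minimal sketch of splitting the index by output-size lineage, assuming the caller already knows each image's pixel dimensions (reading them needs an image library, which is left out here). The function names, the three bucket names, and the aspect-ratio cutoffs are assumptions to adapt to your own formats.

```python
def lineage_for(width, height):
    """Bucket an image by aspect ratio; the cutoffs here are assumptions."""
    ratio = width / height
    if ratio >= 1.5:
        return "wide"    # e.g. 1200x630 OGP images
    if ratio <= 0.75:
        return "tall"    # e.g. portrait phone wallpapers
    return "square"      # e.g. app icons

def split_by_lineage(items):
    """items: iterable of (key, (width, height), vector) tuples."""
    indexes = {}
    for key, (width, height), vec in items:
        indexes.setdefault(lineage_for(width, height), {})[key] = vec
    return indexes

# Hypothetical assets with toy vectors in place of real embeddings.
items = [
    ("ogp_sky.png", (1200, 630), [1.0, 0.0]),
    ("icon_sky.png", (512, 512), [0.9, 0.1]),
    ("wall_sky.png", (1080, 1920), [0.8, 0.2]),
]
indexes = split_by_lineage(items)
print(sorted(indexes))  # ['square', 'tall', 'wide']
```

A candidate is then checked only against its own bucket, for example by passing `indexes[lineage_for(w, h)]` as `existing_index` to `prepublish_dedup_gate`.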

What I noticed once it was in production

The biggest change after adding this gate was that I now stop and think before I make something similar. Getting blocked right before publishing is expensive to undo, so when I create a new asset I now deliberately diverge from the existing lineage. The gate blocks duplicates and doubles as a self-check for the person creating.

For rollout order, I recommend clustering all existing assets once to visualize the current state, cleaning up the obvious near-duplicates, and only then making the pre-publish gate permanent. Rejecting everything with a strict threshold from day one feels like negating your past decisions, and it makes day-to-day operation feel cramped.

I was a little hesitant at first to use AI for near-duplicate judgments. But as long as you keep the final "which one stays" decision in your own hands, AI becomes a partner that reduces hesitation. I hope this is a useful first step for anyone struggling with the same bloat in their image assets.
