API / SDK · 2026-06-24 · Advanced

Citing the exact page and figure in File Search answers with visual-citation metadata

File Search grounding metadata now carries media_id and page_numbers, so you can trace each sentence of an answer back to a specific page and figure. Here's how I built a sentence-level, verifiable citation layer over a mix of PDFs and images.


Feed a PDF into File Search and ask Gemini about it, and you could always get back "Source: design-spec.pdf." But the moment a reader or teammate asks "does it really say that?", you couldn't point them to which part of a 47-page PDF to read. As an indie developer running help-reference data for my own apps, I hit this wall over and over and ended up pasting screenshots by hand.

On June 24, 2026, File Search grounding metadata gained media_id (visual citations) and page_numbers, and that manual work is gone. You can now trace which sentence of an answer rests on which page and which figure, straight from the API response. This article walks through building a citation layer that attaches "page number + figure thumbnail" to each sentence, over reference data that mixes PDFs and images.

What actually changed — two new fields in grounding metadata

Until now, grounding metadata was, roughly, chunk-level: "this answer is based on these chunks." The two new fields push that granularity one level finer.

Field	Where it lives	Meaning
page_numbers	retrieved_context of each grounding chunk	Which PDF page(s) the chunk came from (an array when it spans pages)
media_id	retrieved_context of each grounding chunk	The visual-citation identifier — for image-derived chunks (figures, screenshots), it points to which image is the source

The key is how these combine with grounding_supports, which says "this span of the answer is supported by this chunk." Each support entry carries the start and end index of an answer span (the API documents these as byte offsets into the answer text, not character positions) plus the chunk indices behind it. Look up page_numbers and media_id by chunk index, and every sentence traces back to "page 12 of design-spec.pdf, figure 3."

Grasp the response shape first

Before the implementation, let's see what we're handling. A generate_content response with File Search enabled hangs grounding_metadata off candidates[0]. Cleaned up, it looks like this.

# Conceptual structure of grounding_metadata (a tidied real response)
{
  "grounding_chunks": [
    {
      "retrieved_context": {
        "title": "design-spec.pdf",
        "text": "Auth tokens expire after 3600 seconds by default…",
        "page_numbers": [12],          # <- new field
        "media_id": None               # text chunk, so None
      }
    },
    {
      "retrieved_context": {
        "title": "onboarding-flow.png",
        "text": "Login screen transition diagram",
        "page_numbers": None,
        "media_id": "media/abc123"     # <- new field (image-derived)
      }
    }
  ],
  "grounding_supports": [
    {
      "segment": {"start_index": 0, "end_index": 33, "text": "Tokens expire after 3600 seconds."},
      "grounding_chunk_indices": [0],
      "confidence_scores": [0.94]
    }
  ]
}

grounding_supports[i].grounding_chunk_indices points into grounding_chunks. Once you hold that mapping, the rest is just connecting sentences to sources.
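To make that mapping concrete, here is a minimal sketch that works on a plain-dict version of the structure above rather than on SDK objects. The helper name sources_for_support is hypothetical, and the sample assumes page_numbers and media_id appear exactly as in the tidied response.

```python
def sources_for_support(metadata, support_index):
    """Return the retrieved_context dicts backing one grounding support."""
    chunks = metadata.get("grounding_chunks", [])
    support = metadata["grounding_supports"][support_index]
    sources = []
    for idx in support.get("grounding_chunk_indices", []):
        if 0 <= idx < len(chunks):  # skip indices that point outside the chunk list
            sources.append(chunks[idx]["retrieved_context"])
    return sources


# Sample data shaped like the structure above (values are illustrative).
sample = {
    "grounding_chunks": [
        {"retrieved_context": {"title": "design-spec.pdf", "page_numbers": [12], "media_id": None}},
        {"retrieved_context": {"title": "onboarding-flow.png", "page_numbers": None, "media_id": "media/abc123"}},
    ],
    "grounding_supports": [
        {"segment": {"start_index": 0, "end_index": 33}, "grounding_chunk_indices": [0]},
    ],
}

for src in sources_for_support(sample, 0):
    print(src["title"], src["page_numbers"], src["media_id"])
```

Working on dicts first is a cheap way to check your understanding of the join before wiring it to a live response.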

Rendering logic that resolves page + figure per sentence

This is the heart of the article. We split the answer by the grounding_supports spans and attach a page number and image reference to each. The defenses against missing metadata are baked in from the start so you can use it as is.

from google import genai
from google.genai import types
 
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
def build_cited_answer(response):
    """Split the answer into supported spans and attach verifiable sources."""
    cand = response.candidates[0]
    meta = getattr(cand, "grounding_metadata", None)
    answer_text = cand.content.parts[0].text
 
    # No metadata = an ungrounded answer. Return it without sources.
    if meta is None or not getattr(meta, "grounding_supports", None):
        return [{"text": answer_text, "citations": []}]
 
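    chunks = meta.grounding_chunks or []
    # The API documents segment offsets as byte offsets, so slice the UTF-8
    # bytes; slicing the str directly drifts once the answer has non-ASCII text.
    answer_bytes = answer_text.encode("utf-8")
    segments = []

    for sup in meta.grounding_supports:
        seg = sup.segment
        citations = []
        for idx in sup.grounding_chunk_indices or []:
            if not 0 <= idx < len(chunks):
                continue  # guard against index drift
            ctx = chunks[idx].retrieved_context
            if ctx is None:
                continue  # chunk did not come from retrieved context
            citations.append({
                "title": ctx.title,
                "pages": getattr(ctx, "page_numbers", None),   # e.g. [12]
                "media_id": getattr(ctx, "media_id", None),    # e.g. "media/abc123"
            })
        span = answer_bytes[seg.start_index:seg.end_index]
        segments.append({
            "text": span.decode("utf-8", errors="replace"),
            "citations": _dedupe_citations(citations),
        })
    return segments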
 
 
def _dedupe_citations(citations):
    """Collapse citations that point to the same file and page."""
    seen, out = set(), []
    for c in citations:
        key = (c["title"], tuple(c["pages"] or []), c["media_id"])
        if key in seen:
            continue
        seen.add(key)
        out.append(c)
    return out

Run an answer through this and it comes out structured like so.

# Example output of build_cited_answer
[
  {
    "text": "Tokens expire after 3600 seconds.",
    "citations": [{"title": "design-spec.pdf", "pages": [12], "media_id": None}]
  },
  {
    "text": "The post-login flow is shown in the following figure.",
    "citations": [{"title": "onboarding-flow.png", "pages": None, "media_id": "media/abc123"}]
  }
]

Each sentence is now tied to "page 12 of design-spec.pdf" or "the figure in onboarding-flow.png." All that's left is rendering it.
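As one possible rendering step, the sketch below turns a citation dict shaped like the output above into a short text label. The names format_pages and format_citation_label, and the "p." / "pp." / "(figure)" label style, are illustrative choices of mine and not part of the API.

```python
def format_pages(pages):
    """Collapse a page list like [12, 13, 15] into '12-13, 15'."""
    if not pages:
        return ""
    ordered = sorted(set(pages))
    ranges = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            prev = page  # still inside a consecutive run
            continue
        ranges.append((start, prev))
        start = prev = page
    ranges.append((start, prev))
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def format_citation_label(citation):
    """Build a label such as 'design-spec.pdf p.12' or 'onboarding-flow.png (figure)'."""
    title = citation.get("title") or "unknown source"
    pages = citation.get("pages")
    if pages:
        prefix = "p." if len(set(pages)) == 1 else "pp."
        return f"{title} {prefix}{format_pages(pages)}"
    if citation.get("media_id"):
        return f"{title} (figure)"
    return title  # no page or figure metadata: plain chunk citation
```

Collapsing consecutive pages keeps the label short when a chunk spans a page break.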

Pulling the actual figure from a media_id

A media_id is a string identifier, not the image itself. To show a thumbnail, you need one extra step to fetch that media from the File Search store. Whether you can actually show the figure makes or breaks how convincing the citation feels.

def resolve_media_thumbnail(client, media_id):
    """Fetch displayable image bytes from a media_id. Returns None on failure."""
    if not media_id:
        return None
    try:
        # Retrieve the stored media (the retrieval API depends on store config)
        media = client.files.get(name=media_id)
        return media  # a file reference; convert to an <img> src in the UI
    except Exception as e:
        # Expired or deleted media is not a rare case
        print(f"media resolve failed for {media_id}: {e}")
        return None

Wrapping this in try/except matters in production. When you update reference data, old media_id values expire, and even a few seconds of lag between generation and rendering can produce a "media not found" error. If you don't catch that exception and fall back to a text citation here, the whole citation UI crashes. I underestimated this at first and shipped a bug where figures went blank for a moment right after a store rebuild.
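The fallback itself could look something like the sketch below. It assumes the resolver is passed in as a plain callable so the display decision stays independent of whichever retrieval API your store uses; citation_view, always_missing, and the "stale" flag are hypothetical names for illustration.

```python
def citation_view(citation, resolve_media):
    """Decide how to display a citation: as a figure if the media resolves, else as text.

    resolve_media is any callable that takes a media_id and returns a
    displayable reference, or returns None / raises when the media is gone.
    """
    media_id = citation.get("media_id")
    media = None
    if media_id:
        try:
            media = resolve_media(media_id)
        except Exception:
            media = None  # treat any lookup failure as "media unavailable"
    if media is not None:
        return {"kind": "figure", "title": citation["title"], "media": media}
    # Text fallback keeps the citation visible; "stale" can drive a regenerate prompt.
    return {"kind": "text", "title": citation["title"], "stale": bool(media_id)}


def always_missing(media_id):
    # Stand-in resolver that simulates an expired media_id.
    raise LookupError(f"{media_id} not found")


view = citation_view({"title": "onboarding-flow.png", "media_id": "media/abc123"}, always_missing)
print(view)
```

Returning a small view dict instead of raising means the renderer never has to know whether the lookup failed, only which kind of citation to draw.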

Turn PDF page numbers into a reader path

page_numbers is useful just displayed, but wiring it to a PDF viewer's page anchor makes it genuinely practical. Most browser-based PDF viewers open a specific page when the URL ends in a fragment such as #page=12.

def page_anchor_url(base_pdf_url, page_numbers):
    """Turn page_numbers into a URL that opens the right PDF page."""
    if not page_numbers:
        return base_pdf_url
    # Jump to the first page; show a range separately if needed
    return f"{base_pdf_url}#page={page_numbers[0]}"
 
# Usage
url = page_anchor_url("https://example.com/docs/design-spec.pdf", [12])
# -> "https://example.com/docs/design-spec.pdf#page=12"

With that in place, a link like "design-spec.pdf p.12 ↗" sits beside each sentence, and the reader jumps to the supporting paragraph in one click. Moving from "showing" a source to "letting the reader verify" it is what these two fields really unlock.

Pitfalls you will hit in production

The implementation is less of a time sink than the operations around it. Here are the holes I fell into running File Search help references, with fixes.

Symptom	Cause	Fix
Some sentences carry no source	That span isn't in grounding_supports (the model filled in from general knowledge)	Visually distinguish unsourced spans. Label them as "beyond the reference data," not fabrication
page_numbers stays None	A PDF where page boundaries can't be extracted (e.g. scanned images)	Run OCR + page tagging at ingest. Fall back to the page image via media_id
The same page is cited repeatedly	Multiple chunks retrieved from one page	Dedupe per page (the _dedupe_citations above)
media_id won't resolve	Old media expired after a data update	Fall back to a text citation in try/except and prompt a regenerate

That first row is the core of trustworthiness. Honestly showing in the UI that not every sentence has a source actually earns the reader's trust. Force a citation onto every line and you end up stamping "Source: foo.pdf" onto general claims that live outside the reference data — which is exactly where verification falls apart.
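One way to surface unsourced spans is to walk the supported ranges and emit the gaps between them. The sketch below is a rough version of that idea; split_by_coverage is a hypothetical helper, and it treats the offsets as UTF-8 byte offsets, which is how the API documents segment indices. Pass 0 where a segment's start_index comes back empty.

```python
def split_by_coverage(answer_text, supported_ranges):
    """Split the answer into (text, is_sourced) pieces.

    supported_ranges is a list of (start, end) offsets taken from
    grounding_supports segments, treated here as UTF-8 byte offsets.
    """
    data = answer_text.encode("utf-8")
    pieces, cursor = [], 0
    for start, end in sorted(supported_ranges):
        start = max(start, cursor)  # clip overlap with the previous range
        end = min(end, len(data))
        if start >= end:
            continue
        if start > cursor:
            # Gap before this range: text with no supporting chunk.
            pieces.append((data[cursor:start].decode("utf-8", errors="replace"), False))
        pieces.append((data[start:end].decode("utf-8", errors="replace"), True))
        cursor = end
    if cursor < len(data):
        pieces.append((data[cursor:].decode("utf-8", errors="replace"), False))
    return pieces


answer = "Tokens expire after 3600 seconds. Rotate them often."
for text, sourced in split_by_coverage(answer, [(0, 33)]):
    print("[sourced]  " if sourced else "[unsourced]", repr(text))
```

The UI can then style the False pieces differently, for example with a muted color and a "beyond the reference data" tooltip.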

How well it fits your own reference data

This shines for documents with page structure (PDFs, slides) and for data where you want a figure to be the evidence. Conversely, a store of nothing but short text snippets won't carry page_numbers or media_id, and you're back to plain chunk citations. Take a quick inventory of which media your store is built from before adopting this, and you'll save the wasted effort.
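A rough way to take that inventory from the response side is to count how many returned citations actually carry page or figure metadata. The sketch below assumes citation dicts with "pages" and "media_id" keys, as produced by the rendering logic earlier; metadata_coverage and its bucket names are hypothetical.

```python
from collections import Counter


def metadata_coverage(citations):
    """Count how many citations carry pages, a figure, both, or neither."""
    counts = Counter()
    for c in citations:
        has_pages = bool(c.get("pages"))
        has_media = bool(c.get("media_id"))
        if has_pages and has_media:
            counts["pages+figure"] += 1
        elif has_pages:
            counts["pages"] += 1
        elif has_media:
            counts["figure"] += 1
        else:
            counts["plain"] += 1  # falls back to a plain chunk citation
    return dict(counts)


sample_citations = [
    {"title": "design-spec.pdf", "pages": [12], "media_id": None},
    {"title": "onboarding-flow.png", "pages": None, "media_id": "media/abc123"},
    {"title": "faq-snippet.txt", "pages": None, "media_id": None},
]
print(metadata_coverage(sample_citations))
```

If most of your citations land in the "plain" bucket across a handful of test questions, the store probably will not benefit much from the page and figure layer.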

Designing a multimodal store that mixes images in the first place is covered in "unifying text and screenshots into one File Search." If you want the whole verification pipeline for sourced answers, "citation generation and verification for sourced RAG" is the foundation. And the basics of answering from your own data with File Search alone, no RAG, are in "building data-grounded responses with the Gemini File Search API."

Start by dropping one PDF into File Search and running it through build_cited_answer to see whether page_numbers comes back. If the pages it returns line up with the text, your reference data is already ready to produce verifiable citations.
