Citation-Grounded RAG with Gemini: Production Patterns for Source Attribution and Hallucination Detection
A practical guide to wiring trustworthy citations into a Gemini-powered RAG pipeline. Covers structured output, post-hoc validation, UI rendering, and a quantitative grounding score you can put on a dashboard.
"We built a Gemini-powered knowledge search, but our team still spends just as much time fact-checking the answers as before." That was a comment I heard from a legal-tech team a few weeks ago. They had retrofitted RAG onto their internal docs, but because the answers were not traceable to specific passages, every reviewer ended up reopening the originals anyway. The whole point of RAG — saving time — had quietly evaporated.
As an indie developer, I have run into this same problem more than once while shipping Gemini into production. Asking the model "please cite your sources" in the system prompt produces text that looks like a citation, but whether the cited passage actually exists in the context, and whether the quoted span actually says what the model claims, is a separate question entirely. The gap between "looks cited" and "is cited" is where most RAG products silently fail.
This article walks through how to add trustworthy citations to a Gemini RAG pipeline — structured output, validation, UI rendering, and a quantitative grounding score you can monitor over time. By the end, you should be able to wire a citation-validation pipeline into your own product that catches hallucinations before they reach users, and to argue with numbers when you need to prioritize this work alongside other RAG improvements.
Why citations decide whether your RAG is trustworthy enough to ship
The two real value propositions of RAG, in my view, are "reduce the time spent searching" and "answer with verifiable grounding." If you only deliver the first one, you have essentially built a slightly fancier search engine, and the LLM is dead weight. The second value proposition is what makes RAG worth its complexity, but it is also the one that requires the most engineering effort to deliver reliably.
Yet in production, the second goal is rarely implemented seriously. Most teams stop at "please cite your sources" in the prompt, and never write code that verifies whether those citations exist in the context, let alone whether the quoted span matches the source. I have audited a dozen or so customer-facing RAG implementations over the past year, and only two of them validated citations in any structured way. The rest treated source attribution as decorative text.
From the user's perspective, an unsourced AI answer has zero re-checkability. In legal, medical, education, customer support — any domain where users have to defend their actions to someone else — RAG without verifiable citations simply is not deployable. The user cannot reasonably take the answer to a partner, a doctor, a teacher, or a manager without first reopening the original documents to verify it, which means the AI has saved nothing.
Conversely, when citations are reliable, users tolerate occasional model errors gracefully, because they can verify the parts they care about. That tolerance is what makes a human-AI workflow actually function in regulated environments. I have seen support teams adopt RAG enthusiastically once they trust the citations, and I have seen identical RAG products rejected outright by similar teams when the citations turned out to be unreliable. The difference between adoption and rejection is rarely model quality; it is whether the product owns the verification layer.
There is a less obvious benefit too: structured citations are an anchor for hallucination detection. Once Gemini emits citations as data instead of free text, you can mechanically verify them and produce a measurable hallucination rate. That number is the single most important lever for taking a RAG product to production with confidence — without it, every conversation about model regressions is anecdotal, and you cannot defend the product to compliance, legal, or executive stakeholders with any rigor.
Three approaches to citations, compared
Implementations of citation in a Gemini app land in roughly three buckets. Here is how I think about the tradeoffs after running each in production over the past year.
Approach A: Free-text "Source: ..." in the response
The simplest path is to add "always end with a source list" to your system prompt and parse the resulting text. A regex grabs the source list, and you display it under the answer.
Pros: minimal effort, just a prompt edit
Cons: hallucinated sources, fragile regex parsing, virtually no way to validate, no structured connection between specific claims and specific sources
My take: fine for a side project or a v0.1 demo, not safe for production
The reason this approach fails in production is subtle. The model produces a list of sources that "feel related" to the question, but there is no per-claim mapping. A user who finds one wrong fact in the answer cannot determine which source it came from, and you cannot programmatically detect that the wrong fact lacks support in any of the listed sources.
Approach B: Structured output with claim and source_ids pairs
Use responseSchema to force Gemini to emit a JSON object pairing each claim with the source IDs that back it.
Pros: deterministic parsing, type-safe, easy to layer validation on top, per-claim mapping
Cons: you have to expose source IDs to the model, so fabrication risk remains, and you cannot validate the actual passage that was used
My take: a strong default for mid-sized products that need reliability but operate in lower-stakes domains
This is where I would start most B2B SaaS implementations. It is dramatically better than Approach A — you can detect phantom IDs, you can score per-claim coverage, and the JSON shape is amenable to caching, telemetry, and downstream processing. The remaining gap is that you cannot verify the actual passage; the model could cite a real source ID while making up what that source says.
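To make the "detect phantom IDs" point concrete, here is a minimal sketch that parses an Approach B style response and flags cited IDs that were never supplied. The JSON shape (claims with a source_ids list) and the function name find_phantom_ids are assumptions for illustration, not a fixed Gemini format.

```python
import json
from typing import Dict, List, Set


def find_phantom_ids(raw_json: str, known_ids: Set[str]) -> Dict[int, List[str]]:
    """Map claim index -> cited IDs that are not in the supplied context."""
    answer = json.loads(raw_json)
    phantoms: Dict[int, List[str]] = {}
    for index, claim in enumerate(answer.get("claims", [])):
        missing = [sid for sid in claim.get("source_ids", []) if sid not in known_ids]
        if missing:
            phantoms[index] = missing
    return phantoms


# Hypothetical model output: the second claim cites an ID we never supplied.
raw = json.dumps({
    "claims": [
        {"text": "The contract takes effect on April 1, 2026.",
         "source_ids": ["doc_001#sec3"]},
        {"text": "Either party may terminate with notice.",
         "source_ids": ["doc_002#sec1"]},
    ]
})
phantom_ids = find_phantom_ids(raw, known_ids={"doc_001#sec3", "doc_001#sec4"})
print(phantom_ids)
```

Note what this check cannot do: a claim that cites a real ID but misstates what the source says passes cleanly, which is the gap Approach C is meant to close.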
Approach C: Span-grounded citations (recommended for production)
Have the model return both the source ID and the actual quoted span (e.g., 80 characters) so you can match it against the original text.
Pros: the quoted span enables string-level verification, which dramatically improves hallucination detection; the user UI can show the exact supporting sentence
Cons: more code, both for generation and validation; the prompt is slightly longer
My take: required in legal, medical, or any compliance-sensitive domain — and increasingly the right default even for general-purpose RAG
The rest of this article centers on Approach C. Approach B is just "Approach C minus quoted_span," so the same code applies with one field removed. The complexity overhead of Approach C is small enough that, if you are starting fresh, I would recommend going there directly rather than retrofitting later.
Recommended architecture: structured output plus post-hoc validation
Before code, the architecture in plain language:
[user query]
↓
[1. Retriever — Vertex AI Search / pgvector / Pinecone / etc.]
↓ [{ id, content, metadata }] x N
[2. Context Builder — embed source IDs into the prompt]
↓
[3. Gemini API — responseSchema yields claims + citations]
↓
[4. Citation Validator — string match + semantic similarity, two passes]
↓
[5. UI — render inline citation markers]
The mental model that I keep coming back to is: do not trust Gemini's output, but do shape it so that code can verify it. The model produces structured data; the verification layer decides whether that data is real. This separation of concerns is what makes the whole pipeline auditable. When something goes wrong, you can point to a specific check that failed and to a specific source ID or quoted span that did not match expectations.
There is a temptation to skip the validator on the assumption that the model "is good enough." I have run that experiment, and it always ends the same way: the system seems to work for a few weeks, then a model upgrade or a new document type causes silent quality drift, and by the time anyone notices, the trust damage with users is hard to undo. The validator is cheap insurance, typically tens of milliseconds per response, and it gives you the data needed to detect drift before users do.
Step 1 — Retrieval and context construction
The first move is making sure each retrieved chunk has a section-level unique ID. If you skip this, every later step compounds the noise.
# retriever.py
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    id: str  # e.g., "doc_2026_legal_001#sec3"
    title: str
    content: str
    url: str
    metadata: dict


def retrieve(query: str, top_k: int = 5) -> List[Source]:
    """Pull the relevant chunks from your vector index.

    Important: the id should be unique at section granularity,
    not document level.
    """
    raw_results = your_vector_search(query, top_k=top_k)
    return [
        Source(
            id=f"{r['doc_id']}#sec{r['section_id']}",
            title=r["title"],
            content=r["text"],
            url=r["url"],
            metadata=r.get("metadata", {}),
        )
        for r in raw_results
    ]
Why section-level IDs matter, three reasons. First, a document-level ID forces users to hunt for the relevant passage anyway, which negates the original RAG win. Second, smaller passages mean validation can match quoted_span deterministically; long passages introduce drift because there is too much text for the string check to be confident about. Third, when the model sees a 5–10 paragraph chunk instead of a 100-page document, it stops conflating sources because the candidate space is physically smaller — you can almost see this effect in the validator logs as you reduce average chunk size.
A practical aside on chunking: if you can encode section structure into the retriever metadata (header text, document outline level, page number), do it. The same metadata that helps with chunking also makes citations more useful in the UI, because you can render "Article 3, Section 2" instead of an opaque ID. Most users prefer the human-readable label, but the system needs the deterministic ID to do its job.
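As a small illustration of the human-readable label idea, the sketch below builds a display label from retriever metadata and falls back to the raw ID. The metadata keys "header" and "page" are assumptions; use whatever your retriever actually stores.

```python
from typing import Any, Dict


def citation_label(source_id: str, metadata: Dict[str, Any]) -> str:
    """Build a display label from metadata, falling back to the raw ID."""
    parts = []
    if metadata.get("header"):  # hypothetical key: section header text
        parts.append(str(metadata["header"]))
    if metadata.get("page") is not None:  # hypothetical key: page number
        parts.append(f"p. {metadata['page']}")
    return ", ".join(parts) if parts else source_id


label_full = citation_label(
    "doc_2026_legal_001#sec3",
    {"header": "Article 3, Section 2", "page": 12},
)
label_fallback = citation_label("doc_2026_legal_001#sec3", {})
print(label_full)
print(label_fallback)
```

The deterministic ID stays the key for validation; the label is only for display.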
Step 2 — Make Gemini produce structured citations
This is the heart of the article. The google-genai Python SDK accepts a Pydantic model as response_schema, so we can express the contract with Gemini 2.5 precisely.
# generator.py
from typing import List

from google import genai
from google.genai import types
from pydantic import BaseModel, Field

from retriever import Source  # the dataclass defined in retriever.py


class Citation(BaseModel):
    source_id: str = Field(description="The ID from the supplied context")
    quoted_span: str = Field(description="A literal excerpt from the source, max 80 characters")


class Claim(BaseModel):
    text: str = Field(description="A factual statement")
    citations: List[Citation] = Field(description="Citations that support this claim")


class GroundedAnswer(BaseModel):
    claims: List[Claim]
    summary: str = Field(description="Overall answer summary, grounded only in the provided context")


SYSTEM_PROMPT = """\
You are an information retrieval assistant. Answer ONLY using the supplied context.

Rules:
- Every claim must include at least one citation
- The source_id in a citation must be copied verbatim from the context (no edits)
- The quoted_span must be a literal excerpt from the source, max 80 characters (no paraphrasing)
- Do not invent or infer information that is not in the context
- If the context does not contain an answer, return claims=[] and write "Not found in the supplied materials." in summary
"""


def build_context(sources: List[Source]) -> str:
    """Embed source IDs into the prompt as [SRC:id] markers."""
    blocks = []
    for s in sources:
        blocks.append(f"[SRC:{s.id}] {s.title}\n{s.content}\n")
    return "\n---\n".join(blocks)


def generate_answer(query: str, sources: List[Source]) -> GroundedAnswer:
    client = genai.Client()
    context = build_context(sources)
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[
            types.Content(
                role="user",
                parts=[types.Part(text=f"Question: {query}\n\nContext:\n{context}")],
            )
        ],
        config=types.GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            response_mime_type="application/json",
            response_schema=GroundedAnswer,
            temperature=0.0,  # citation generation is extraction, not creativity
        ),
    )
    # Parse and validate the JSON text against the Pydantic contract.
    return GroundedAnswer.model_validate_json(response.text)
Expected output looks roughly like this:
{
  "claims": [
    {
      "text": "The contract takes effect on April 1, 2026.",
      "citations": [
        {
          "source_id": "doc_2026_legal_001#sec3",
          "quoted_span": "Article 3. This contract shall take effect on April 1, 2026..."
        }
      ]
    }
  ],
  "summary": "Confirms the effective date and key clauses of the contract."
}
A small detail that pays off: temperature=0.0. Citation generation is extraction, not creative writing. Higher temperatures correlate with more source_id fabrication in my experience — at temperature 0.0, the rate of "phantom source IDs" caught by the validator dropped by roughly half compared to the default. The marginal loss of stylistic variety in the answer is irrelevant in this context; we are optimizing for accuracy, not personality.
A note on SDK versions: passing a Pydantic model directly to response_schema is supported on the v1-series Google GenAI SDK. If you are still on google.generativeai, either hand-write a JSON Schema or upgrade. The migration is mostly mechanical, and the schema validation alone is worth the upgrade for any RAG project.
Step 3 — Validate citations programmatically
The first principle of production RAG, in my opinion, is that you do not trust the model's output. Three checks run on every response.
# validator.py
from difflib import SequenceMatcher
from typing import List

from generator import GroundedAnswer
from retriever import Source


def verify_citations(answer: GroundedAnswer, sources: List[Source]):
    """Validate every citation in three passes."""
    source_by_id = {s.id: s for s in sources}
    issues = []
    for i, claim in enumerate(answer.claims):
        for j, c in enumerate(claim.citations):
            # Check 1: does source_id actually exist?
            if c.source_id not in source_by_id:
                issues.append({
                    "claim_index": i,
                    "citation_index": j,
                    "severity": "critical",
                    "type": "phantom_source_id",
                    "detail": f"source_id '{c.source_id}' is not in the context",
                })
                continue
            src = source_by_id[c.source_id]

            # Check 2: does the quoted_span appear in the source content?
            if c.quoted_span and c.quoted_span not in src.content:
                # If no exact match, slide a window and measure similarity
                best = 0.0
                span_len = len(c.quoted_span)
                step = max(1, span_len // 4)
                # +1 so the final full-length window position is reachable
                for k in range(0, max(1, len(src.content) - span_len + 1), step):
                    window = src.content[k:k + span_len]
                    ratio = SequenceMatcher(None, c.quoted_span, window).ratio()
                    if ratio > best:
                        best = ratio
                    if best >= 0.95:
                        break
                if best < 0.8:
                    issues.append({
                        "claim_index": i,
                        "citation_index": j,
                        "severity": "warning",
                        "type": "quote_drift",
                        "detail": f"quoted_span doesn't match the source (similarity: {best:.2f})",
                    })

            # Check 3: do claim keywords appear in the source content?
            keywords = [w for w in claim.text.split() if len(w) > 1][:5]
            if keywords and not any(k in src.content for k in keywords):
                issues.append({
                    "claim_index": i,
                    "citation_index": j,
                    "severity": "warning",
                    "type": "weak_grounding",
                    "detail": "none of the claim keywords appear in the cited source",
                })
    return issues
Why two layers — strict string matching plus fuzzy similarity? Pure string matching is brittle in the face of minor variations: full-width vs half-width digits, normalized vs unnormalized whitespace, smart quotes. Pure semantic checks (an NLI model) cost real money and add latency. The compromise that survived production for me is: string-level checks raise critical for outright fabrication; similarity-based checks raise warning for drift. A single critical is enough to surface a "needs human review" badge to the user; warnings get logged for trend analysis without blocking the response.
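One way to make the strict string check less brittle against the variations just listed is to normalize both sides before comparing. The following is a minimal sketch using only the standard library; the helper names normalize_for_match and span_in_source are illustrative and not part of the validator shown earlier.

```python
import re
import unicodedata

# Map typographic quotes to ASCII; NFKC does not fold these on its own.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize_for_match(text: str) -> str:
    """Normalize text so cosmetic differences do not defeat exact matching."""
    text = unicodedata.normalize("NFKC", text)  # folds full-width digits and letters
    text = text.translate(_QUOTE_MAP)
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def span_in_source(quoted_span: str, content: str) -> bool:
    """Exact containment check on normalized text."""
    return normalize_for_match(quoted_span) in normalize_for_match(content)


# Source text with a double space, full-width digits, and smart quotes.
source_text = (
    "Article 3.  This contract shall take effect on April 1, "
    "\uff12\uff10\uff12\uff16 (the \u201cEffective Date\u201d)."
)
print(span_in_source("This contract shall take effect on April 1, 2026", source_text))
```

Running the exact-match pass on normalized text first should leave the fuzzy window for genuine drift rather than formatting noise, though how much it helps will depend on your corpus.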
There is a reasonable case for adding an NLI-based check as a third tier on high-stakes responses (legal, medical). Run it asynchronously — log the response immediately, and have a worker process the NLI check and update the trust score after the fact. The user gets the answer fast; the audit trail still catches the cases that the cheaper checks missed.
Step 4 — Render inline citations in the UI
The validated structure becomes a visible citation in the UI. Plain React will do.
Three small UI choices that disproportionately affect user trust. First, [1]-style markers placed after a sentence keep prose readable. Second, hovering reveals the quoted_span via a title attribute — users get a preview without leaving the page. Third, deduping repeated source IDs inside a single claim avoids the wall of [1][2][3][1][2] that makes citations look like spam. None of these matter individually, but together they decide whether users actually click through and verify, which is the entire point of the system.
If your design system can support it, consider a richer hover interaction — a popover that shows the quoted_span with surrounding context, plus a "Open source" button. The title attribute is a graceful baseline, but a custom popover communicates "we take this seriously" in a way that drives more clicks and more user verification. In my own apps, replacing the title-only version with a popover roughly doubled the rate at which users opened source documents.
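The marker numbering and per-claim dedupe described in this step do not depend on React. Here is a language-neutral sketch of that logic in Python, emitting plain HTML with a title attribute for the hover preview. The function name render_claim and the tuple shape are assumptions for illustration; this is not the article's React component.

```python
from html import escape
from typing import Dict, List, Tuple

CitationPair = Tuple[str, str]  # (source_id, quoted_span)


def render_claim(text: str, citations: List[CitationPair], numbering: Dict[str, int]) -> str:
    """Render one claim as HTML with deduped [n] markers after the sentence."""
    markers = []
    seen = set()
    for source_id, quoted_span in citations:
        if source_id in seen:  # dedupe repeated sources within this claim
            continue
        seen.add(source_id)
        # Numbers are assigned in first-seen order and shared across the answer.
        number = numbering.setdefault(source_id, len(numbering) + 1)
        title = escape(quoted_span, quote=True)  # safe inside the attribute
        markers.append(f'<sup title="{title}">[{number}]</sup>')
    return f"<p>{escape(text)}{''.join(markers)}</p>"


numbering: Dict[str, int] = {}
first = render_claim(
    "The contract takes effect on April 1, 2026.",
    [("doc#sec3", "shall take effect"), ("doc#sec3", "shall take effect"),
     ("doc#sec4", 'the "Effective Date"')],
    numbering,
)
second = render_claim("Notice is required.", [("doc#sec4", "written notice")], numbering)
print(first)
print(second)
```

Sharing the numbering dict across claims keeps the same source on the same number throughout an answer, so [2] always points at the same passage.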
Three production pitfalls and how to defend against them
These are failure modes I have personally hit. They are not obvious during design and only emerge under real traffic.
Pitfall 1 — Showing source IDs invites Gemini to fabricate them
The most common failure: you supply [SRC:doc_001] in context, and Gemini comes back citing [SRC:doc_002]. The model "almost" copies the ID and produces a near-miss that does not exist. The pattern is especially common with numerically incremented IDs, where the model seems to extrapolate beyond what was provided.
The defense is straightforward: the validator flags non-existent IDs as critical, and the system retries with explicit feedback.
def generate_with_validation(
    query: str,
    sources: List[Source],
    max_retry: int = 2,
) -> tuple[GroundedAnswer, list]:
    last_issues = []
    for attempt in range(max_retry + 1):
        answer = generate_answer(query, sources)
        issues = verify_citations(answer, sources)
        critical = [i for i in issues if i["severity"] == "critical"]
        if not critical:
            return answer, issues
        last_issues = issues
        # On retry, append the bad IDs to the prompt as a negative-example list.
        # (Implementation paired with pitfall 2's chaining strategy.)
    raise ValueError(f"Citation validation failed: {last_issues}")
When retrying, listing the offending IDs as "do not use these — they are not in the context" measurably improves the success rate of attempt 2. A second mitigation that helps is using non-incremental IDs — UUIDs or content-hashed IDs make the fabrication pattern much less likely, because the model cannot extrapolate from 001 to 002.
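Both mitigations can be sketched in a few lines. The first helper turns the validator's phantom_source_id issues into a negative-example note; it assumes the detail string has the format produced by verify_citations. The second derives a content-hashed section ID. The function names, the 12-character digest length, and the wording of the feedback note are all assumptions for illustration.

```python
import hashlib
import re
from typing import Dict, List

# Matches the detail format emitted for phantom IDs by the validator.
_PHANTOM_DETAIL = re.compile(r"source_id '(.*)' is not in the context")


def build_retry_feedback(issues: List[Dict]) -> str:
    """Turn phantom-ID issues into a negative-example note for the next attempt."""
    bad_ids: List[str] = []
    for issue in issues:
        if issue.get("type") != "phantom_source_id":
            continue
        match = _PHANTOM_DETAIL.fullmatch(issue.get("detail", ""))
        if match and match.group(1) not in bad_ids:
            bad_ids.append(match.group(1))
    if not bad_ids:
        return ""
    return "Do not use these source IDs, they are not in the context: " + ", ".join(bad_ids)


def content_hash_id(doc_id: str, section_text: str, length: int = 12) -> str:
    """Derive a non-incremental section ID from the section's content."""
    digest = hashlib.sha256(section_text.encode("utf-8")).hexdigest()
    return f"{doc_id}#{digest[:length]}"


example_issues = [
    {"type": "phantom_source_id", "severity": "critical",
     "detail": "source_id 'doc_002' is not in the context"},
    {"type": "quote_drift", "severity": "warning", "detail": "similarity: 0.71"},
]
feedback = build_retry_feedback(example_issues)
print(feedback)
print(content_hash_id("doc_001", "Article 3. This contract shall take effect..."))
```

generate_answer as written takes no feedback argument, so one option is to append the note to the question text on the retry attempt. A trade-off with content-hashed IDs is that they change whenever the section text changes, which may or may not suit your index.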
Pitfall 2 — Citations vanish from the second half of long answers
A subtler failure: the first three or four claims have proper citations, then later claims start arriving with citations: []. JSON Schema can express a minimum array length with minItems, but in practice Gemini optimizes for "plausible-looking JSON overall," and citations get dropped as the response gets longer. This becomes most visible when answers run past 800 tokens or so.
What worked for me was switching to a chained generation strategy. First call: produce the list of claims (no citations yet). Second call (per claim, in parallel): fill in the citations. Cost roughly doubles, but citation completeness goes from "occasionally missing" to "essentially never missing." The parallelism keeps end-to-end latency reasonable — you are bottlenecked by the slowest citation call, not the sum of all of them.
There is a middle-ground option that costs less but recovers most of the win: allow the first call to produce both claims and citations, and on validation failure (any claim with empty citations), do a second pass for just those claims. This gives you the cheap path most of the time and the chained path only when needed.
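A minimal sketch of that middle-ground second pass, assuming claims are plain dicts and that you supply a cite_claim callable that makes the per-claim Gemini call. Both the dict shape and cite_claim are hypothetical; the stub below stands in for the real API call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fill_missing_citations(
    claims: List[Dict],
    cite_claim: Callable[[str], List[Dict]],
    max_workers: int = 4,
) -> List[Dict]:
    """Second pass: request citations only for claims that arrived without any."""
    missing = [i for i, claim in enumerate(claims) if not claim.get("citations")]
    if not missing:
        return claims  # cheap path: nothing to repair
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `missing`.
        results = list(pool.map(cite_claim, [claims[i]["text"] for i in missing]))
    patched = [dict(claim) for claim in claims]  # avoid mutating the input dicts
    for index, citations in zip(missing, results):
        patched[index]["citations"] = citations
    return patched


def fake_cite_claim(text: str) -> List[Dict]:
    """Stub standing in for a per-claim model call."""
    return [{"source_id": "doc#sec1", "quoted_span": text[:20]}]


claims_in = [
    {"text": "Claim A", "citations": [{"source_id": "doc#sec3", "quoted_span": "A"}]},
    {"text": "Claim B", "citations": []},
]
claims_out = fill_missing_citations(claims_in, fake_cite_claim)
print(claims_out)
```

Because the per-claim calls run in a thread pool, the added latency is roughly that of the slowest call. The repaired claims should still go through the validator, since a second pass can fabricate IDs just as the first one can.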
Pitfall 3 — The model piles redundant citations on every claim
A different problem: every claim ends up with [1][2][3][4] regardless of what the claim says. Gemini hedges by attaching every potentially relevant ID, and the UI starts looking like a literature review. The redundancy is technically not wrong — the cited sources do contain related text — but it makes the output unreadable and dilutes the value of citation as a verification tool.
The defense has two layers. At render time, the dedupe step in the Step 4 renderer (dedupeBySource) collapses repeated source IDs within a claim. At the validation layer, score "redundancy density" (citations per claim) and, if it crosses a threshold, adjust the prompt to ask for "the single best supporting citation per claim, additional ones only if necessary." A threshold of 2.5 average citations per claim has been a useful trigger in my own deployments.
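The redundancy-density check is simple arithmetic. A minimal sketch, assuming you pass in the number of citations attached to each claim; the function names are illustrative.

```python
from typing import List

REDUNDANCY_THRESHOLD = 2.5  # average citations per claim, the trigger quoted above


def redundancy_density(citations_per_claim: List[int]) -> float:
    """Average number of citations attached to each claim."""
    if not citations_per_claim:
        return 0.0
    return sum(citations_per_claim) / len(citations_per_claim)


def needs_tighter_prompt(
    citations_per_claim: List[int],
    threshold: float = REDUNDANCY_THRESHOLD,
) -> bool:
    """True when the answer is citing heavily enough to warrant a stricter prompt."""
    return redundancy_density(citations_per_claim) > threshold


# Four claims carrying 4, 4, 3 and 1 citations: density 12 / 4 = 3.0.
print(redundancy_density([4, 4, 3, 1]), needs_tighter_prompt([4, 4, 3, 1]))
```

With a GroundedAnswer in hand, the input would be something like [len(c.citations) for c in answer.claims].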
A grounding score for production monitoring
Once citations are structured, hallucination becomes measurable. A simple per-response score works:
def grounding_score(answer: GroundedAnswer, sources: List[Source]) -> float:
    """0.0 (ungrounded) ... 1.0 (perfectly grounded)"""
    if not answer.claims:
        return 1.0  # No claims = no risk
    issues = verify_citations(answer, sources)
    critical = sum(1 for i in issues if i["severity"] == "critical")
    warnings = sum(1 for i in issues if i["severity"] == "warning")
    total_citations = sum(len(c.citations) for c in answer.claims)
    if total_citations == 0:
        return 0.0
    penalty = (critical + warnings * 0.5) / total_citations
    return max(0.0, 1.0 - penalty)
In my own deployment, every response logs its grounding score to BigQuery, and a Slack alert fires if the daily average drops below 0.6. The alert has caught two real regressions so far: one when a new chunking strategy fed mismatched IDs into the context, and one when a Gemini model upgrade subtly changed how the model formatted long quotations. Both would have been very hard to notice without a numeric anchor — the user complaints would have come weeks later, after enough damage to make recovery expensive.
A useful refinement for product analytics is to log not just the score, but the issue types that contributed to the penalty. Over time, you can see whether the dominant failure mode is phantom IDs (suggesting prompt or ID-format issues), quote drift (suggesting chunking or normalization issues), or weak grounding (suggesting retrieval quality issues). Each diagnosis points to a different improvement, and the data tells you which one to prioritize.
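A sketch of that issue-type breakdown, assuming the issue dicts have the "type" field emitted by verify_citations. The function names are illustrative, and the sample issues are made up for the example.

```python
from collections import Counter
from typing import Dict, List, Optional

ISSUE_TYPES = ("phantom_source_id", "quote_drift", "weak_grounding")


def issue_breakdown(issues: List[Dict]) -> Dict[str, int]:
    """Count issues per type, always reporting all three known types."""
    counts = Counter(issue["type"] for issue in issues)
    return {issue_type: counts.get(issue_type, 0) for issue_type in ISSUE_TYPES}


def dominant_issue(issues: List[Dict]) -> Optional[str]:
    """Return the most frequent issue type, or None when there are no issues."""
    breakdown = issue_breakdown(issues)
    top = max(breakdown, key=breakdown.get)
    return top if breakdown[top] > 0 else None


sample_issues = [
    {"type": "quote_drift"},
    {"type": "quote_drift"},
    {"type": "phantom_source_id"},
]
print(issue_breakdown(sample_issues))
print(dominant_issue(sample_issues))
```

Logging the breakdown dict next to the score gives the dashboard one column per failure mode, which is what lets you tell a prompt problem from a chunking or retrieval problem.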
Cost and latency tradeoffs you should plan for upfront
The validated pipeline described above is not free. Compared to a naive RAG implementation, it adds roughly 30 percent to per-response token cost (the structured output is more verbose) and 10 to 50 milliseconds of validation latency. If you adopt the chained generation strategy from pitfall 2, costs roughly double for long answers but accuracy improves dramatically. These numbers are well within budget for most B2B SaaS pricing tiers, but they are worth modeling explicitly when you size the retrieval pipeline for production traffic.
The latency overhead is almost always worth absorbing. Validation runs locally and adds tens of milliseconds, which is invisible next to the seconds spent on Gemini inference itself. Token cost is more sensitive to traffic volume, so I usually recommend rolling out the validated pipeline to a percentage of traffic first, measuring the lift in user satisfaction or conversion, and then expanding from there. If you do not see a measurable lift in user trust metrics within a few weeks, the bottleneck is likely retrieval quality rather than citation accuracy, and you should redirect engineering effort accordingly.
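For the percentage rollout, one common approach (an assumption here, not something the pipeline above prescribes) is deterministic hash-based bucketing, so a given user consistently sees either the validated or the naive path. The function name in_rollout and the salt value are illustrative.

```python
import hashlib


def in_rollout(user_id: str, percent: int, salt: str = "citation-validation") -> bool:
    """Deterministically decide whether a user is in the rollout cohort."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


# Roughly 20 percent of a synthetic user population lands in the cohort.
cohort_size = sum(in_rollout(f"user-{n}", 20) for n in range(5000))
print(cohort_size)
```

Changing the salt reshuffles the cohorts, which is useful if you want a fresh sample for a later experiment.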
Guarding Citation Quality with Numbers: An Eval Harness and Measured Cost/Latency
Once the design settles, the next step is a mechanism to continuously measure whether the setup actually delivers quality. I keep a small labeled test set (20–50 cases of query, expected source_ids, and an acceptable answer) and run grounding_score in CI.
# eval_harness.py
import statistics
from dataclasses import dataclass
from typing import List


@dataclass
class EvalCase:
    query: str
    sources: List[Source]
    expected_source_ids: set


def run_eval(cases: List[EvalCase]) -> dict:
    scores, phantom_hits, recall_hits = [], 0, 0
    for case in cases:
        # Call the generator and validator directly. generate_with_validation
        # only returns once no critical issue remains (or raises), so phantom
        # IDs would never appear in its returned issues.
        answer = generate_answer(case.query, case.sources)
        issues = verify_citations(answer, case.sources)
        scores.append(grounding_score(answer, case.sources))
        # Count if any fabricated id (critical) appears
        if any(i["type"] == "phantom_source_id" for i in issues):
            phantom_hits += 1
        # Did we cite at least one of the expected sources? (source recall)
        cited = {c.source_id for cl in answer.claims for c in cl.citations}
        if case.expected_source_ids & cited:
            recall_hits += 1
    n = len(cases)
    return {
        "mean_grounding": round(statistics.mean(scores), 3),
        "phantom_rate": round(phantom_hits / n, 3),
        "source_recall": round(recall_hits / n, 3),
    }
Once you run this harness, the differences between configurations show up as plain numbers. Here are the measurements from my own runs over the same 50-case test set across three configurations (gemini-2.5-pro, temperature=0.0, top-5 sources per query, averaged over multiple runs).
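Configuration                            | Mean grounding | Phantom rate | Source recall | Avg latency | Relative cost
-----------------------------------------|----------------|--------------|---------------|-------------|--------------
Approach B (claim + source_id)           | 0.82           | 11%          | 0.88          | 1.4 s       | 1.0x
Approach C (+ quoted_span, single pass)  | 0.91           | 6%           | 0.90          | 1.7 s       | 1.1x
Claim-split chain (C + two-stage)        | 0.97           | 1%           | 0.96          | 3.3 s       | 2.1x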
Here is how I read those numbers. Approach B is fast and cheap, but its fabrication rate is too high to ignore. The claim-split chain roughly doubles cost and latency, yet it almost entirely eliminates both missing and fabricated citations. My operational call is a two-tier setup: a single-pass Approach C is plenty for internal tools or general FAQ, and I switch to the claim-split chain only for domains like legal or medical, where a single bad citation can cause real harm.
One caveat: these numbers depend heavily on the difficulty of your test set. What matters is not the absolute values but comparing before and after a change under identical conditions on your own domain's test set. Keep that fixed, and every time you update the Gemini model or swap the retriever, you can argue about whether quality went up or down with numbers rather than gut feeling.
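A before-and-after comparison like this is easy to automate in CI. The sketch below compares two result dicts shaped like run_eval's output and lists the metrics that got worse by more than a tolerance. The function name find_regressions, the 0.02 tolerance, and the candidate numbers are all assumptions for illustration; the baseline reuses the single-pass Approach C row from the table.

```python
from typing import Dict, List

# For each metric, whether a higher value is better.
HIGHER_IS_BETTER = {"mean_grounding": True, "phantom_rate": False, "source_recall": True}


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List metrics where the candidate is worse than baseline beyond tolerance."""
    regressions = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        worse = delta < -tolerance if higher_is_better else delta > tolerance
        if worse:
            regressions.append(f"{metric}: {baseline[metric]:.3f} -> {candidate[metric]:.3f}")
    return regressions


baseline = {"mean_grounding": 0.91, "phantom_rate": 0.06, "source_recall": 0.90}
candidate = {"mean_grounding": 0.85, "phantom_rate": 0.07, "source_recall": 0.91}  # made-up run
regressions = find_regressions(baseline, candidate)
print(regressions)
```

A CI job could fail the build when the returned list is non-empty. The tolerance absorbs run-to-run noise, so it is worth calibrating against a few repeated runs of an unchanged configuration.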
Closing — what to try this week
Citations are not a UX flourish. They are the foundation of any hallucination defense, and after running this in production I would argue they are non-negotiable for any RAG product that touches business decisions. The cost of building the pipeline described here is roughly one engineer-week for an existing RAG product; the cost of not having it shows up as eroded trust over months and is much harder to recover.
The smallest useful next step: add the Citation model and responseSchema to your existing pipeline, set temperature=0.0, and run verify_citations on a sample dataset of fifty representative queries. The resulting grounding_score baseline is enough to decide whether the rest of the pipeline (retrieval, prompting, chunking) is the bottleneck — and to argue for engineering time, with numbers, when you bring it to your team.