Advanced · 2026-06-14

Trusting Gemini Structured Output in Production — Schema Design, Double Validation, and Bounded Retries

Gemini's structured output guarantees parseable JSON, not correct values. Notes on schema design with @google/genai, why propertyOrdering matters, a Zod double-validation layer, handling MAX_TOKENS truncation, and a bounded-retry extraction pipeline.

gemini structured-output json-schema zod production

While running an invoice-sorting job, I noticed that a few records each month had a total that didn't match the sum of their line items. The JSON parsed fine. The schema validation was green. And yet the values were wrong.

This is where a lot of people stumble when they put structured output into production. Add responseMimeType: "application/json" and you really do get parseable JSON every time. But "parseable" and "correct for the business" are two different things. If you don't draw that line up front, quietly broken data flows downstream.

What follows is the order in which things actually mattered for me in production: current schema design with @google/genai, a two-layer validation approach, how to tell failures apart, and how to recover them. As of June 2026 the default model has moved up to Gemini 3.5 Flash and Structured Outputs is GA. With the behavior settling down, now feels like a good moment to firm up the design.

Why "structured output equals safe" doesn't hold

Structured output reliably guarantees roughly three things: the output follows the JSON types you specified, required fields are never missing, and values stay within the range you listed in enum. That's a big step up from the days of peeling JSON out of free-form text with regular expressions.

Just as clearly, it does not guarantee several other things: whether a number makes business sense (does the line-item sum match the grand total?), whether a date is real (will 2026-02-30 be rejected?), or whether fields agree with each other (if paymentStatus is paid, is paidDate actually present?). All of those live outside the schema.

In other words, structured output is the layer that guarantees shape, not the layer that guarantees meaning. In my experience running this as an indie developer over the past few months, keeping the two as separate pieces of code in a production pipeline is what breaks least.
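
A cross-field rule such as the paymentStatus / paidDate example can only live in code. The following is a minimal sketch, assuming a hypothetical paidDate field (it is not part of the invoice schema used later in this article) and a hypothetical helper name checkPaymentFields:

```typescript
// Hypothetical record shape for illustration; paidDate is an assumed field.
type PaymentRecord = {
  paymentStatus: "paid" | "pending" | "overdue";
  paidDate?: string; // YYYY-MM-DD when present
};

// Returns a list of human-readable problems; an empty list means consistent.
function checkPaymentFields(rec: PaymentRecord): string[] {
  const issues: string[] = [];
  if (rec.paymentStatus === "paid" && !rec.paidDate) {
    issues.push("paymentStatus is paid but paidDate is missing");
  }
  if (rec.paymentStatus !== "paid" && rec.paidDate) {
    issues.push(`paidDate is set but paymentStatus is ${rec.paymentStatus}`);
  }
  return issues;
}
```

Both records pass any schema that marks paidDate as optional, which is exactly why the rule has to sit in a separate layer.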

Schema design — teaching the model the shape

Here's the minimal setup with the current SDK. Moving from the old @google/generative-ai to @google/genai, calls consolidate into ai.models.generateContent, and configuration goes inside config.

// structured-review.ts
import { GoogleGenAI, Type } from "@google/genai";
 
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
 
// description acts as an instruction to the model. Write what goes in, not just the type
const reviewSchema = {
  type: Type.OBJECT,
  properties: {
    productName: { type: Type.STRING, description: "Name of the reviewed product" },
    rating: {
      type: Type.INTEGER,
      description: "Integer 1-5. Do not allow decimals or out-of-range values",
    },
    pros: {
      type: Type.ARRAY,
      items: { type: Type.STRING },
      description: "Positives. Only items grounded in the text",
    },
    cons: {
      type: Type.ARRAY,
      items: { type: Type.STRING },
      description: "Negatives. Only items grounded in the text",
    },
    sentiment: {
      type: Type.STRING,
      enum: ["positive", "neutral", "negative"],
      description: "Overall tone",
    },
  },
  // propertyOrdering pins the order the model generates fields in
  propertyOrdering: ["productName", "rating", "pros", "cons", "sentiment"],
  required: ["productName", "rating", "sentiment"],
};
 
export async function analyzeReview(reviewText: string) {
  const res = await ai.models.generateContent({
    model: "gemini-3.5-flash",
    contents: `Analyze the following review:\n\n${reviewText}`,
    config: {
      responseMimeType: "application/json",
      responseSchema: reviewSchema,
    },
  });
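  // res.text is undefined when the response carries no text part (for example, a blocked prompt)
  if (!res.text) throw new Error("Empty response from model");
  return JSON.parse(res.text);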
}

Three quiet things matter here.

description is not decoration; it's read as a real instruction. Adding "do not allow decimals or out-of-range values" rather than just "integer 1-5" visibly reduced misbehavior at the boundaries. Treat it as the place to write your ground rules rather than type documentation, and quality goes up.

propertyOrdering is easy to overlook, but fixing the generation order improves stability. The model uses earlier fields as context for later ones, so putting rating before pros/cons makes the score and its reasons less likely to disagree. Conversely, putting an important judgment field last lets it get dragged around by the verbose arrays in front of it.

required stays minimal. Make everything required and the model will fabricate values for fields it can't fill. Letting the schema say "omit if absent" actually cuts down hallucination.
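
Because propertyOrdering and required are plain string lists, a typo in either can slip through unnoticed. Below is a small sketch of a pre-flight check that could run in a unit test. The SchemaNode type and the lintSchema name are assumptions made for this example, not part of @google/genai, and the check reads only the fields shown.

```typescript
// Minimal structural view of a schema node; only the fields this check reads.
type SchemaNode = {
  properties?: Record<string, SchemaNode>;
  items?: SchemaNode;
  propertyOrdering?: string[];
  required?: string[];
};

// Reports names listed in propertyOrdering or required that are not declared
// in properties, and declared properties that propertyOrdering leaves out.
function lintSchema(node: SchemaNode, path = "$"): string[] {
  const problems: string[] = [];
  const props = node.properties ?? {};
  const keys = Object.keys(props);
  if (node.propertyOrdering) {
    for (const k of node.propertyOrdering) {
      if (!keys.includes(k)) problems.push(`${path}: propertyOrdering lists unknown "${k}"`);
    }
    for (const k of keys) {
      if (!node.propertyOrdering.includes(k)) problems.push(`${path}: "${k}" missing from propertyOrdering`);
    }
  }
  for (const k of node.required ?? []) {
    if (!keys.includes(k)) problems.push(`${path}: required lists unknown "${k}"`);
  }
  // Recurse into nested objects and array item schemas.
  for (const [k, child] of Object.entries(props)) {
    problems.push(...lintSchema(child, `${path}.${k}`));
  }
  if (node.items) problems.push(...lintSchema(node.items, `${path}[]`));
  return problems;
}
```

An empty result means the three lists agree; anything else is worth fixing before the schema reaches the model.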

Nested extraction — close the gaps with enum

Real schemas aren't flat. When nesting and arrays mix, as in an invoice, push required down into child objects and always close status-like fields with enum.

// invoice-schema.ts
import { Type } from "@google/genai";
 
export const invoiceSchema = {
  type: Type.OBJECT,
  properties: {
    invoiceNumber: { type: Type.STRING, description: "Invoice number" },
    issueDate: {
      type: Type.STRING,
      description: "Issue date. Always a real date in YYYY-MM-DD format",
    },
    vendor: {
      type: Type.OBJECT,
      properties: {
        name: { type: Type.STRING },
        taxId: { type: Type.STRING, description: "Registration number; omit if absent" },
      },
      required: ["name"],
    },
    lineItems: {
      type: Type.ARRAY,
      items: {
        type: Type.OBJECT,
        properties: {
          description: { type: Type.STRING },
          quantity: { type: Type.NUMBER },
          unitPrice: { type: Type.NUMBER, description: "Unit price, pre-tax" },
          amount: { type: Type.NUMBER, description: "quantity x unitPrice (pre-tax)" },
        },
        propertyOrdering: ["description", "quantity", "unitPrice", "amount"],
        required: ["description", "quantity", "unitPrice", "amount"],
      },
    },
    grandTotal: { type: Type.NUMBER, description: "Total including tax" },
    paymentStatus: {
      type: Type.STRING,
      enum: ["paid", "pending", "overdue"],
    },
    currency: { type: Type.STRING, enum: ["JPY", "USD", "EUR"] },
  },
  propertyOrdering: [
    "invoiceNumber", "issueDate", "vendor",
    "lineItems", "grandTotal", "paymentStatus", "currency",
  ],
  required: ["invoiceNumber", "issueDate", "lineItems", "grandTotal", "paymentStatus", "currency"],
};

Leave a status field like paymentStatus or currency as a free string and you will inevitably get Paid, PAID, and payed mixed together. Close it with enum and downstream branching reduces to simple equality. Just tightening this cut the bugs in later stages roughly in half for me.
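
Once the field is closed, downstream code can lean on the type system as well. Here is a sketch of an exhaustive switch over the same three values; the routeByStatus name and the queue names it returns are made up for illustration.

```typescript
type PaymentStatus = "paid" | "pending" | "overdue";

// Hypothetical downstream routing; the returned queue names are illustrative only.
function routeByStatus(status: PaymentStatus): string {
  switch (status) {
    case "paid":
      return "archive";
    case "pending":
      return "await-payment";
    case "overdue":
      return "send-reminder";
    default: {
      // If a value is added to the union but not handled above, this assignment fails to compile.
      const unreachable: never = status;
      throw new Error(`Unhandled status: ${String(unreachable)}`);
    }
  }
}
```

Adding a fourth value to the enum then becomes a compile error at every branch that forgot it, rather than a silent fall-through.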

Two layers — keep shape and meaning separate

The schema guarantees shape and no further. Pull meaning out into its own layer. I use Zod not as a "re-check of the schema" but as the guardian of business rules.

// invoice-validate.ts
import { z } from "zod";
 
const InvoiceZ = z.object({
  invoiceNumber: z.string().min(1),
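  issueDate: z.string().refine(
    (s) => {
      if (!/^\d{4}-\d{2}-\d{2}$/.test(s)) return false;
      // Date.parse alone can roll 2026-02-30 over into March in some engines, so round-trip the value
      const d = new Date(`${s}T00:00:00Z`);
      return !Number.isNaN(d.getTime()) && d.toISOString().slice(0, 10) === s;
    },
    { message: "Not a real YYYY-MM-DD date" },
  ),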
  vendor: z.object({ name: z.string().min(1), taxId: z.string().optional() }),
  lineItems: z.array(z.object({
    description: z.string(),
    quantity: z.number().positive(),
    unitPrice: z.number().min(0),
    amount: z.number().min(0),
  })).min(1),
  grandTotal: z.number().min(0),
  paymentStatus: z.enum(["paid", "pending", "overdue"]),
  currency: z.enum(["JPY", "USD", "EUR"]),
});
 
export type Invoice = z.infer<typeof InvoiceZ>;
 
// Verify cross-field consistency that the schema cannot express
export function checkConsistency(inv: Invoice): string[] {
  const issues: string[] = [];
 
  const lineSum = inv.lineItems.reduce((s, it) => s + it.amount, 0);
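  // Assumes line amounts and grandTotal share a tax basis. The schema descriptions read as pre-tax
  // amounts and a tax-inclusive total; in that case extract a tax field and compare lineSum + tax instead
  // JPY has a minimum unit of 1 yen, so rounding noise is rare. For multi-currency, keep per-currency tolerances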
  const tolerance = inv.currency === "JPY" ? 1 : 0.01;
  if (Math.abs(lineSum - inv.grandTotal) > tolerance) {
    issues.push(`Line sum ${lineSum} does not match grand total ${inv.grandTotal}`);
  }
 
  for (const it of inv.lineItems) {
    const expected = it.quantity * it.unitPrice;
    if (Math.abs(expected - it.amount) > tolerance) {
      issues.push(`Subtotal for "${it.description}" does not equal quantity x unitPrice`);
    }
  }
  return issues;
}
 
export function validateInvoice(raw: unknown):
  | { ok: true; data: Invoice; warnings: string[] }
  | { ok: false; error: string } {
  const parsed = InvoiceZ.safeParse(raw);
  if (!parsed.success) {
    return { ok: false, error: parsed.error.issues.map((i) => i.message).join("; ") };
  }
  return { ok: true, data: parsed.data, warnings: checkConsistency(parsed.data) };
}

The key is not treating a consistency violation as an immediate error. A schema violation (broken shape) is something to retry; a consistency violation (totals don't add up) is something to route to a human review queue. The "a few mismatches each month" from the opening lands exactly in warnings, so I can divert it to human eyes without stopping the automation.

The broken records that were leaking downstream did so precisely because this layer didn't exist. Shape validation alone was being read as "green."
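
The retry-versus-review split described above can be written as one small routing function. This is a sketch under assumptions: the Validation type mirrors the result shape of validateInvoice, made generic, while Routed and routeValidation are names invented for this example.

```typescript
// Mirrors the result shape returned by validateInvoice, made generic.
type Validation<T> =
  | { ok: true; data: T; warnings: string[] }
  | { ok: false; error: string };

// Hypothetical routing decision for one validated record.
type Routed<T> =
  | { action: "accept"; data: T }
  | { action: "review"; data: T; warnings: string[] }
  | { action: "retry"; error: string };

// Shape failures are retried; consistency warnings go to a human queue.
function routeValidation<T>(v: Validation<T>): Routed<T> {
  if (!v.ok) return { action: "retry", error: v.error };
  if (v.warnings.length > 0) {
    return { action: "review", data: v.data, warnings: v.warnings };
  }
  return { action: "accept", data: v.data };
}
```

Keeping the decision in one place makes it hard for a record with warnings to slip through as a plain success.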

Telling failures apart by finishReason

The nastiest failure in structured output is JSON that cuts off partway. When a long array hits maxOutputTokens, the response is truncated mid-JSON. If you don't separate causes by the candidate's finishReason, a retry just dies at the same spot.

// safe-generate.ts
import { GoogleGenAI } from "@google/genai";
 
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
 
type GenResult<T> =
  | { ok: true; data: T }
  | { ok: false; reason: "truncated" | "blocked" | "parse" | "empty"; detail: string };
 
export async function safeGenerate<T>(
  prompt: string,
  schema: object,
  validate: (raw: unknown) => T,
): Promise<GenResult<T>> {
  const res = await ai.models.generateContent({
    model: "gemini-3.5-flash",
    contents: prompt,
    config: {
      responseMimeType: "application/json",
      responseSchema: schema,
      maxOutputTokens: 8192,
    },
  });
 
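  // finishReason is a string enum in the SDK; widening it to string keeps the literal comparisons below type-safe
  const finish: string | undefined = res.candidates?.[0]?.finishReason;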
  // MAX_TOKENS signals "needs more budget." Retrying with the same budget is wasted
  if (finish === "MAX_TOKENS") {
    return { ok: false, reason: "truncated", detail: "Reached maxOutputTokens" };
  }
  if (finish === "SAFETY" || finish === "PROHIBITED_CONTENT") {
    return { ok: false, reason: "blocked", detail: `Safety filter: ${finish}` };
  }
 
  const text = res.text;
  if (!text) return { ok: false, reason: "empty", detail: "Empty body" };
 
  try {
    return { ok: true, data: validate(JSON.parse(text)) };
  } catch (e) {
    return { ok: false, reason: "parse", detail: e instanceof Error ? e.message : String(e) };
  }
}

Swallow truncated and parse together as one "JSON error" and you'll apply the same recovery to causes that are nothing alike. truncated is about widening the budget or splitting the batch; parse is usually fixed by simply throwing the request again. Just splitting these two in the logs noticeably shortened triage time.
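
One way to make "different causes, different recovery" explicit is a small planning function that turns a failure reason into a next step. The sketch below is an assumption-laden illustration: planRecovery, the Recovery type, and the budgetCap parameter are invented here, and doubling the budget is just one possible policy.

```typescript
type FailureReason = "truncated" | "blocked" | "parse" | "empty";

type Recovery =
  | { kind: "retry-same" }
  | { kind: "raise-budget"; maxOutputTokens: number }
  | { kind: "split-input" }
  | { kind: "human-review" };

// budgetCap is a ceiling chosen by the caller, not an API constant.
function planRecovery(reason: FailureReason, currentBudget: number, budgetCap: number): Recovery {
  switch (reason) {
    case "truncated": {
      // Widen the budget while there is room; past the cap, split the input instead.
      const doubled = currentBudget * 2;
      return doubled <= budgetCap
        ? { kind: "raise-budget", maxOutputTokens: doubled }
        : { kind: "split-input" };
    }
    case "blocked":
      return { kind: "human-review" };
    case "parse":
    case "empty":
      return { kind: "retry-same" };
  }
}
```

Logging the chosen kind next to the reason makes it easy to see later which recoveries actually worked.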

Bounded retries and concurrency control

Finally, the pipeline that processes many documents at once. It changes recovery strategy by reason and caps concurrency so it doesn't trip rate limits.

// pipeline.ts
import { safeGenerate } from "./safe-generate";
 
async function processOne<T>(
  doc: { id: string; content: string },
  schema: object,
  validate: (raw: unknown) => T,
  maxRetries = 2,
): Promise<{ id: string; status: "ok" | "review" | "failed"; data?: T; note?: string }> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const r = await safeGenerate(`Extract from:\n\n${doc.content}`, schema, validate);
      if (r.ok) return { id: doc.id, status: "ok", data: r.data };

      // truncated and blocked won't fix on retry. Send to a human immediately
      if (r.reason === "truncated" || r.reason === "blocked") {
        return { id: doc.id, status: "review", note: r.detail };
      }
    } catch {
      // Thrown API errors (429s, network failures) are treated as retryable
    }
    // parse / empty / thrown errors: retry with exponential backoff, skipping the wait after the final attempt
    if (attempt < maxRetries) {
      await new Promise((res) => setTimeout(res, 800 * 2 ** attempt));
    }
  }
  return { id: doc.id, status: "failed", note: "Retry limit reached" };
}
 
// A simple worker pool that caps concurrency
export async function runPipeline<T>(
  docs: { id: string; content: string }[],
  schema: object,
  validate: (raw: unknown) => T,
  concurrency = 3,
) {
  const results: Awaited<ReturnType<typeof processOne<T>>>[] = [];
  const queue = [...docs];
  const workers = Array.from({ length: concurrency }, async () => {
    let doc;
    while ((doc = queue.shift())) {
      results.push(await processOne(doc, schema, validate));
    }
  });
  await Promise.all(workers);
  return results;
}

The crux is the three-state status: ok / review / failed. With a binary (success/failure), records that "a person could rescue but automation can't" get buried in the failure pile. Breaking out review keeps the automation rate up while leaving only the genuinely human-needed cases in your hands.

I default concurrency to 3 as a conservative value that avoids tripping free and low-tier rate limits. Flash models are fast, so raise it if you have headroom, but I find it safer to start low and measure the range where 429s don't appear.
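
With several workers running, a fixed backoff schedule can make them retry at the same moment and hit the limit together. One common variation, which is not what the pipeline above does, is to randomize the wait. The sketch below assumes a hypothetical backoffDelay helper and an arbitrary 30-second cap.

```typescript
// Delay before retry number `attempt` (0-based): exponential growth from baseMs,
// capped at capMs, then scaled by a random factor in [0, 1) so parallel workers
// do not retry in lockstep. `random` is injectable so the result can be tested.
function backoffDelay(
  attempt: number,
  baseMs = 800,
  capMs = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

In the retry loop this would replace the fixed 800 * 2 ** attempt expression; whether the extra spread is worth it depends on how often 429s actually show up in your logs.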

The numbers I watch once it's live

After the design firmed up, what I measure shifted slightly away from "success rate." These days, in my own indie pipeline, I keep an eye on three figures:

The share of records carrying consistency warnings — shape correct, meaning suspect. When this rises, I suspect the validation rules, not the extraction.
The truncated rate — a sign the schema is too heavy or an array too long. It's what tells me whether to tune maxOutputTokens or split the call.
The backlog in the review queue — the one figure that makes human load visible. Let it clog and the automation rate stops meaning much.

Those point at which part of the pipeline to fix far better than a bare failure rate does.
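
These three figures are cheap to compute from per-record logs. The following is a sketch assuming a hypothetical RecordOutcome log entry; the field names are invented for this example and would need mapping onto whatever the pipeline actually records.

```typescript
// Hypothetical per-record log entry; field names are assumptions for this sketch.
type RecordOutcome = {
  status: "ok" | "review" | "failed";
  warningCount: number; // consistency warnings attached to the record
  truncated: boolean; // true when the call ended with MAX_TOKENS
};

// Computes the three figures over one batch of outcomes.
function summarize(outcomes: RecordOutcome[]) {
  const total = outcomes.length;
  const rate = (n: number) => (total === 0 ? 0 : n / total);
  return {
    warningRate: rate(outcomes.filter((o) => o.warningCount > 0).length),
    truncatedRate: rate(outcomes.filter((o) => o.truncated).length),
    reviewBacklog: outcomes.filter((o) => o.status === "review").length,
  };
}
```

The backlog here counts only the current batch; a real queue depth would come from wherever review items are stored.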

With the default model moving up to 3.5 Flash, the same schema saw truncated tick up a little in places. Output got more verbose, so arrays hit the budget more easily. Whether to bump maxOutputTokens a notch or split arrays into a separate call is something I decide by watching the truncated rate.

Treat structured output as the layer that guarantees shape, stand up a separate guardian for meaning, and recover failures by cause. Keeping those three apart is enough to make a production data pipeline a great deal quieter. I hope it gives a foothold to anyone building similar invoice or inquiry automation.
