Gemini Lab
API / SDK · 2026-07-04 · Advanced

Catching the Rows That Quietly Failed Overnight: A Per-Row Retry Ledger for the Gemini Batch API

A SUCCEEDED batch job is not the same as all-rows-succeeded. From running nightly batches as a solo developer, here is a per-row result ledger, a transient-vs-permanent failure classifier, selective retries, and a guard against retrying permanent failures forever, with a working SQLite state machine.

Tags: gemini-api, batch-api, retry-design, idempotency, indie-dev, operations


In the morning, the batch job status read SUCCEEDED. I wrote the results back with a clear conscience and moved on to other work for the day.

A few days later, while scanning the classifications, I noticed something. One cluster of reviews had landed in Firestore with an empty category. A few dozen rows. The job as a whole had "succeeded," yet some of the rows inside it had quietly failed.

When you are an indie developer running several apps and four sites, the nightly batch becomes a "start it and go to sleep" tool. In my case, the job classifies the reviews that pile up in App Store Connect and Google Play Console, and I had built the next stage of processing on the assumption that every result would be present by morning. That is exactly why a partial failure like this leaks downstream and contaminates later steps.

This article is about dropping the habit of reading a batch completion as "all rows succeeded," and instead recording results per row in a ledger and picking up only the rows that fell through. It focuses not on the first implementation of nightly processing, but on the question that always follows it: how do you recover the few dozen rows that failed?

"Completed" and "all succeeded" are different

The state of a batch job and the success or failure of each individual request inside it live on different layers. The job can finish cleanly while the output JSONL mixes successful responses and errors line by line.

A single output line usually takes one of the two shapes below. Key names shift a little between SDK versions, so I recommend peeking at your actual output with head before you build against it.

{"key": "review-000512", "response": {"candidates": [ ... ]}}
{"key": "review-000513", "error": {"code": 400, "message": "..."}}

In other words, even when the job is SUCCEEDED, rows carrying an error are perfectly normal. The rows I lost were of two kinds: an aggressive review body caught by the safety filter, and a body made up entirely of emoji that failed schema extraction.
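A minimal sketch of telling the two line shapes apart, assuming the key names shown above (`key`, `response`, `error`). The function name `split_output_line` is hypothetical, and the key names should be confirmed against your own output before relying on it.

```python
import json
from typing import Any


def split_output_line(line: str) -> tuple[str, bool, Any]:
    """Return (key, ok, body) for one line of batch output JSONL.

    Assumes the two shapes shown above: a "response" object on success,
    an "error" object on failure. Anything else is treated as a failure.
    """
    record = json.loads(line)
    key = record["key"]
    if "error" in record:
        return key, False, record["error"]
    if "response" in record:
        return key, True, record["response"]
    # Neither field present: do not silently count this as a success.
    return key, False, {"code": None, "message": "unrecognized line shape"}


lines = [
    '{"key": "review-000512", "response": {"candidates": []}}',
    '{"key": "review-000513", "error": {"code": 400, "message": "bad input"}}',
]
parsed = [split_output_line(line) for line in lines]
```

Treating an unrecognized shape as a failure is a deliberate choice in this sketch: a row that cannot be read should surface in the ledger rather than pass as a success.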

Here is one idea worth holding onto: record the job's outcome and each row's outcome separately. Mix the two layers into "the job succeeded, so insert everything," and a hole will always open up.
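One way to keep the two layers apart is to tally row outcomes on their own and only then compare them with the job state. The sketch below assumes the line shapes shown earlier; `tally_rows` and `safe_to_insert_all` are hypothetical helper names, not part of any SDK.

```python
import json
from collections import Counter


def tally_rows(output_lines: list[str]) -> Counter:
    """Count row outcomes independently of the job state."""
    counts = Counter()
    for line in output_lines:
        if not line.strip():
            continue  # skip blank lines in the output file
        record = json.loads(line)
        ok = "response" in record and "error" not in record
        counts["ok" if ok else "failed"] += 1
    return counts


def safe_to_insert_all(job_state: str, submitted: int, counts: Counter) -> bool:
    """True only if the job finished AND every submitted row succeeded."""
    return job_state == "SUCCEEDED" and counts["ok"] == submitted


job_state = "SUCCEEDED"  # job layer (example value)
output = [
    '{"key": "review-000512", "response": {"candidates": []}}',
    '{"key": "review-000513", "error": {"code": 400, "message": "bad input"}}',
]
counts = tally_rows(output)  # row layer
print(job_state, dict(counts), safe_to_insert_all(job_state, 2, counts))
```

Comparing against the number of submitted rows also catches rows that are missing from the output entirely, not just rows that carry an error.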

Keep a per-row ledger

So we prepare a ledger that holds a state for every request we submit. Each row is keyed by its custom_id (the key field in the output above). I narrowed the states down to four.

| State | Meaning | What to do next |
|---|---|---|
| pending | No result received yet | Include in the next batch |
| succeeded | A valid structured output was obtained | Nothing (finalized) |
| retryable | Transient failure (429 / 503, etc.) | Resubmit up to the attempt cap |
| permanent | Input- or safety-driven; a retry will not fix it | Do not resubmit; route to a human |
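The split between retryable and permanent can be sketched as a small classifier. This is an illustration under assumptions: the retryable set contains only the two codes named in the table, the cap of three attempts is a made-up value, and demoting a row to permanent once it exhausts its attempts is one possible stop condition. `classify_failure` is a hypothetical name.

```python
RETRYABLE_CODES = {429, 503}  # from the table above; extend only after seeing real errors
MAX_ATTEMPTS = 3              # hypothetical cap, tune for your workload


def classify_failure(error_code, attempts: int) -> str:
    """Map one failed row to 'retryable' or 'permanent'.

    `attempts` is the number of submissions already made for this row.
    """
    if error_code in RETRYABLE_CODES and attempts < MAX_ATTEMPTS:
        return "retryable"
    # Input- or safety-driven errors, unknown codes, and rows that have
    # used up their attempts all stop here and go to a human.
    return "permanent"


print(classify_failure(429, 1))  # transient, still under the cap
print(classify_failure(400, 1))  # input-driven, a retry will not help
```

Defaulting unknown codes to permanent errs on the side of human review, which keeps an unexpected error type from being resubmitted night after night.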

I chose SQLite because, for a solo developer's nightly batch, "one portable file that survives a mid-run crash" matters more than anything. The ledger schema and initialization fit in a few lines.

import sqlite3
import time
 
def open_ledger(path: str = "batch_ledger.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS rows (
            key           TEXT PRIMARY KEY,
            payload       TEXT NOT NULL,   -- input request (JSON string)
            status        TEXT NOT NULL DEFAULT 'pending',
            attempts      INTEGER NOT NULL DEFAULT 0,
            last_error    TEXT,
            result        TEXT,            -- extracted result on success (JSON string)
            updated_at    REAL NOT NULL DEFAULT 0
        )
        """
    )
    conn.commit()
    return conn
 
 
def enroll(conn: sqlite3.Connection, key: str, payload: str) -> None:
    """Register as pending on first submission. Do not touch on resubmit."""
    conn.execute(
        "INSERT OR IGNORE INTO rows(key, payload, updated_at) VALUES (?, ?, ?)",
        (key, payload, time.time()),
    )
    conn.commit()

The INSERT OR IGNORE is the crucial part. Resubmitting the same key on night two must not roll a succeeded row back to pending. The ledger exists to protect a success once it is finalized.
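The effect is easy to check in isolation. The sketch below uses a throwaway in-memory database and a trimmed-down copy of the schema (only the columns needed for the demonstration), so the table definition here is a simplification rather than the full ledger.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # throwaway database for the demonstration
conn.execute(
    """
    CREATE TABLE rows (
        key        TEXT PRIMARY KEY,
        payload    TEXT NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',
        updated_at REAL NOT NULL DEFAULT 0
    )
    """
)

ENROLL = "INSERT OR IGNORE INTO rows(key, payload, updated_at) VALUES (?, ?, ?)"
payload = '{"text": "great app"}'

# Night one: enroll the row, then finalize it as succeeded.
conn.execute(ENROLL, ("review-000512", payload, time.time()))
conn.execute("UPDATE rows SET status = 'succeeded' WHERE key = ?", ("review-000512",))

# Night two: the same key is enrolled again; the primary-key conflict
# makes SQLite skip the insert instead of overwriting the row.
conn.execute(ENROLL, ("review-000512", payload, time.time()))
conn.commit()

status = conn.execute(
    "SELECT status FROM rows WHERE key = ?", ("review-000512",)
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM rows").fetchone()[0]
print(status, row_count)
```

Had the statement been INSERT OR REPLACE, the second enrollment would have replaced the row and reset status to its default of pending, which is the rollback the ledger is meant to prevent.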

