Gemini Lab
API / SDK · 2026-06-29 · Advanced

Stop Losing Silently-Failed Jobs in Your Unattended Gemini API Pipeline

When an unattended Gemini API batch drops a single job in silence, you may not notice for days. Here is a minimal dead-letter store and a safe replay flow — with copy-paste code and the operational judgment that makes it work.

Tags: gemini, gemini-api, batch, reliability, automation

One morning I was scanning the logs of a batch job that runs at 2 a.m. and noticed that out of 50 inputs, exactly one had no result. No exception had been raised, the retries had quietly run out, and the job had vanished — no output, no record anywhere. The reason I didn't catch it for three days is that the failure had been handled not as an error, but as nothing.

Running automated jobs for several blogs and wallpaper apps as an indie developer, I've come to think these "silent drops" are the scariest kind of bug. A failure that shows up on a dashboard is fixable. A failure that leaves no trace gives you nothing to even notice. This article walks through how to keep an unattended Gemini API pipeline from discarding a single failed job — a dead-letter store plus a replay flow, written the way I actually run it.

More retries won't stop the silent disappearance

When people think about handling failures, retries come to mind first — but retries only solve half the problem. The other half is where a job goes after the retries are exhausted. Most code, once it exceeds max_attempts, moves on with a return None or a continue. The loop completes normally, the exit code is 0, and the failure is recorded nowhere.

There's a second trap: failures that are pointless to retry at all. A malformed input JSON, a permanent safety-filter block, an INVALID_ARGUMENT: these return the same result no matter how many times you send them. Retrying here burns API spend and time, and the job still disappears in silence once the attempts run out.
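
One way to make that split explicit is a small classifier that runs before any retry decision. This is a minimal sketch under my own assumptions, not part of any SDK: `classify_failure`, `RETRYABLE_STATUS`, and `PERMANENT_STATUS` are hypothetical names, and the status codes in each set are a starting point to tune against your own logs.

```python
from typing import Optional

# Hypothetical split of HTTP status codes; adjust to what your pipeline actually sees.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}  # timeouts, rate limits, server errors
PERMANENT_STATUS = {400, 401, 403, 404}            # INVALID_ARGUMENT surfaces as 400


def classify_failure(status_code: Optional[int],
                     blocked_by_safety: bool = False) -> str:
    """Return 'retryable' or 'permanent' for one failed call."""
    if blocked_by_safety:
        # The same input will be blocked again, so retrying only costs money.
        return "permanent"
    if status_code is None:
        # No HTTP status at all (for example a dropped connection): worth another try.
        return "retryable"
    if status_code in RETRYABLE_STATUS:
        return "retryable"
    # Known permanent codes and anything unrecognised go straight to a human.
    return "permanent"


print(classify_failure(429))   # retryable
print(classify_failure(400))   # permanent
print(classify_failure(None))  # retryable
```

Treating unknown codes as permanent is a deliberate choice in this sketch: an unfamiliar failure gets parked and looked at once, rather than retried on a guess.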

So what you actually need isn't more retries — it's a terminal destination for failures. Things that might recover get tried again after a delay; things that won't recover get parked instead of dropped, so a human can deal with them later. That parking lot is the dead-letter store.
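
In code, a terminal destination means the retry loop has no exit that skips the parking step. The sketch below is illustrative only: `run_job`, `PermanentError`, and the `park` callback are hypothetical names, the callback is a stand-in for whatever dead-letter writer you use, and the backoff values are placeholders.

```python
import time
from typing import Any, Callable, Optional


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def run_job(job_id: str,
            payload: dict,
            call: Callable[[dict], Any],
            park: Callable[[str, dict, str, str, int], None],
            max_attempts: int = 3,
            base_delay: float = 1.0,
            sleep: Callable[[float], None] = time.sleep) -> Optional[Any]:
    """Run one job; every failure path ends in park(), never in a bare return."""
    last_error: Optional[Exception] = None
    attempts = 0
    for attempt in range(1, max_attempts + 1):
        attempts = attempt
        try:
            return call(payload)
        except PermanentError as exc:
            # Retrying cannot help, so stop immediately.
            last_error = exc
            break
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff before the next attempt.
                sleep(base_delay * 2 ** (attempt - 1))
    error_class = ("permanent" if isinstance(last_error, PermanentError)
                   else "retries_exhausted")
    # The one and only exit for a failed job: record it before returning.
    park(job_id, payload, error_class, str(last_error), attempts)
    return None
```

The function still returns None on failure, but by then the job already has a row in the store, so the caller's `continue` no longer erases anything.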

What to store in the dead letter

You don't need special middleware to start a dead-letter store — a single SQLite table is plenty. I confirm the flow on local SQLite first and move it to something like Firestore only when I need to. What matters is not where it lives but that it holds everything required to replay later.

Store the job's unique ID, the input itself (essential for replay), a fingerprint of the input (for deduplication), the model used, the failure class, the last error message, the attempt count, the first and most recent timestamps, and a status.

import sqlite3
import json
import time
import hashlib
 
DLQ_PATH = "dead_letter.db"
 
 
def init_dlq(path: str = DLQ_PATH) -> None:
    """Create the minimal dead-letter table."""
    con = sqlite3.connect(path)
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS dead_letter (
            job_id        TEXT PRIMARY KEY,
            payload_hash  TEXT NOT NULL,
            payload_json  TEXT NOT NULL,
            model         TEXT,
            error_class   TEXT NOT NULL,
            error_message TEXT,
            attempts      INTEGER NOT NULL,
            first_seen    REAL NOT NULL,
            last_seen     REAL NOT NULL,
            status        TEXT NOT NULL DEFAULT 'parked'
        )
        """
    )
    con.commit()
    con.close()
 
 
def payload_fingerprint(payload: dict) -> str:
    """Build a stable fingerprint that identifies the same input."""
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
 
 
def park(job_id: str, payload: dict, model: str,
         error_class: str, error_message: str, attempts: int,
         path: str = DLQ_PATH) -> None:
    """Park a permanently-failed job in the dead letter (update if it exists)."""
    now = time.time()
    con = sqlite3.connect(path)
    try:
        # On a repeat failure only the "latest" fields change;
        # first_seen, the stored payload, and status are left as they were.
        con.execute(
            """
            INSERT INTO dead_letter
                (job_id, payload_hash, payload_json, model,
                 error_class, error_message, attempts, first_seen, last_seen, status)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, 'parked')
            ON CONFLICT(job_id) DO UPDATE SET
                last_seen     = excluded.last_seen,
                attempts      = excluded.attempts,
                error_class   = excluded.error_class,
                error_message = excluded.error_message
            """,
            (job_id, payload_fingerprint(payload),
             json.dumps(payload, ensure_ascii=False), model,
             error_class, error_message, attempts, now, now),
        )
        con.commit()
    finally:
        # Close the connection even if the write itself fails.
        con.close()
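
To see what the fingerprint buys you, here is a quick check. It repeats `payload_fingerprint` from the listing above so that it runs on its own; the sample payloads are made up for illustration.

```python
import hashlib
import json


def payload_fingerprint(payload: dict) -> str:
    """Same function as in the listing above, repeated so this runs standalone."""
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical payloads: a and b differ only in key order, c differs in a value.
a = payload_fingerprint({"prompt": "translate", "lang": "ja"})
b = payload_fingerprint({"lang": "ja", "prompt": "translate"})
c = payload_fingerprint({"lang": "en", "prompt": "translate"})

print(a == b)  # True: key order does not change the fingerprint
print(a == c)  # False: a different value gives a different fingerprint
print(len(a))  # 16
```

Because `sort_keys=True` canonicalises the JSON before hashing, two jobs built from the same input in a different key order still collapse to one fingerprint.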

Making job_id the primary key and using ON CONFLICT ... DO UPDATE means that when the same job fails again, you update last_seen and the attempt count instead of adding a row. Without this, one recurring permanent failure can fill the dead letter with hundreds of rows and bury the breakdown you actually want to read.
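
The effect is easy to check in an in-memory database. This is a reduced sketch, not the full store: `park_min` and the trimmed-down table are assumptions made to keep the example short, and the job IDs are invented.

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE dead_letter (
        job_id      TEXT PRIMARY KEY,
        error_class TEXT NOT NULL,
        attempts    INTEGER NOT NULL,
        first_seen  REAL NOT NULL,
        last_seen   REAL NOT NULL
    )
    """
)


def park_min(job_id: str, error_class: str, attempts: int) -> None:
    """Reduced, hypothetical version of park() with the same upsert shape."""
    now = time.time()
    con.execute(
        """
        INSERT INTO dead_letter (job_id, error_class, attempts, first_seen, last_seen)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(job_id) DO UPDATE SET
            last_seen   = excluded.last_seen,
            attempts    = excluded.attempts,
            error_class = excluded.error_class
        """,
        (job_id, error_class, attempts, now, now),
    )
    con.commit()


park_min("job-17", "INVALID_ARGUMENT", 1)
park_min("job-17", "INVALID_ARGUMENT", 1)  # the same job fails again the next night
park_min("job-42", "retries_exhausted", 3)

breakdown = con.execute(
    "SELECT error_class, COUNT(*) FROM dead_letter "
    "GROUP BY error_class ORDER BY error_class"
).fetchall()
print(breakdown)  # [('INVALID_ARGUMENT', 1), ('retries_exhausted', 1)]
```

Three calls leave two rows, so the per-class count reflects distinct jobs rather than how many nights each one failed.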
