What My Automation Pipelines Did While Gemini Was Down

One morning in June, I opened the logs from the previous night's batch runs and found failures stacked top to bottom. App review analysis, image metadata generation, the groundwork for my daily operations report: jobs that normally finish without a sound had all stopped in exactly the same way. A quick look around confirmed it. Gemini was returning error 1076 and error 1099 across a wide surface, and the press was calling it the largest outage the service had seen.

I will be honest about the first thing I checked. It was not the recovery ETA. It was how my own jobs had failed. The outage itself was never mine to fix — the only thing I actually control is how my systems behave while the API underneath them is unavailable. That behavior is decided long in advance, by design, and an outage is the rare chance to grade it against reality. These are my notes from that grading.

The facts, kept short

From the outside, the visible sequence was this: error 1076 / error 1099 appeared broadly across Gemini surfaces, Google's engineering team applied mitigations, and service recovered in stages. As a user you can never see precisely which internal systems were affected, so everything below sticks to what I could observe from my own machines.

For mornings that smell like an outage, I keep a fixed reading order:

1. My own job logs first (which stage failed, with which error)
2. Google Cloud Service Health and official announcements (scope of impact)
3. Social media (to separate "my bug" from "everyone's outage")

The order matters. External sources tell you which part of the world is broken, but only your own logs tell you what you left unfinished.

During the outage, the pipelines failed exactly the way they were told to

A little context first. As an indie developer I run a number of daily jobs built on the Gemini API — one that organizes wallpaper-app reviews before I wake up, one that generates metadata for store images, one that summarizes the previous day's metrics. They are spread across off-peak hours, late night to early morning, on the assumption that they finish while I sleep.

When I first wired this up, I settled on one rule: retry, then shelve, then log and exit quietly. On error, retry a few times with exponential backoff. If that fails, move the unfinished work into a local pending area, and exit as a planned completion rather than a crash.

Reading the logs from the outage night, every job had given up in precisely that order. Three retries, all failed; unprocessed items shelved as PENDING; exit code 0. All I had to do the next morning was feed the shelved items back into the queue. As for whether re-running work could produce duplicates — that part was already covered by the idempotency scheme I described in my notes on idempotency key design for the Gemini API, and it held up without modification.
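
A minimal sketch of that retry, shelve, log, and exit rule, assuming a JSON file per job as the pending area. The names run_or_shelve, pending_dir, and the PENDING line layout are hypothetical and chosen for illustration; this is not the actual pipeline code.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def run_or_shelve(
    items: list[str],
    process: Callable[[str], None],
    pending_dir: Path,
    job_name: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process items in order; if one keeps failing, shelve the rest and return 0."""
    for index, item in enumerate(items):
        last_error: Optional[Exception] = None
        for attempt in range(max_attempts):
            try:
                process(item)
                break  # this item is done, move on to the next one
            except Exception as error:  # a real job would retry only transient errors
                last_error = error
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)  # exponential backoff
        else:
            # Every attempt failed: stop pushing and shelve what is left.
            pending_dir.mkdir(parents=True, exist_ok=True)
            shelf = pending_dir / f"{job_name}.json"
            shelf.write_text(json.dumps(items[index:]))
            print(
                f"PENDING job={job_name} shelved={len(items) - index} "
                f"shelf={shelf} error={type(last_error).__name__}"
            )
            return 0  # planned completion, not a crash
    return 0
```

A job script would end with something like sys.exit(run_or_shelve(...)), and the next morning's re-run reads the shelved JSON file back into the queue.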

Three design decisions the outage graded for me

Retry limits are about when I am allowed to give up, not when the API recovers

Retries sound like a mechanism for waiting out trouble. Against a genuine large-scale outage, they are nearly useless — something that has been down for half an hour across regions will still be down on your third attempt.

I keep retries anyway, for a different reason: they let the code distinguish a transient wobble (a timeout, a stray 5xx) from a real outage. If a few attempts fix it, it was a wobble. If they do not, it is an outage, and the correct move is to stop pushing and start shelving. Retry, in my view, is less a recovery tool than a way to let the code decide when it is allowed to give up.

The last rung of degradation is "do nothing today"

In an earlier piece on building a circuit breaker and graceful degradation around the Gemini API, I described four levels of fallback: full response, simplified response, cache, and stop. What this outage drove home is how much it pays to make that final "stop" a legitimate state instead of a shameful one.

If review analysis skips a day, no user is harmed. Tomorrow's run processes two days and catches up. Once "we do nothing today" is a first-class outcome, nobody has to stay awake babysitting an outage. Leave it ambiguous, and you get the worst morning instead: half-succeeded jobs tangled with half-failed ones, and recovery becomes the heaviest task of the day.
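
To make the idea concrete, here is a rough sketch of a four-rung ladder in which "stop" is an ordinary return value instead of an exception. The Outcome enum, the degrade function, and the three callables are hypothetical names for illustration; the earlier piece may structure this differently.

```python
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    FULL = "full"
    SIMPLIFIED = "simplified"
    CACHED = "cached"
    STOPPED = "stopped"  # "do nothing today", a legitimate result


def degrade(
    full: Callable[[], str],
    simplified: Callable[[], str],
    cached: Callable[[], Optional[str]],
) -> tuple[Outcome, Optional[str]]:
    """Walk the ladder top to bottom and report which rung answered."""
    for outcome, step in ((Outcome.FULL, full), (Outcome.SIMPLIFIED, simplified)):
        try:
            return outcome, step()
        except Exception:
            continue  # this rung failed, try the next one down
    value = cached()
    if value is not None:
        return Outcome.CACHED, value
    return Outcome.STOPPED, None  # nothing raised: stopping is a normal return
```

Because STOPPED comes back like any other outcome, the caller can log it and exit 0 without a special error path.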

Logs are letters to tomorrow's me

What helped most on the morning after was the logging. Which job, how far it got, which error stopped it, what was shelved where. With those four facts in every line, the re-run list after recovery came out of a single grep.

When I write log lines I picture the reader as tomorrow-morning me, not currently-panicking me. Instead of dumping raw error messages, each line carries what the next decision needs: job name, stage reached, shelf location, and the arguments required to re-run. Do that, and incident response is half finished before the incident.
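
One way to sketch such a line, assuming a simple key=value layout with a PENDING prefix. The field names and both helper functions are hypothetical, and rerun_commands is only a Python stand-in for the grep step.

```python
import shlex


def format_log_line(
    job: str, stage: str, error: str, shelf: str, rerun_args: list[str]
) -> str:
    """Build one key=value line carrying what the morning after needs."""
    fields = {
        "job": job,
        "stage": stage,
        "error": error,
        "shelf": shelf,
        "rerun": shlex.join(rerun_args),  # arguments needed to re-run the job
    }
    body = " ".join(f"{key}={shlex.quote(value)}" for key, value in fields.items())
    return "PENDING " + body


def rerun_commands(log_text: str) -> list[list[str]]:
    """Collect the re-run arguments from every PENDING line in a log."""
    commands = []
    for line in log_text.splitlines():
        if not line.startswith("PENDING "):
            continue  # ignore ordinary log lines
        tokens = shlex.split(line)[1:]  # drop the PENDING marker
        fields = dict(token.split("=", 1) for token in tokens)
        commands.append(shlex.split(fields["rerun"]))
    return commands
```

Quoting each value with shlex keeps the line readable by eye and still parseable when an error message contains spaces.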

The one change I made afterward

The system mostly behaved as designed, but rereading the logs left me wanting to fix a single thing: every job had retried and given up individually. With five jobs in one night, that is five times three — fifteen wasted attempts against an API that was plainly down. An outage is not a job-level problem; it is a pipeline-level problem, and one decision per night is enough.

So I now put one lightweight health check at the very front of the nightly batch group.

# Lightweight health check, run once at the front of the nightly batch group.
# If this fails, skip the whole group and let tomorrow's run catch up.
import sys
import time
 
from google import genai
 
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
 
# Error markers worth retrying (transient wobbles)
RETRYABLE = ("500", "502", "503", "504", "timeout", "deadline", "unavailable")
 
 
def preflight(max_retries: int = 2) -> bool:
    for attempt in range(max_retries + 1):
        try:
            client.models.generate_content(
                model="gemini-3.5-flash",
                contents="ping",
            )
            return True  # one success is all we need
        except Exception as e:
            message = str(e).lower()
            if not any(key in message for key in RETRYABLE):
                raise  # auth/config mistakes won't heal with retries, so surface them now
            if attempt < max_retries:
                time.sleep(2 ** attempt)  # back off 1s, then 2s; no sleep after the final attempt
    return False
 
 
if not preflight():
    print("SKIP: Gemini API is unstable — skipping tonight's batch group")
    sys.exit(0)  # planned skip (0), not a failure (1)

On a healthy night the script prints nothing and the batches proceed. During an outage it prints a single SKIP line and exits with code 0. The zero is deliberate: it keeps the scheduler's alerting quiet. What I want from an outage night is not a notification that things stopped — it is the silence of things stopping exactly as designed.

The ping costs a handful of tokens, which rounds to nothing. In exchange, the wasted retries disappear and so does the pile of half-failed logs. Even at indie scale, that trade pays for itself immediately.

Your next step

If you operate scheduled jobs that depend on an external API, add one branch — just one — that treats "do nothing today" as a legitimate way to finish. A clean path for giving up will do more for your outage nights than any increase in retry counts.

Outages are a nuisance, but they are also a free audit of your design. If you run API-dependent automation of your own, I hope these notes are useful the next time the status page turns red.