Somewhere past twenty scheduled jobs, I noticed that the "entry point" of my own code had quietly split into four. Article prep called generateContent directly; the nightly bulk work went through Batch; the App Store review summaries ran on a homegrown agent loop; the image work lived in yet another helper. Each was the shortest path at the time. But when I returned to a job after six months, the first thing I had to do was remember which entry point it even used.
On June 30 the Interactions API became generally available, and the primary entry point for Gemini's models and agents settled here. Managed Agents, background execution, and Gemini Omni all sit behind the same door. This reads less like a flashy feature and more like the quiet kind of update that pays off slowly over a long time. A single call surface means the version of me six months from now no longer has to remember which path a call goes through.
This article is not about writing one fresh script. It is a migration design for folding already-running, scattered call sites into one front door — without stopping them, without breaking them. It covers a working adapter layer, the order of migration, and the accidents that happen precisely because you are mid-migration.
What scattered entry points actually cost
More call sites is not painful at first. It hurts six months later, and it surfaces in three shapes.
The first is duplicated instrumentation. Token accounting, retry-on-failure, timeout handling — you end up writing each one slightly differently per entry point. In my case, the retry ceiling was three on the generateContent path and, on the Batch path, unset and therefore unbounded — a mismatch I found only later. It is the classic way a cost anomaly goes unnoticed for too long.
The second is that a model swap never ends in one place. When gemini-flash-latest began pointing at 3.5 Flash, I had to change four entry points separately. Every time I fixed one, I doubted whether I had really fixed them all. That is not a problem of count; it is a problem of a blast radius you cannot see.
The third is that you cannot easily adopt a new operating mode. Even when you want background execution — submit, then receive when done — scattered entry points make it hard to tell which path to rewrite first.
The essential benefit of consolidation is that all three become "fix it in one place." The Interactions API can be that one place, but you do not need to move everything at once. Slipping a thin layer in between was the safest approach I found.
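Before folding anything together, it can help to make that blast radius visible. The following is a minimal sketch, not part of my original setup: it walks a directory and lists every line that hard-codes a Gemini model name. The function name find_model_literals, the regex, and the assumption that every job is a .py file are illustrative choices, so adjust them to your own layout.

```python
import re
from pathlib import Path

# Assumption: model names appear as quoted string literals such as
# "gemini-flash-latest". Names built at runtime will not be found.
MODEL_PATTERN = re.compile(r"""["'](gemini-[\w.\-]+)["']""")


def find_model_literals(root: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, model name) for each hard-coded model."""
    base = Path(root)
    hits = []
    for path in sorted(base.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in MODEL_PATTERN.finditer(line):
                rel = path.relative_to(base).as_posix()
                hits.append((rel, lineno, match.group(1)))
    return hits
```

Calling find_model_literals on the directory that holds your jobs gives a list you can check off one by one. Once the adapter is in place, the only hit left should be the adapter's own model table.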
Put the front door in your adapter layer, not the API
Here is the judgment I most want to convey. Do not place the consolidated front door on the Interactions API itself. Place it on a thin adapter layer that you own.
The reason is simple: the API's details will keep changing. The legacy outputs schema was removed on June 6, and other schemas and arguments will move on their own deprecation deadlines. If your application code depends on those details directly, every such change means touching every job. With one adapter in between, only the inside of the adapter changes.
What the adapter offers is exactly one entry point. You hand it "what you want done," and a "result" comes back. Inside, it calls the Interactions API.
# llm_gateway.py: the single front door you own
import os
import time
import uuid
import logging
from dataclasses import dataclass, field
from typing import Any

from google import genai  # the real call is confined to the inside of this module only

log = logging.getLogger("llm_gateway")
_client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Centralize model selection here. A swap is fixed in this one place.
MODEL_BY_TIER = {
    "fast": "gemini-flash-latest",  # prep, classification, light passes
    "deep": "gemini-3-pro",         # the real work that needs reasoning
}


@dataclass
class Request:
    task: str                 # what you want done (the prompt body)
    tier: str = "fast"        # fast / deep
    idempotency_key: str = field(default_factory=lambda: uuid.uuid4().hex)
    background: bool = False  # submit long work and receive later
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Result:
    text: str
    model: str
    usage: dict[str, int]
    idempotency_key: str


def run(req: Request, *, max_retries: int = 3) -> Result:
    """The single entry point every job passes through.

    Instrumentation, retries, and model selection all converge here.
    """
    model = MODEL_BY_TIER[req.tier]
    started = time.monotonic()
    last_err: Exception | None = None
    for attempt in range(1, max_retries + 1):
        try:
            # The one and only API-dependent point. Even if details shift,
            # nothing leaks outside this function.
            resp = _client.interactions.create(
                model=model,
                input=req.task,
                # A re-send with the same key prevents duplicate billing and execution
                idempotency_key=req.idempotency_key,
                background=req.background,
            )
            usage = {
                "input": resp.usage.input_tokens,
                "output": resp.usage.output_tokens,
            }
            _record(req, model, usage, time.monotonic() - started, attempt)
            return Result(
                text=resp.output_text,
                model=model,
                usage=usage,
                idempotency_key=req.idempotency_key,
            )
        except Exception as err:  # narrow the type in real use
            last_err = err
            if attempt == max_retries:
                # Out of attempts: do not sleep before giving up.
                log.warning("run failed (attempt %d/%d): %s; giving up",
                            attempt, max_retries, err)
                break
            wait = min(2 ** attempt, 30)
            log.warning("run failed (attempt %d/%d): %s; retry in %ss",
                        attempt, max_retries, err, wait)
            time.sleep(wait)
    _record(req, model, {}, time.monotonic() - started, max_retries, failed=True)
    raise RuntimeError("llm_gateway.run exhausted retries") from last_err


def _record(req, model, usage, elapsed, attempts, *, failed=False):
    # Instrumentation in one place too. Cost rollups and latency monitoring
    # are just a matter of reading this log line.
    log.info("llm_call key=%s model=%s tier=%s in=%s out=%s elapsed=%.2f attempts=%d failed=%s job=%s",
             req.idempotency_key, model, req.tier,
             usage.get("input"), usage.get("output"),
             elapsed, attempts, failed, req.metadata.get("job", "-"))

Note that the argument names passed to _client.interactions.create(...), and the response fields read from its result, are the one spot to confirm against the docs at the time you adopt this. GA pushes arguments toward stability, but keeping this call out of the rest of your app is your insurance against whatever still shifts. Code outside the adapter knows only run(Request(...)).
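To make "knows only run(Request(...))" concrete, here is a minimal sketch of a call site together with a test double. It is self-contained, so it redefines trimmed stand-ins for Request and Result instead of importing llm_gateway. The job summarize_reviews, the fake_run function, and the choice to pass run in as a parameter are illustrative assumptions, not code from my actual jobs.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable


# Trimmed stand-ins mirroring the gateway's Request/Result. Assumption: a real
# job would import these from llm_gateway instead of redefining them.
@dataclass
class Request:
    task: str
    tier: str = "fast"
    idempotency_key: str = field(default_factory=lambda: uuid.uuid4().hex)
    background: bool = False
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Result:
    text: str
    model: str
    usage: dict[str, int]
    idempotency_key: str


Gateway = Callable[[Request], Result]


def summarize_reviews(reviews: list[str], run: Gateway) -> str:
    """Hypothetical job: it builds a Request and reads Result.text, nothing else."""
    bullet_list = "\n".join(f"- {review}" for review in reviews)
    req = Request(
        task="Summarize these App Store reviews:\n" + bullet_list,
        tier="fast",
        metadata={"job": "review_summary"},
    )
    return run(req).text


def fake_run(req: Request) -> Result:
    """Test double: no network call, reports how many review lines it received."""
    count = sum(1 for line in req.task.splitlines() if line.startswith("- "))
    return Result(
        text=f"{count} reviews summarized",
        model="fake",
        usage={"input": 0, "output": 0},
        idempotency_key=req.idempotency_key,
    )


print(summarize_reviews(["Great app", "Crashes on launch"], fake_run))
```

In production the same job would receive llm_gateway.run, whose only required argument is the Request. The job never sees the SDK, so an argument rename inside the adapter cannot reach it.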
Once this adapter is in place, each of the three pains above shrinks to a single spot. Instrumentation is the one _record. Model swaps are the one MODEL_BY_TIER. The retry ceiling is the one max_retries argument on run, and opting into background execution is the one background field on Request. None of them requires hunting around anymore.
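Because every call emits the same llm_call line, a cost rollup can be a small parser. The sketch below assumes the exact format string used in _record and that the log lines have already been collected into a list of strings. The function name rollup_by_job and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Mirrors the format string in _record. If that format changes, change this too.
LINE = re.compile(
    r"llm_call key=(?P<key>\S+) model=(?P<model>\S+) tier=(?P<tier>\S+) "
    r"in=(?P<inp>\S+) out=(?P<out>\S+) elapsed=(?P<elapsed>[\d.]+) "
    r"attempts=(?P<attempts>\d+) failed=(?P<failed>True|False) job=(?P<job>.*)$"
)


def rollup_by_job(lines):
    """Sum calls, failures, and tokens per job from llm_call log lines."""
    totals = defaultdict(lambda: {"calls": 0, "failed": 0, "input": 0, "output": 0})
    for line in lines:
        m = LINE.search(line)
        if m is None:
            continue  # not an llm_call line
        row = totals[m["job"].strip()]
        row["calls"] += 1
        if m["failed"] == "True":
            row["failed"] += 1
        # Failed calls are logged with in=None out=None, so count digits only.
        if m["inp"].isdigit():
            row["input"] += int(m["inp"])
        if m["out"].isdigit():
            row["output"] += int(m["out"])
    return dict(totals)


sample = [
    "llm_call key=a1 model=gemini-flash-latest tier=fast in=120 out=40 "
    "elapsed=1.52 attempts=1 failed=False job=article_prep",
    "llm_call key=b2 model=gemini-flash-latest tier=fast in=80 out=20 "
    "elapsed=0.91 attempts=2 failed=False job=article_prep",
    "llm_call key=c3 model=gemini-3-pro tier=deep in=None out=None "
    "elapsed=62.00 attempts=3 failed=True job=review_summary",
]
for job, row in rollup_by_job(sample).items():
    print(job, row)
```

Since search is used rather than match, the parser still works when a logging formatter prefixes each line with a timestamp or level.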