API / SDK · 2026-06-21 · Advanced

Should You Move Your Agent Loop to Gemini's Managed Agents? Three Questions That Decide What Migrates

With Gemini API's Managed Agents in public preview, deciding between a self-hosted agent loop and a Google-hosted sandbox is now a real question. Three questions — execution environment, state ownership, and failure recovery — decide what migrates and what stays.


Managed Agents, announced at Google I/O 2026, are now available in public preview on the Gemini API. The pitch is appealing: a single API call spins up an agent inside a Google-hosted, isolated Linux sandbox, where it reasons, calls tools, executes code, and hands you back the result.

I run a fair amount of automation as an indie developer — scheduled blog maintenance, image-asset housekeeping, and similar background jobs — all driven by agent loops I wrote and operate myself. My honest first reaction to the announcement was an even split of hope and suspicion. Hope, because if something else can carry the tedious parts of running a loop, I will gladly let it. Suspicion, because moving automation that already works is one of the more reliable ways to hurt yourself.

So I went through the jobs running on my machines, one by one, asking a single question: could this move to Managed Agents, and should it? The short answer is that not everything made the cut — but the reasoning collapsed neatly into three questions. What follows is that working-through, written down.

Your agent loop is mostly not the loop

The agent loop itself is surprisingly little code. Call the model; if it returns a function call, run the matching function; feed the result back; repeat. The skeleton fits in about thirty lines.

Here is a minimal loop with exactly one tool, a release-notes checker. Gemini calls the tool when it needs to, and the loop ends once a final text report comes back.

from google import genai
from google.genai import types
 
client = genai.Client()  # reads GEMINI_API_KEY from the environment
 
def check_release_notes(product: str) -> dict:
    """Returns the latest release-notes entry (a real version would hit an RSS feed or DB)."""
    return {"product": product, "latest": "1.4.2", "breaking_changes": False}
 
tool = types.Tool(function_declarations=[
    types.FunctionDeclaration(
        name="check_release_notes",
        description="Fetches the latest release-notes entry for a product name",
        parameters=types.Schema(
            type=types.Type.OBJECT,
            properties={"product": types.Schema(type=types.Type.STRING)},
            required=["product"],
        ),
    )
])
 
contents = [types.Content(
    role="user",
    parts=[types.Part(text="Check whether the latest release of dependency foo contains breaking changes, and report in one paragraph")],
)]
 
for _ in range(5):  # hard cap to prevent runaway loops
    response = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=contents,
        config=types.GenerateContentConfig(tools=[tool]),
    )
    if not response.function_calls:
        print(response.text)
        break
    contents.append(response.candidates[0].content)
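    for call in response.function_calls:
        # Only one tool is declared, so no dispatch on call.name is needed here.
        result = check_release_notes(**call.args)
        contents.append(types.Content(
            role="user",
            parts=[types.Part.from_function_response(name=call.name, response=result)],
        ))
else:
    # The for-else branch runs only when the loop never hit `break`.
    print("Stopped: hit the 5-turn cap without a final report")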

I kept the skeleton deliberately bare to make a point: in production, everything that matters accretes around it. Retries with exponential backoff. Logging and persistence of intermediate progress. Keeping the execution environment alive — cron, containers, whatever you use. Credential management. Timeouts and protection against overlapping runs. In my own codebase, the operational layer around the loop is considerably larger than anything related to the agent's actual thinking.
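To make the first item on that list concrete, here is a minimal sketch of retry with exponential backoff and jitter. The helper name with_backoff, its parameters, and the choice of which exceptions count as retryable are assumptions for illustration; none of it comes from the google-genai SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple = (ConnectionError, TimeoutError),  # assumed retryable errors
    sleep: Callable[[float], None] = time.sleep,        # injectable for tests
) -> T:
    """Calls fn, retrying on the listed exceptions with exponential backoff plus jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what happens next
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps several jobs from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

In the loop above, the model call would be wrapped as with_backoff(lambda: client.models.generate_content(...)), with retry_on adjusted to whatever transient errors your client actually raises.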

I covered that production scaffolding in detail in Custom Gemini API Agent Loop Without ADK — A Complete Production Guide to Tool Calling, Memory, and Parallel Execution, but the one-line summary is: the loop is easy, the operations are the product. That is the baseline for everything below.

Which part of that operational layer do Managed Agents actually absorb?

Reading through what Google has published, the scope of Managed Agents looks like this. They provision and tear down the execution environment — the isolated Linux sandbox. They drive the loop itself: reason, execute tools and code, continue, repeat. And they hold the agent's state while it runs. In other words, out of the operational layer I just called "the product," the environment upkeep and the loop orchestration move wholesale to the other side of the API.

What stays on your side is just as clear. Defining the task — what you actually want done. Receiving the result and judging whether it is correct. Handling failure. Watching cost. You can delegate how the agent runs; you cannot delegate why it runs or what happens with what it produces.

Looking at that boundary, what struck me was not the freedom from cron management. It was the fact that verification and failure handling stay with me — because in my experience, the time sink in operating automation has never been keeping environments alive. It is investigating things when they break. That realization is why I stopped asking "is this convenient?" and started asking the three questions below.


Question 1 — Does the execution environment itself carry meaning?

The first question is where the job actually needs to run.

A good number of my jobs read and write files in a local workspace: walking a wallpaper app's image-asset folder to classify new pieces, cloning a repository to push articles, appending run logs to a known location. For these, the fact that they run on infrastructure I control is the point. Moving them into a sandbox means redesigning credential handling and file transfer, and by my estimate the migration cost exceeds the payoff.

The opposite kind of job — pass input in, investigate or transform, return output — runs the same anywhere. Monitoring public release notes. Researching and summarizing information from the web. Transforming data into reports. These fit the Managed Agents sandbox naturally, and for transformations that involve code execution, the isolation is an upgrade in its own right. Running model-generated code directly on your own machine is, frankly, something I have never been fully relaxed about.

Question 2 — Who owns the canonical copy of state?

The second question is about ownership of the agent's state.

Managed Agents support stateful agents, and having the sandbox carry conversational and working context across a long multi-step task is a real advantage. But I am not ready to hand the canonical copy of any state to a service in public preview. Preview APIs change. We just lived through an example: the Interactions API removed its legacy schema, forcing a migration from outputs to steps on short notice.

So my rule ended up simple. If state is ephemeral — meaningful only while the task runs — let the sandbox hold it. If state is an asset — referenced by future runs or by other jobs — the canonical copy lives in my own database or files, and the agent receives only what it needs, per run. Hold that line, and even if the Managed Agents surface shifts underneath you, the blast radius is one in-flight task.
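That rule can be sketched as follows. The table name seen_releases, the helper names, and the run_agent callable (a stand-in for whichever path executes the task) are all hypothetical; the point is only that the canonical row lives in a local database and the agent sees one small slice per run.

```python
import json
import sqlite3
from typing import Callable

def init_store(db: sqlite3.Connection) -> None:
    """Creates the locally owned table that holds the canonical state."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS seen_releases(product TEXT PRIMARY KEY, version TEXT)"
    )

def build_run_input(db: sqlite3.Connection, product: str) -> str:
    """Extracts only the slice of canonical state that this run needs."""
    row = db.execute(
        "SELECT version FROM seen_releases WHERE product = ?", (product,)
    ).fetchone()
    payload = {"product": product, "last_seen_version": row[0] if row else None}
    return json.dumps(payload, sort_keys=True)

def run_and_record(
    db: sqlite3.Connection, product: str, run_agent: Callable[[str], dict]
) -> dict:
    """Runs one task and persists the outcome locally; sandbox state is disposable."""
    result = run_agent(build_run_input(db, product))
    db.execute(
        "INSERT OR REPLACE INTO seen_releases(product, version) VALUES (?, ?)",
        (product, result["latest"]),
    )
    db.commit()
    return result
```

If the managed surface changes or a run is lost, the database still holds everything future runs depend on.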

Question 3 — When it fails, who can pick up the pieces, and at what granularity?

The third question was the decisive one for me: recoverability on failure.

A self-hosted loop fails at a granularity you can see. Which tool call, with which arguments, returned what before things stopped — you can trace as deeply as you bothered to log, and you are free to build resume-from-midpoint logic. During Gemini's major outage earlier this month, my pipelines survived on a staged retreat: retry, fall back, and if all else failed, write a log entry and exit quietly. That was only possible because I owned a hook at every stage of the loop.

Move a job into Managed Agents and your granularity is whatever the API exposes. How much of the agent's internal steps will become observable over time is something I genuinely hope improves, but as a design assumption, "you cannot intervene in intermediate state" is the safe stance today. Which narrows the candidates to jobs that can simply be rerun from scratch — idempotent work. Anything that emits side effects as it goes, where a half-finished run leaves value or risk behind, stays in the self-hosted loop where I hold the recovery tools.
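For contrast, this is roughly what the self-hosted side can do that a managed call cannot: journal every tool call so a rerun can resume from the last completed step. It is a sketch under assumptions. The steps table and the function names are hypothetical, and replaying by step index assumes the rerun asks for the same tool at the same position.

```python
import json
import sqlite3
from typing import Callable

def open_journal(path: str = ":memory:") -> sqlite3.Connection:
    """Opens (or creates) the step journal."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS steps(
               run_id TEXT, step INTEGER, tool TEXT, args TEXT, result TEXT,
               PRIMARY KEY (run_id, step))"""
    )
    return db

def journaled_call(
    db: sqlite3.Connection,
    run_id: str,
    step: int,
    tool: str,
    args: dict,
    fn: Callable[..., dict],
) -> dict:
    """Replays the recorded result if this step already completed; otherwise runs and records it."""
    row = db.execute(
        "SELECT result FROM steps WHERE run_id = ? AND step = ?", (run_id, step)
    ).fetchone()
    if row:
        return json.loads(row[0])  # completed in an earlier attempt
    result = fn(**args)
    db.execute(
        "INSERT INTO steps VALUES (?, ?, ?, ?, ?)",
        (run_id, step, tool, json.dumps(args, sort_keys=True), json.dumps(result)),
    )
    db.commit()
    return result
```

A stricter version would also compare the recorded tool name and arguments before replaying, and fall back to a fresh call when they differ.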

One related note: the more "submit and wait" workloads you adopt — and Managed Agents are exactly that shape — the more your completion-notification design matters. I wrote about replacing polling with event-driven completion handling in Retiring the Midnight Polling Loop — Rebuilding My Gemini Batch Monitoring Around Webhooks, which pairs well with this discussion.

A small habit for keeping preview services next to production

Even for jobs that pass all three questions, I keep one thin layer of protection in place while Managed Agents remain in preview. Nothing elaborate: the Managed Agents call lives inside a small wrapper function with exactly two responsibilities. First, a fallback — if the call fails, the same task runs through the existing self-hosted loop. Second, schema validation on whatever comes back.

This buys two things. When preview-stage breaking changes arrive, the affected surface is one wrapper file. And because the wrapper can route the same task to either implementation, comparing output quality between Managed Agents and my own loop happens in one place. On cost, I also keep preview-period spending tagged separately from everything else, so any surprising billing pattern shows up early.

Implementation — the thin wrapper with fallback and schema validation

I said the wrapper carries exactly two responsibilities; here is what that looks like in code. Call Managed Agents, validate the result with Pydantic, and if either the call or the validation fails, route the same task through the self-hosted loop. That is the whole thing.

The key idea is that the fallback is not there to swallow exceptions — it is there to satisfy the same contract with a different implementation. Make both paths return one shared type, and the caller never has to care which one ran.

import logging

from pydantic import BaseModel

log = logging.getLogger(__name__)  # used by run_task below
 
class ReleaseReport(BaseModel):
    product: str
    summary: str
    has_breaking_change: bool
 
def run_via_managed_agent(task: str) -> ReleaseReport:
    """Runs the task on Managed Agents (public preview), then validates the result."""
    resp = client.agents.run(  # preview-stage: this call name may still change
        model="gemini-3.5-flash",
        instructions=task,
        response_schema=ReleaseReport,
    )
    return ReleaseReport.model_validate_json(resp.output_text)
 
def run_via_self_hosted(task: str) -> ReleaseReport:
    """The fallback that satisfies the same contract via the self-hosted loop."""
    text = run_existing_agent_loop(task)  # the production version of the minimal loop above
    return ReleaseReport.model_validate_json(text)
 
def run_task(task: str) -> tuple[ReleaseReport, str]:
    """Managed Agents first, self-hosted loop if it fails. Returns which path ran."""
    try:
        return run_via_managed_agent(task), "managed"
    except Exception as e:  # cast a wide net during preview — call and validation failures alike
        log.warning("managed agent failed (%s); falling back to self-hosted", e)
        return run_via_self_hosted(task), "self_hosted"

With this shape, a preview-stage breaking change means editing exactly one function, run_via_managed_agent. Neither the callers of run_task nor anything else that consumes a ReleaseReport has to change. Returning "which path ran" alongside the result is deliberate: the cost metering further down records that flag, and any side-by-side quality comparison needs it too.
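For the quality-comparison side, a minimal sketch might look like the following. It assumes both reports have been converted to plain dicts (for example with Pydantic's model_dump()), and the function name compare_reports and its default field list are hypothetical. Only fields with a single right answer are compared; the free-text summary is left for a human to read.

```python
def compare_reports(
    managed: dict,
    self_hosted: dict,
    exact_fields: tuple = ("product", "has_breaking_change"),
) -> dict:
    """Flags exact-match fields on which the two paths disagree."""
    mismatches = {
        field: (managed.get(field), self_hosted.get(field))
        for field in exact_fields
        if managed.get(field) != self_hosted.get(field)
    }
    return {"agree": not mismatches, "mismatches": mismatches}
```

Logging the mismatches per task gives you a short list of cases worth reading in full, rather than a diff of every summary.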

Making "just rerun it" safe with an idempotency key

Question 3 concluded that only idempotent jobs should move. But saying "this is idempotent" in your head is different from building a state where rerunning is actually safe. Because Managed Agents give you no way to intervene in intermediate state on failure, a fallback or a manual retry will quite normally run the same task twice. So I add a thin gatekeeper: derive one key from the task inputs, and if a result for that key already exists, skip the body.

import hashlib, json, sqlite3
 
db = sqlite3.connect("agent_runs.db")
db.execute("CREATE TABLE IF NOT EXISTS runs(key TEXT PRIMARY KEY, result TEXT)")
 
def idempotency_key(task: str, inputs: dict) -> str:
    """Normalizes task and inputs into a key (sort_keys guards against ordering drift)."""
    payload = json.dumps({"task": task, "inputs": inputs}, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
 
def run_idempotent(task: str, inputs: dict) -> ReleaseReport:
    key = idempotency_key(task, inputs)
    row = db.execute("SELECT result FROM runs WHERE key = ?", (key,)).fetchone()
    if row:
        return ReleaseReport.model_validate_json(row[0])  # already done; the body never runs
    report, _ = run_task(f"{task}\ntarget: {json.dumps(inputs, ensure_ascii=False)}")
    db.execute("INSERT OR REPLACE INTO runs(key, result) VALUES (?, ?)",
               (key, report.model_dump_json()))
    db.commit()
    return report

What matters here is how the key is built. Drop sort_keys=True and the same input produces a different key whenever the dict ordering shifts, and your idempotency quietly breaks. I did exactly that once in a separate caching job and spent half a day hunting for why the hit rate was lower than expected. Keep two rules (never mix a timestamp or random value into the input, and normalize the JSON) and reruns fail safe. For jobs that already passed Question 3, this gatekeeper removes most of the remaining worry about a fallback or manual retry running the same task twice. It does not rescue a run that dies halfway with side effects already emitted, which is why those jobs stay in the self-hosted loop.
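The ordering pitfall is easy to reproduce in isolation. The snippet below is only a demonstration; key_for is a hypothetical helper that mirrors the hashing step with normalization switched on or off.

```python
import hashlib
import json

def key_for(inputs: dict, normalize: bool) -> str:
    """Hashes the JSON form of inputs, optionally with sorted keys."""
    payload = json.dumps(inputs, sort_keys=normalize, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

a = {"product": "foo", "channel": "stable"}
b = {"channel": "stable", "product": "foo"}  # same content, different insertion order

print(key_for(a, normalize=True) == key_for(b, normalize=True))    # True
print(key_for(a, normalize=False) == key_for(b, normalize=False))  # False
```

Python dicts preserve insertion order, so two callers that build the same inputs in a different order serialize to different strings unless the keys are sorted.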

Tagging cost per job to watch preview-period billing

In preview, the scariest failure is not a broken feature — it is spend creeping up unnoticed. Because a single Managed Agents call drives several steps behind the scenes, token consumption is harder to read than in a self-hosted loop. So I record a per-task measurement alongside the "which path ran" flag that run_task returns.

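import time
 
db.execute("CREATE TABLE IF NOT EXISTS meter(tag TEXT, impl TEXT, seconds REAL, ts REAL)")
 
def metered(task: str, inputs: dict) -> ReleaseReport:
    """Same gatekeeper as run_idempotent, but keeps run_task's path flag for the meter."""
    key = idempotency_key(task, inputs)
    row = db.execute("SELECT result FROM runs WHERE key = ?", (key,)).fetchone()
    if row:
        return ReleaseReport.model_validate_json(row[0])  # cache hit: nothing ran, nothing to meter
    started = time.monotonic()
    report, impl = run_task(f"{task}\ntarget: {json.dumps(inputs, ensure_ascii=False)}")
    seconds = time.monotonic() - started
    db.execute("INSERT OR REPLACE INTO runs(key, result) VALUES (?, ?)",
               (key, report.model_dump_json()))
    db.execute(
        "INSERT INTO meter(tag, impl, seconds, ts) VALUES (?, ?, ?, ?)",
        (inputs.get("tag", "untagged"), impl, seconds, time.time()),  # impl: "managed" or "self_hosted"
    )
    db.commit()
    return report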

I aggregate this table overnight and check, every day, how a Managed Agents task compares with the self-hosted path. The table records wall-clock time only, so cost itself has to be read from the billing data I keep tagged for the preview period. The guide below is the set of signals I watched when I started running the two in parallel.

What to watch	Why	Warning sign
managed / self time ratio	Measures the fixed cost of sandbox startup	Ratio blows out on short tasks
Calls per task	Indirectly tracks internal step count	Same input spikes on some days
Fallback rate	Gauges preview stability	A sudden rise suggests a spec change

Watching these three separately, you catch the preview-specific "somehow slower, somehow pricier" early. When you evaluate a service whose billing you cannot yet predict right next to production, building this measurement footing first — before chasing the convenience — turns out to be the cheaper path.
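A minimal sketch of the overnight aggregation for two of those signals, the time ratio and the fallback rate. It assumes the meter table has the columns used in the INSERT above and that impl holds either "managed" or "self_hosted" for each run; the function name nightly_summary is hypothetical.

```python
import sqlite3

def nightly_summary(db: sqlite3.Connection, since_ts: float) -> dict:
    """Summarizes meter rows recorded at or after since_ts (a Unix timestamp)."""
    rows = db.execute(
        "SELECT impl, COUNT(*), AVG(seconds) FROM meter WHERE ts >= ? GROUP BY impl",
        (since_ts,),
    ).fetchall()
    stats = {impl: (runs, avg) for impl, runs, avg in rows}
    managed_runs, managed_avg = stats.get("managed", (0, None))
    self_runs, self_avg = stats.get("self_hosted", (0, None))
    total = managed_runs + self_runs
    return {
        # How many times longer the managed path takes on average.
        "time_ratio": managed_avg / self_avg if managed_avg and self_avg else None,
        # Share of runs that ended up on the self-hosted fallback.
        "fallback_rate": self_runs / total if total else None,
    }
```

One caveat: if self-hosted rows are only ever written by the fallback path, their timings include the failed managed attempt, so the ratio will understate the gap unless fallback runs are timed separately.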

Start by picking the one sandbox-shaped job

When the dust settled, my first migration candidate was "monitor public release notes and summarize." It does not depend on my environment (question 1), its state is disposable every run (question 2), and a failed run can simply be repeated (question 3). It passes all three questions cleanly — almost a textbook case.

If you operate your own agent loops or scheduled jobs, the next step I would suggest is unglamorous: write the three answers next to each job on your list. This is not an all-or-nothing migration. Find the one job that clearly belongs in a sandbox, run it in parallel with your existing loop, and let the comparison teach you the rest. Evaluating new infrastructure without breaking working automation is, in the end, mostly about restraint.
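Writing the three answers down can be as simple as a small table in code. The sketch below is one hypothetical way to do it; the job names and the answers recorded for them are illustrative, loosely based on the jobs described earlier, and the field names are my own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    name: str
    runs_anywhere: bool             # Question 1: no dependence on the local environment
    state_disposable: bool          # Question 2: nothing canonical would live in the sandbox
    rerunnable_from_scratch: bool   # Question 3: a failed run can simply be repeated

def sandbox_candidates(jobs: list) -> list:
    """Returns the names of jobs that pass all three questions."""
    return [
        job.name
        for job in jobs
        if job.runs_anywhere and job.state_disposable and job.rerunnable_from_scratch
    ]

JOBS = [
    Job("release-notes-monitor", True, True, True),
    Job("image-asset-classifier", False, True, True),  # walks a local asset folder
    Job("article-push", False, True, False),           # clones and pushes a repository
]

print(sandbox_candidates(JOBS))  # ['release-notes-monitor']
```

Anything that fails even one question stays where it is until the answer changes.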

I hope this helps if you are weighing the same decision about where your automation should live.
