API / SDK · 2026-06-16 · Advanced

Before You Let a Managed Agent Ship: Designing Your Own Acceptance Gate

If you let the public-preview Managed Agents generate files unchecked, broken artifacts will flow straight into production. Here is how to build a verification gate that artifacts must pass before you accept them, with runnable Python and a rejection-feedback loop.

The first time the public-preview Managed Agents gave me a cold sweat was when an agent confidently handed back a broken artifact. Having Google's isolated Linux sandbox handle planning, reasoning, code execution, and file operations end to end is genuinely convenient. But the moment you drop that output into a production directory, you have left yourself without a single quality gatekeeper.

As an indie developer running content automation for four sites on my own, that missing gatekeeper was a real danger. I have learned the hard way that if thin output escapes even once, the whole site's standing slowly erodes. So before talking about making agents smarter, I want to share the design that matters more: a layer built specifically to not trust the agent's output.

Why a "smart agent" alone can't ship to production

The public-preview Managed Agents run statefully and will rewrite files for you. What is easy to miss is that the success status an agent returns only means "it self-reported that it finished the task." A file was indeed created inside the container. Whether it meets your requirements is a separate question entirely.

Here is what actually went wrong for me:

- A generated article file was missing one required frontmatter field (the agent reported "done")
- In another task, the artifact reused a paragraph verbatim from a previous job
- An output JSON had a key name that differed slightly from last time, and the downstream build silently returned an empty array

None of these surface as agent-side errors. That is exactly why you have to place deterministic verification code on the receiving side — judgment logic a human wrote, independent of the agent's self-report.
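The third failure above (a quietly renamed key) is the kind of thing a few lines of deterministic code can catch. The following is a minimal sketch, not part of the gate shown later: the function name `key_drift` and the `EXPECTED_KEYS` contract are assumptions made for illustration, and your own artifact contract will differ.

```python
from __future__ import annotations

# Assumed contract for illustration; replace with your own artifact's keys.
EXPECTED_KEYS = frozenset({"title", "slug", "body", "tags"})


def key_drift(data: dict, expected: frozenset[str] = EXPECTED_KEYS) -> list[str]:
    """Return human-readable problems; an empty list means the keys match exactly."""
    present = set(data)
    problems = []
    for key in sorted(expected - present):
        problems.append(f"missing key '{key}'")
    for key in sorted(present - expected):
        # An unexpected key often means a required one was renamed.
        problems.append(f"unexpected key '{key}' (renamed field?)")
    return problems


if __name__ == "__main__":
    drifted = {"title": "t", "slug": "s", "content": "...", "tags": ["a"]}
    for problem in key_drift(drifted):
        print(problem)
```

Reporting unexpected keys alongside missing ones makes a rename visible as a pair, which is easier to act on than an empty array three steps downstream.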

The shape of the acceptance gate

The skeleton is simple. The agent is the "producer"; the gate is "inspection."

1. Run the Managed Agent inside the sandbox and have it generate an artifact
2. Pull the artifact file out of the sandbox, into quarantine/, not accepted/
3. Run it through the acceptance gate (schema, duplicates, required signals, all mechanical)
4. If it passes, move it to accepted/. If not, record a structured rejection reason
5. Return the rejection reason to the agent and have it rewrite the same task (feedback loop)

The key is that step 3 must not be delegated to another LLM. LLM-as-judge is useful as a secondary layer, but placing it at the primary gate just adds one more "clever but capricious inspector." The primary gate is always deterministic code.

A gate you can run as-is

Here is a working gate assuming the artifact is an "article JSON." It checks three things: schema, required signals, and verbatim paragraph duplication.

# acceptance_gate.py — inspect agent artifacts (keep the primary gate deterministic)
from __future__ import annotations
import json, re, sys, hashlib
from dataclasses import dataclass, field
from pathlib import Path
 
REQUIRED_FIELDS = ("title", "slug", "body", "tags")
MIN_BODY_CHARS = 1200          # sanity floor for thin artifacts
MIN_SIGNALS = 3                # minimum practical-utility signals
 
@dataclass
class GateResult:
    accepted: bool
    reasons: list[str] = field(default_factory=list)
 
    def reject(self, msg: str) -> None:
        self.accepted = False
        self.reasons.append(msg)
 
def _paragraphs(text: str) -> list[str]:
    return [p.strip() for p in re.split(r"\n\s*\n", text) if len(p.strip()) >= 80]
 
def _practical_signals(body: str) -> int:
    """Count things that give the reader real value: code, numbers, steps, recommendations."""
    signals = 0
    if (chr(96) * 3) in body:                       signals += 1   # runnable code (code fence)
    if re.search(r"\d+\s?(ms|MB|%|req/|tok)", body): signals += 1   # concrete measurements
    if re.search(r"^\s*\d+\.\s", body, re.M):        signals += 1   # numbered procedure
    if re.search(r"(\bI\b|in production|actually|first-hand)", body): signals += 1  # first-hand context (\bI\b so that "API " does not count)
    return signals
 
def check_artifact(path: Path, corpus_hashes: set[str]) -> GateResult:
    res = GateResult(accepted=True)
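    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, UnicodeDecodeError) as e:
        # Missing or unreadable file: reject instead of crashing the pipeline
        res.reject(f"Cannot read artifact: {e}")
        return res
    except json.JSONDecodeError as e:
        res.reject(f"Cannot parse as JSON: {e}")
        return res
    if not isinstance(data, dict):
        res.reject("Top-level JSON value must be an object")
        return res
 
    # 1) Schema: required fields present
    for key in REQUIRED_FIELDS:
        if not data.get(key):
            res.reject(f"Required field '{key}' is empty or missing")
 
    body = data.get("body")
    if not isinstance(body, str):
        if body:
            res.reject("Field 'body' must be a string")
        body = ""
    # 2) Sanity: reject artifacts that are too thin
    if len(body) < MIN_BODY_CHARS:
        res.reject(f"Body is {len(body)} chars, below the {MIN_BODY_CHARS} floor")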
 
    # 3) Practical signals
    n = _practical_signals(body)
    if n < MIN_SIGNALS:
        res.reject(f"Only {n} practical signals (need at least {MIN_SIGNALS})")
 
    # 4) Verbatim duplication against the existing corpus
    for para in _paragraphs(body):
        h = hashlib.sha256(para.encode("utf-8")).hexdigest()
        if h in corpus_hashes:
            res.reject("Contains a paragraph that matches a previous artifact verbatim")
            break
    return res
 
def load_corpus_hashes(accepted_dir: Path) -> set[str]:
    hashes: set[str] = set()
    for f in accepted_dir.glob("*.json"):
        body = json.loads(f.read_text(encoding="utf-8")).get("body", "")
        for para in _paragraphs(body):
            hashes.add(hashlib.sha256(para.encode("utf-8")).hexdigest())
    return hashes
 
if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: python acceptance_gate.py ARTIFACT.json [ACCEPTED_DIR]")
    artifact = Path(sys.argv[1])
    accepted = Path(sys.argv[2]) if len(sys.argv) > 2 else Path("accepted")
    accepted.mkdir(exist_ok=True)
    result = check_artifact(artifact, load_corpus_hashes(accepted))
    if result.accepted:
        print("OK accepted")
        sys.exit(0)
    print("rejected")
    for r in result.reasons:
        print("  -", r)
    sys.exit(1)

The deliberate choice here is to return rejection as a list of reasons, not just a boolean. Those reasons are fed straight back into the agent's next instruction, so it is worth keeping "why it failed" as text. Tune thresholds like MIN_SIGNALS to your own content bar. I count four facets (measurements, runnable code, numbered steps, first-hand context) and pass anything with three or more.
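Because the reasons are plain text, they are also easy to keep as a structured record, which the skeleton earlier calls for when an artifact fails. Below is a minimal sketch that appends one JSON object per rejection to a log file. The function name `log_rejection`, the field names, and the JSON Lines layout are my assumptions, not something the gate above requires.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def log_rejection(log_path: Path, artifact: str, attempt: int, reasons: list[str]) -> dict:
    """Append one rejection record as a single JSON line and return the record."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),  # UTC timestamp of the rejection
        "artifact": artifact,
        "attempt": attempt,
        "reasons": reasons,
    }
    # Append mode keeps earlier records; one object per line stays grep-friendly.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

A log like this makes it possible to see later which reasons recur, which is useful input when tuning thresholds.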

Pulling artifacts out via quarantine

Wrap the agent run and artifact retrieval thinly. Since Managed Agents is in public preview and the API is still moving, the safe move is to box the agent-call into a single swappable point. Generated output always lands in quarantine/, never accepted/.

# run_and_gate.py — run agent -> quarantine -> gate -> accepted
import shutil
from pathlib import Path
from acceptance_gate import check_artifact, load_corpus_hashes
 
QUARANTINE = Path("quarantine")
ACCEPTED = Path("accepted")
 
def run_managed_agent(task: str) -> Path:
    """
    Start a Managed Agent and pull its artifact into quarantine.
    The public-preview API may change, so keep ONLY this function swappable.
    In production, box the agents.create -> run -> list_files -> download
    flow inside here and always return a Path under quarantine.
    """
    QUARANTINE.mkdir(exist_ok=True)
    out = QUARANTINE / "artifact.json"
    # Placeholder calls: the method names below are illustrative, so swap in the real SDK calls.
    # client.agents.run(model="antigravity-preview-05-2026", task=task, ...)
    # downloaded = client.agents.download(run_id, "out/article.json")
    # out.write_bytes(downloaded)
    return out
 
def accept_or_retry(task: str, max_attempts: int = 3) -> bool:
    ACCEPTED.mkdir(exist_ok=True)
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        prompt = task if not feedback else f"{task}\n\nYour last output was rejected. Fix all of these:\n{feedback}"
        artifact = run_managed_agent(prompt)
        result = check_artifact(artifact, load_corpus_hashes(ACCEPTED))
        if result.accepted:
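            # Unique name per accepted file; a fixed "artifact.json" would overwrite the previous one
            dest = ACCEPTED / f"{artifact.stem}-{artifact.stat().st_mtime_ns}{artifact.suffix}"
            shutil.move(str(artifact), str(dest))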
            print(f"accepted on attempt {attempt}")
            return True
        feedback = "\n".join(f"- {r}" for r in result.reasons)
        print(f"rejected on attempt {attempt}:\n{feedback}")
    print("attempt limit reached; routing to human review")
    return False

Keeping quarantine/ and accepted/ physically separate means that whatever happens, you can see at a glance which artifacts have not yet entered production. I once tried to get by with a single directory, and a file I thought I had rejected slipped into the next day's build. Since splitting them in two, that class of mix-up has not happened again.
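The two-directory layout can be backed by two small helpers: one that lists what is still waiting in quarantine, and one that promotes a file without ever overwriting something already accepted. This is a sketch under my own assumptions; `pending` and `promote` are hypothetical names and are not part of run_and_gate.py.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def pending(quarantine: Path) -> list[Path]:
    """List JSON artifacts that have not been promoted yet, in name order."""
    return sorted(p for p in quarantine.glob("*.json") if p.is_file())


def promote(artifact: Path, accepted: Path) -> Path:
    """Move an artifact into accepted/, refusing to overwrite an existing file."""
    accepted.mkdir(parents=True, exist_ok=True)
    dest = accepted / artifact.name
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; refusing to overwrite")
    shutil.move(str(artifact), str(dest))
    return dest
```

Refusing to overwrite is a deliberate choice here: a name collision in accepted/ is more likely a pipeline mistake than something to resolve silently.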

Getting the feedback loop right

The heart of accept_or_retry is that it concatenates the rejection reasons straight into the next instruction. Telling an agent to "try again" tends to repeat the same failure, but handing it "the required field 'tags' was empty" and "the body was too thin" noticeably raises the odds that the second attempt passes.

Always cap the attempts, though. I cut off at three and send anything beyond that to a human-review queue. Unbounded retries burn cost and sandbox runtime at the agent's whim. Managed Agents spin up a container, so a single run is heavier than a one-shot API call — worth keeping in mind.
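What "send to a human-review queue" looks like is up to you. One simple option, sketched below under my own assumptions, is a third directory that holds the failed artifact next to a small file with the reasons it was rejected. The name `route_to_review` and the `.reasons.json` sidecar convention are hypothetical.

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path


def route_to_review(artifact: Path, reasons: list[str], review_dir: Path) -> Path:
    """Move a repeatedly rejected artifact into review_dir and store its reasons beside it."""
    review_dir.mkdir(parents=True, exist_ok=True)
    dest = review_dir / artifact.name
    shutil.move(str(artifact), str(dest))
    # Sidecar file so the reviewer sees why the gate rejected the artifact.
    notes = review_dir / f"{artifact.stem}.reasons.json"
    notes.write_text(
        json.dumps({"artifact": dest.name, "reasons": reasons}, indent=2),
        encoding="utf-8",
    )
    return dest
```

Moving the file out of quarantine/ keeps that directory meaning exactly one thing: output that the gate has not yet judged.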

Add LLM-as-judge as a secondary gate

Only for artifacts that already passed the deterministic primary gate, add a secondary LLM gate to assess the harder-to-measure qualities: natural prose, factual consistency. Reverse the order and it breaks — put the LLM check first and it will wave through obvious defects like a missing required field as "looks roughly fine."

# secondary gate (optional): call only AFTER the primary gate passes
def llm_review(body: str, client) -> tuple[bool, str]:
    rubric = (
        "Inspect the following body. Flag template-y intros, repeated claims, "
        "and unsupported assertions. If there are none, reply with just OK."
    )
    resp = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=[rubric, body],
    )
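    verdict = (resp.text or "").strip()  # resp.text can be None when no text comes back
    # Exact match: startswith("OK") would also pass replies like "OK, but the intro is template-y"
    return verdict == "OK", verdict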

I use gemini-3.5-flash for the secondary gate because inspection is a task where speed and low cost pay off. In my experience running it, Flash handles everyday inspection comfortably; reserve a higher-tier model or Deep Think for the final, heavy-reasoning scoring.
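The ordering rule (deterministic gate first, LLM second) can be enforced in code rather than by convention. The sketch below takes both gates as plain callables, so it makes no assumption about any SDK; `two_stage` and the type aliases are hypothetical names for illustration.

```python
from __future__ import annotations

from typing import Callable

Primary = Callable[[str], list]     # returns a list of rejection reasons (empty = pass)
Secondary = Callable[[str], tuple]  # returns (ok, verdict_text)


def two_stage(body: str, primary: Primary, secondary: Secondary) -> tuple[bool, list[str]]:
    """Run the deterministic gate first; call the LLM gate only if it passes."""
    reasons = primary(body)
    if reasons:
        # Short-circuit: no LLM call is spent on artifacts that already failed.
        return False, list(reasons)
    ok, verdict = secondary(body)
    if ok:
        return True, []
    return False, [f"LLM review: {verdict}"]
```

Passing the gates in as functions also makes the ordering easy to test with stubs, without calling a model at all.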

A first step you can take tomorrow

Stop pointing the agent at your production directory and just set up two things first: quarantine/ and acceptance_gate.py. Start with loose thresholds and confirm that all of your past artifacts pass; that way you can adjust the gate without worrying it is over-rejecting. Making the agent smarter can wait until the inspection is working.
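Confirming that past artifacts pass can itself be scripted. The following sketch runs a check function over a directory of earlier artifacts and reports only the ones that would be rejected. `backtest` and the stand-in `loose_check` are hypothetical; in practice you would pass a thin wrapper around your own gate that returns its list of reasons.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable


def loose_check(path: Path) -> list[str]:
    """Stand-in gate (assumption): requires parseable JSON with a non-empty 'body'."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as e:
        return [f"unreadable: {e}"]
    if not isinstance(data, dict) or not data.get("body"):
        return ["missing body"]
    return []


def backtest(past_dir: Path, check: Callable[[Path], list] = loose_check) -> dict[str, list[str]]:
    """Return {filename: reasons} for every past artifact the gate would reject."""
    failures: dict[str, list[str]] = {}
    for path in sorted(past_dir.glob("*.json")):
        reasons = check(path)
        if reasons:
            failures[path.name] = list(reasons)
    return failures
```

An empty result means the current thresholds accept everything you have already shipped, which is a reasonable baseline before tightening them one at a time.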

With automation, standing up the inspection before strengthening the producer is what gradually widens the range you can safely delegate. I am still tuning my own thresholds, but if you are working to fold autonomous agents into your own operations, I hope this helps.
