●CLI — As of Jun 18, Gemini CLI and the Gemini Code Assist IDE extensions stop serving AI Pro/Ultra and free individual users; Antigravity CLI is the successor●FLASH — The Gemini 3.5 series begins with 3.5 Flash, built for agents and coding with strength on long-horizon tasks●DEEPTHINK — Gemini 3 Deep Think is rolling out to Google AI Ultra as the top reasoning mode for math, science, and logic●APP — The Gemini app gains a Daily Brief, a redesigned interface, the Gemini Omni video model, and a personal agent called Gemini Spark●DESIGN — A new design language, Neural Expressive, rebuilds the experience for richer visuals and faster switching between modalities●ULTRA — Google AI Ultra bundles top model access, Deep Research, Veo 3 video, and a 1M-token context window
Before You Let a Managed Agent Ship: Designing Your Own Acceptance Gate
If you let the public-preview Managed Agents write files straight into production, broken artifacts will flow in with them. Here is how to build a verification gate that artifacts must pass before you accept them, with runnable Python and a rejection-feedback loop.
The first time the public-preview Managed Agents gave me a cold sweat was when an agent confidently handed back a broken artifact. Having Google's isolated Linux sandbox handle planning, reasoning, code execution, and file operations end to end is genuinely convenient. But the moment you drop that output into a production directory, you have left yourself without a single quality gatekeeper.
As an indie developer running content automation for four sites on my own, that missing gatekeeper was a real danger. I have learned the hard way that if thin output escapes even once, the whole site's standing slowly erodes. So before talking about making agents smarter, I want to share the design that matters more: a layer built specifically to not trust the agent's output.
Why a "smart agent" alone can't ship to production
The public-preview Managed Agents run statefully and will rewrite files for you. What is easy to miss is that the success status an agent returns only means "it self-reported that it finished the task." A file was indeed created inside the container. Whether it meets your requirements is a separate question entirely.
Here is what actually went wrong for me:
- A generated article file was missing one required frontmatter field (the agent reported "done")
- In another task, the artifact reused a paragraph verbatim from a previous job
- An output JSON had a key name that differed slightly from last time, and the downstream build silently returned an empty array
None of these surface as agent-side errors. That is exactly why you have to place deterministic verification code on the receiving side — judgment logic a human wrote, independent of the agent's self-report.
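As a minimal sketch of what receiving-side verification can look like, the key-name drift from the third failure can be caught by comparing an artifact's top-level keys against a fixed expected set. The `EXPECTED_KEYS` set and the `key_drift` helper are illustrative names chosen for this example, not part of any Managed Agents API.

```python
from __future__ import annotations

import json

# Assumption: the keys the downstream build actually relies on.
EXPECTED_KEYS = {"title", "slug", "body", "tags"}


def key_drift(raw_json: str, expected: set[str] = EXPECTED_KEYS) -> dict[str, list[str]]:
    """Return the missing and unexpected top-level keys of a JSON object."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("artifact must be a JSON object")
    keys = set(data)
    return {
        "missing": sorted(expected - keys),
        "unexpected": sorted(keys - expected),
    }


if __name__ == "__main__":
    # "tag" instead of "tags": the kind of drift that silently empties a build.
    drifted = '{"title": "t", "slug": "s", "body": "b", "tag": ["x"]}'
    print(key_drift(drifted))
```

Reporting both directions matters: a missing key alone tells you something broke, while the unexpected key usually tells you what the agent renamed it to.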
The shape of the acceptance gate
The skeleton is simple. The agent is the "producer"; the gate is "inspection."
1. Run the Managed Agent inside the sandbox and have it generate an artifact
2. Pull the artifact file out of the sandbox into quarantine/, not accepted/
3. Run it through the acceptance gate (schema, duplicates, required signals, all mechanical)
4. If it passes, move it to accepted/. If not, record a structured rejection reason
5. Return the rejection reason to the agent and have it rewrite the same task (feedback loop)
The key is that step 3 must not be delegated to another LLM. LLM-as-judge is useful as a secondary layer, but placing it at the primary gate just adds one more "clever but capricious inspector." The primary gate is always deterministic code.
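One way to handle the "record a structured rejection reason" step is an append-only JSON Lines log. The sketch below is an assumption about how such a record could be shaped; the `Rejection` dataclass, its field names, and the `rejections.jsonl` file name are hypothetical.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Rejection:
    artifact: str        # file name of the rejected artifact
    reasons: list[str]   # human-readable reasons from the gate
    attempt: int         # which attempt produced this artifact
    rejected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_rejection(log_path: Path, rejection: Rejection) -> None:
    """Append one rejection to the log as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(rejection), ensure_ascii=False) + "\n")


def load_rejections(log_path: Path) -> list[dict]:
    """Read the log back; a missing file means no rejections yet."""
    if not log_path.exists():
        return []
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```

Keeping the reasons as plain text in the log means the same strings can later be reused as feedback to the agent and reviewed by a human without any translation step.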
Here is a working gate assuming the artifact is an "article JSON." It checks three things: schema, required signals, and verbatim paragraph duplication.
```python
# acceptance_gate.py: inspect agent artifacts (keep the primary gate deterministic)
from __future__ import annotations

import hashlib
import json
import re
import sys
from dataclasses import dataclass, field
from pathlib import Path

REQUIRED_FIELDS = ("title", "slug", "body", "tags")
MIN_BODY_CHARS = 1200  # sanity floor for thin artifacts
MIN_SIGNALS = 3        # minimum practical-utility signals


@dataclass
class GateResult:
    accepted: bool
    reasons: list[str] = field(default_factory=list)

    def reject(self, msg: str) -> None:
        self.accepted = False
        self.reasons.append(msg)


def _paragraphs(text: str) -> list[str]:
    # Split on blank lines; paragraphs under 80 chars are too short to count as duplicates.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if len(p.strip()) >= 80]


def _practical_signals(body: str) -> int:
    """Count things that give the reader real value: code, numbers, steps, first-hand context."""
    signals = 0
    if (chr(96) * 3) in body:
        signals += 1  # runnable code (code fence)
    if re.search(r"\d+\s?(ms|MB|%|req/|tok)", body):
        signals += 1  # concrete measurements
    if re.search(r"^\s*\d+\.\s", body, re.M):
        signals += 1  # numbered procedure
    # Word boundaries keep "I" from matching inside words such as "API".
    if re.search(r"\b(I|in production|actually|first-hand)\b", body):
        signals += 1  # first-hand context
    return signals


def check_artifact(path: Path, corpus_hashes: set[str]) -> GateResult:
    res = GateResult(accepted=True)
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, UnicodeDecodeError) as e:
        res.reject(f"Cannot read artifact: {e}")
        return res
    except json.JSONDecodeError as e:
        res.reject(f"Cannot parse as JSON: {e}")
        return res
    if not isinstance(data, dict):
        res.reject("Top-level JSON value must be an object")
        return res

    # 1) Schema: required fields present
    for key in REQUIRED_FIELDS:
        if not data.get(key):
            res.reject(f"Required field '{key}' is empty or missing")

    body = data.get("body", "")
    if not isinstance(body, str):
        res.reject("Field 'body' must be a string")
        return res

    # 2) Sanity: reject artifacts that are too thin
    if len(body) < MIN_BODY_CHARS:
        res.reject(f"Body is {len(body)} chars, below the {MIN_BODY_CHARS} floor")

    # 3) Practical signals
    n = _practical_signals(body)
    if n < MIN_SIGNALS:
        res.reject(f"Only {n} practical signals (need at least {MIN_SIGNALS})")

    # 4) Verbatim duplication against the existing corpus
    for para in _paragraphs(body):
        h = hashlib.sha256(para.encode("utf-8")).hexdigest()
        if h in corpus_hashes:
            res.reject("Contains a paragraph that matches a previous artifact verbatim")
            break
    return res


def load_corpus_hashes(accepted_dir: Path) -> set[str]:
    # Hash every paragraph of every already-accepted artifact.
    hashes: set[str] = set()
    for f in accepted_dir.glob("*.json"):
        body = json.loads(f.read_text(encoding="utf-8")).get("body", "")
        for para in _paragraphs(body):
            hashes.add(hashlib.sha256(para.encode("utf-8")).hexdigest())
    return hashes


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: python acceptance_gate.py ARTIFACT.json [ACCEPTED_DIR]")
    artifact = Path(sys.argv[1])
    accepted = Path(sys.argv[2]) if len(sys.argv) > 2 else Path("accepted")
    accepted.mkdir(exist_ok=True)
    result = check_artifact(artifact, load_corpus_hashes(accepted))
    if result.accepted:
        print("OK accepted")
        sys.exit(0)
    print("rejected")
    for r in result.reasons:
        print(" -", r)
    sys.exit(1)
```
The deliberate choice here is to return rejection as a list of reasons, not a boolean. We will feed those reasons straight back into the agent's next instruction, so it is worth keeping "why it failed" as text. Tune thresholds like MIN_SIGNALS to your own content bar. I count four facets — measurements, runnable code, numbered steps, first-hand context — and pass anything with three or more.
Pulling artifacts out via quarantine
Keep the wrapper around the agent run and artifact retrieval thin. Because Managed Agents is in public preview and the API is still changing, the safe move is to isolate the agent call behind a single swappable function. Generated output always lands in quarantine/, never accepted/.
```python
# run_and_gate.py: run agent -> quarantine -> gate -> accepted
import shutil
import uuid
from pathlib import Path

from acceptance_gate import check_artifact, load_corpus_hashes

QUARANTINE = Path("quarantine")
ACCEPTED = Path("accepted")


def run_managed_agent(task: str) -> Path:
    """
    Start a Managed Agent and pull its artifact into quarantine.

    The public-preview API may change, so keep ONLY this function swappable.
    In production, wrap the agents.create -> run -> list_files -> download
    flow inside here and always return a Path under quarantine.
    """
    QUARANTINE.mkdir(exist_ok=True)
    # Unique name per run, so an accepted artifact never overwrites an earlier one.
    out = QUARANTINE / f"artifact-{uuid.uuid4().hex}.json"
    # client.agents.run(model="antigravity-preview-05-2026", task=task, ...)
    # downloaded = client.agents.download(run_id, "out/article.json")
    # out.write_bytes(downloaded)
    return out


def accept_or_retry(task: str, max_attempts: int = 3) -> bool:
    ACCEPTED.mkdir(exist_ok=True)
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        # After a rejection, append the concrete reasons to the original task.
        prompt = task if not feedback else (
            f"{task}\n\nYour last output was rejected. Fix all of these:\n{feedback}"
        )
        artifact = run_managed_agent(prompt)
        result = check_artifact(artifact, load_corpus_hashes(ACCEPTED))
        if result.accepted:
            shutil.move(str(artifact), str(ACCEPTED / artifact.name))
            print(f"accepted on attempt {attempt}")
            return True
        feedback = "\n".join(f"- {r}" for r in result.reasons)
        print(f"rejected on attempt {attempt}:\n{feedback}")
    print("attempt limit reached; routing to human review")
    return False
```
Keeping quarantine/ and accepted/ physically separate means that whatever happens, you can see at a glance which artifacts have not yet entered production. I once tried to get by with a single directory and a file I thought I had rejected slipped into the next day's build. Since splitting them in two, that class of mix-up has been zero.
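The two-directory layout also makes it cheap to state the rule in code: the build reads from accepted/ only, and anything still sitting in quarantine/ is pending by definition. A small sketch, with hypothetical helper names `pending_artifacts` and `build_inputs`:

```python
from __future__ import annotations

from pathlib import Path


def pending_artifacts(quarantine: Path) -> list[str]:
    """Names of artifacts that have not passed the gate yet."""
    if not quarantine.is_dir():
        return []
    return sorted(p.name for p in quarantine.glob("*.json"))


def build_inputs(accepted: Path) -> list[Path]:
    """The only files a build should ever read: those under accepted/."""
    if not accepted.is_dir():
        return []
    return sorted(accepted.glob("*.json"))
```

If the build step takes its file list from `build_inputs` and never globs a shared directory, a rejected file has no path into the next day's build.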
Getting the feedback loop right
The heart of accept_or_retry is that it concatenates the rejection reasons straight into the next instruction. Telling an agent to "try again" just repeats the same failure, but handing it "the required field 'tags' was empty" and "the body was too thin" noticeably raises the odds it passes on the second pass.
Always cap the attempts, though. I cut off at three and send anything beyond that to a human-review queue. Unbounded retries burn cost and sandbox runtime at the agent's whim. Managed Agents spin up a container, so a single run is heavier than a one-shot API call — worth keeping in mind.
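The human-review queue can be as simple as a third directory. The following is a sketch under that assumption: `route_to_review` and the `.reasons.json` sidecar naming are hypothetical, chosen so a reviewer sees the artifact and the gate's reasons side by side.

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path


def route_to_review(artifact: Path, reasons: list[str], review_dir: Path) -> Path:
    """Move a repeatedly rejected artifact into review/ with its rejection reasons."""
    review_dir.mkdir(parents=True, exist_ok=True)
    dest = review_dir / artifact.name
    shutil.move(str(artifact), str(dest))
    # Sidecar file, e.g. artifact.reasons.json, holding the last gate verdict.
    sidecar = dest.with_suffix(".reasons.json")
    sidecar.write_text(
        json.dumps({"artifact": dest.name, "reasons": reasons}, indent=2),
        encoding="utf-8",
    )
    return dest
```

Calling this at the point where the retry loop gives up keeps quarantine/ from accumulating artifacts that nobody is going to look at.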
Add LLM-as-judge as a secondary gate
Only for artifacts that already passed the deterministic primary gate, add a secondary LLM gate to assess the harder-to-measure qualities: natural prose, factual consistency. Reverse the order and it breaks — put the LLM check first and it will wave through obvious defects like a missing required field as "looks roughly fine."
```python
# secondary gate (optional): call only AFTER the primary gate passes
def llm_review(body: str, client) -> tuple[bool, str]:
    rubric = (
        "Inspect the following body. Flag template-y intros, repeated claims, "
        "and unsupported assertions. If there are none, reply with just OK."
    )
    resp = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=[rubric, body],
    )
    # resp.text can be empty (for example, a blocked response); treat that as a failure.
    verdict = (resp.text or "").strip()
    return verdict.startswith("OK"), verdict
```
I use gemini-3.5-flash for the secondary gate because inspection is a task where speed and low cost pay off. Reserve a higher-tier model or Deep Think for final, reasoning-heavy scoring; in my experience running it, Flash handles everyday inspection comfortably.
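The ordering rule (deterministic gate first, LLM second) can be enforced structurally instead of by convention. Below is a minimal sketch in which each gate is a plain callable; `run_gates`, `Verdict`, and the stand-in `length_gate` are illustrative names, and the secondary gate is whatever LLM-backed function you plug in.

```python
from typing import Callable, List, Optional, Tuple

Verdict = Tuple[bool, List[str]]   # (accepted, reasons)
Gate = Callable[[str], Verdict]


def run_gates(body: str, primary: Gate, secondary: Optional[Gate] = None) -> Verdict:
    """Run the deterministic gate first; call the secondary gate only if it passes."""
    ok, reasons = primary(body)
    if not ok:
        return False, reasons      # the secondary gate is never called
    if secondary is None:
        return True, []
    return secondary(body)


def length_gate(body: str) -> Verdict:
    """Stand-in deterministic check: the body must be at least 20 characters."""
    if len(body) < 20:
        return False, [f"Body is {len(body)} chars, below the 20 floor"]
    return True, []
```

Because the secondary gate is skipped on a primary failure, obvious defects cost nothing in model calls and can never be waved through by a lenient reviewer.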
A first step you can take tomorrow
Stop pointing the agent at your production directory and just set up two things first: quarantine/ and acceptance_gate.py. Start with loose thresholds and confirm that all of your past artifacts pass; that way you can adjust the gate without worrying it is over-rejecting. Making the agent smarter can wait until the inspection is working.
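Checking that past artifacts pass can itself be scripted. The sketch below runs any check function over a directory of known-good artifacts and reports which ones a candidate threshold would reject. `calibrate` and `body_floor` are hypothetical helpers, and `body_floor` stands in for the full gate to keep the example self-contained.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable


def calibrate(accepted_dir: Path, check: Callable[[Path], list[str]]) -> dict[str, list[str]]:
    """Return {file name: rejection reasons} for past artifacts that would now fail."""
    failures: dict[str, list[str]] = {}
    for path in sorted(accepted_dir.glob("*.json")):
        reasons = check(path)
        if reasons:
            failures[path.name] = reasons
    return failures


def body_floor(min_chars: int) -> Callable[[Path], list[str]]:
    """Build a check that rejects bodies shorter than min_chars."""
    def check(path: Path) -> list[str]:
        body = json.loads(path.read_text(encoding="utf-8")).get("body", "")
        if len(body) < min_chars:
            return [f"Body is {len(body)} chars, below the {min_chars} floor"]
        return []
    return check
```

An empty result means the threshold rejects nothing you previously considered good. If you wire this to the article's `check_artifact`, pass an empty set as `corpus_hashes`; otherwise every past artifact matches its own paragraphs and is flagged as a duplicate.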
With automation, standing up the inspection before strengthening the producer is what gradually widens the range you can safely delegate. I am still tuning my own thresholds, but if you are working to fold autonomous agents into your own operations, I hope this helps.
Thank You for Reading
Gemini Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.