Dev Tools · 2026-06-13 · Advanced

Getting Ready for Gemini in Chrome's Auto Browse — Structuring a Web App Agents Can Actually Operate

Before Gemini's auto browse reaches Chrome on Android, here is how I reshaped my own web app so an agent can operate it reliably: stable action targets, a clean accessibility tree, JSON-LD, and guards on destructive actions, all with implementation code.

Tags: Gemini, Chrome, auto browse, accessibility, structured data, AI agents


Last month I had an AI browser agent operate the store flow on a small app-showcase site I run as an indie developer. The instruction was simple: "sort the popular wallpapers cheapest first, then add the top one to the cart," something a human finishes in seconds. The agent couldn't open the sort dropdown and stalled there two times out of three. The cause wasn't the model's intelligence; it was that the UI I had built myself offered no reliable target to click.

Gemini in Chrome's Android rollout, announced at I/O 26 (late June, on devices with 4GB+ RAM, starting with en-US), and its auto browse feature suggest that this kind of automated operation will soon be routine in ordinary users' hands. Below I record the specific changes I made to move my web app toward a structure an agent can operate reliably, with before-and-after code.

Auto Browse Stalls on UIs Built for Eyes Only

A browser agent like auto browse ultimately sees the same screen a human does, but when it decides what to operate, it reads the accessibility tree first — the semantically annotated element tree the browser maintains internally. If that tree lacks the information "this is the sort control" or "this is the add-to-cart button," the agent is left guessing from coordinates and text alone, and a wrong guess means a missed action.

The sort dropdown I stalled on was a cluster of styled div elements. Visually it looked like a select box; in the accessibility tree it was a "meaningless box." Humans can parse it by sight, but the agent gets no handle. I've come to see this as the first place worth fixing for the auto browse era.
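To make the missing handle concrete, here is a simplified sketch of the order in which an accessible name is derived, loosely following the W3C Accessible Name and Description Computation. It runs on a hypothetical descriptor object rather than a real DOM node and omits most of the spec's rules, but it shows why a styled div gets no name at all while a button with aria-label does.

```javascript
// Simplified sketch of the accessible-name precedence in the W3C
// "Accessible Name and Description Computation" spec. It reads a plain,
// hypothetical descriptor object instead of a live DOM node and skips
// recursion, CSS-generated content, and embedded-control rules.

// Roles allowed to take their name from their own text content (a subset).
const NAME_FROM_CONTENT_ROLES = new Set([
  "button", "link", "option", "menuitem", "tab", "checkbox", "radio", "heading",
]);

function nonEmpty(text) {
  return typeof text === "string" && text.trim() !== "";
}

function simplifiedAccessibleName(el) {
  // Hidden elements are not exposed to the tree, so they have no name.
  if (el.hidden) return "";
  // 1. aria-labelledby (here: text already resolved from the referenced ids).
  if (nonEmpty(el.labelledByText)) return el.labelledByText.trim();
  // 2. aria-label.
  if (nonEmpty(el.ariaLabel)) return el.ariaLabel.trim();
  // 3. Native labelling such as <label for>, img alt, or an SVG <title>.
  if (nonEmpty(el.nativeLabel)) return el.nativeLabel.trim();
  // 4. Name from content, only for roles that allow it. A styled div has
  //    the "generic" role, so its visible text is never used.
  if (NAME_FROM_CONTENT_ROLES.has(el.role) && nonEmpty(el.textContent)) {
    return el.textContent.trim().replace(/\s+/g, " ");
  }
  // 5. Last resort: the title attribute (tooltip text).
  if (nonEmpty(el.title)) return el.title.trim();
  return "";
}

// The "before" sort div: generic role, so its visible "Sort" text is ignored.
console.log(JSON.stringify(simplifiedAccessibleName({ role: "generic", textContent: "Sort" }))); // ""
// An icon button with aria-label: the label wins over any hidden text.
console.log(simplifiedAccessibleName({
  role: "button",
  ariaLabel: "Add this wallpaper to cart",
  textContent: "Add to cart",
})); // Add this wallpaper to cart
```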

Pin the Action Targets — Accessible Names and Stable Hooks

The first thing I fixed was making sure operable targets can always be found by the same name and attributes. Before the refactor, decoration came first: buttons were icon-only and labels relied on tooltips.

Before:

<!-- Looks like a cart button, but to an agent it's an unnamed div -->
<div class="cart-icon" onclick="addToCart(123)">
  <svg>...</svg>
</div>
 
<!-- Sort: a custom implementation instead of a native select -->
<div class="sort-dropdown" data-open="false">
  <span>Sort</span>
  <ul class="options">
    <li onclick="sortBy('price-asc')">Price: low to high</li>
  </ul>
</div>

After the refactor, each target became a native element with an accessible name and a stable hook attribute that does not change when the design does.

<!-- A native button has an implicit button role; aria-label names what it does (and takes precedence over the hidden span text). data-action is a stable hook -->
<button
  type="button"
  aria-label="Add this wallpaper to cart"
  data-action="add-to-cart"
  data-product-id="123"
  onclick="addToCart(123)">
  <svg aria-hidden="true">...</svg>
  <span class="visually-hidden">Add to cart</span>
</button>
 
<!-- Sort goes back to a native select; the tree recognizes it as a combobox automatically -->
<label for="sort-order">Sort</label>
<select id="sort-order" data-action="sort-order" onchange="applySort(this.value)">
  <option value="popular">Most popular</option>
  <option value="price-asc">Price: low to high</option>
  <option value="price-desc">Price: high to low</option>
</select>

Three things matter here. First, a native select is far easier for an agent to operate than a custom dropdown. Second, aria-label lets you name the function independently of the visual design. Third, a stable attribute like data-action lets you change class names for design reasons without breaking the agent's handle. Because I rename classes often for visual reasons, I standardized on data-action as the hook for every operable element.
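One way to lean on data-action consistently is to route every action through it. The sketch below uses hypothetical helpers, createActionRouter and installActionRouter (not from any library): a Map from data-action values to handlers, plus delegated click and change listeners on the document.

```javascript
// Sketch: dispatch UI actions through the stable data-action hook rather
// than class names. createActionRouter and installActionRouter are
// hypothetical helpers, not part of any library.

function createActionRouter(handlers) {
  // handlers: Map from data-action value to a function receiving the element.
  return function route(element) {
    const action = element.dataset && element.dataset.action;
    const handler = handlers.get(action);
    if (typeof handler !== "function") return false; // unknown hook: ignore
    handler(element);
    return true;
  };
}

// Browser-only wiring (defined here, only meaningful inside a page): one
// delegated listener per event type, so restyling or renaming classes never
// detaches a handler. Buttons route on click, selects route on change.
function installActionRouter(route, root = document) {
  root.addEventListener("click", (event) => {
    const target = event.target.closest("[data-action]");
    if (target && target.tagName !== "SELECT") route(target);
  });
  root.addEventListener("change", (event) => {
    const target = event.target.closest("select[data-action]");
    if (target) route(target);
  });
}

// Example handlers keyed by the data-action values used in the markup above.
const route = createActionRouter(new Map([
  ["add-to-cart", (el) => console.log("add", el.dataset.productId)],
  ["sort-order", (el) => console.log("sort", el.value)],
]));
```

If you adopted delegation like this, the inline onclick and onchange attributes in the markup would fire a second time, so they would be removed and the router would own dispatch.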


Make the Accessibility Tree the Agent's Map

Once individual targets are pinned, the next step is tidying the map of the whole page. Agents learn where the major regions are from landmarks. Before the refactor my page was nothing but nested div elements, so the header, main content, and search were indistinguishable.

<!-- After: declare regions with landmark roles -->
<header>
  <nav aria-label="Main navigation">...</nav>
</header>
<main>
  <section aria-labelledby="catalog-heading">
    <h1 id="catalog-heading">Wallpaper catalog</h1>
    <form role="search" aria-label="Search wallpapers">
      <input type="search" name="q" aria-label="Keyword" />
      <button type="submit">Search</button>
    </form>
    <ul aria-label="Search results">
      <li>...</li>
    </ul>
  </section>
</main>

With landmarks such as main, nav, and search, the agent reaches "the search form is here, the results list is here" in one hop. In my experience, compound instructions like "search, then open the third item from the top" became visibly more stable after this cleanup. I used to treat accessibility work as something for human assistive technology; for the auto browse era I've revised that view, because it is also a map for agents.
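To see why some of this markup becomes a landmark and some does not, here is a simplified sketch of the implicit landmark roles HTML elements receive. The descriptor shape is hypothetical and edge cases are omitted; the point is that a section needs an accessible name to become a region, and a header stops being the page banner once it sits inside sectioning content.

```javascript
// Simplified sketch of how common HTML elements map to landmark roles,
// following the HTML accessibility API mappings. The descriptor shape is
// hypothetical; edge cases (e.g. how aside is scoped) are left out.

const LANDMARK_ROLES = new Set([
  "banner", "complementary", "contentinfo", "form", "main", "navigation", "region", "search",
]);
// A header or footer inside one of these is no longer page-level.
const SECTIONING = new Set(["article", "aside", "main", "nav", "section"]);

function landmarkRole({ tag, role = null, hasAccessibleName = false, ancestors = [] }) {
  // An explicit role attribute (e.g. role="search" on a form) wins.
  if (role) return LANDMARK_ROLES.has(role) ? role : null;
  const pageLevel = !ancestors.some((name) => SECTIONING.has(name));
  switch (tag) {
    case "main": return "main";
    case "nav": return "navigation";
    case "search": return "search";
    case "header": return pageLevel ? "banner" : null;
    case "footer": return pageLevel ? "contentinfo" : null;
    // section and form only become landmarks once they have an accessible
    // name, which is what aria-labelledby on the catalog section provides.
    case "section": return hasAccessibleName ? "region" : null;
    case "form": return hasAccessibleName ? "form" : null;
    default: return null;
  }
}
```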

State Page Intent Explicitly with JSON-LD

The tree is a map for operation, but if the agent misreads what the page even is, it starts from the wrong premise before it operates at all. So I make page intent machine-readable with structured data (JSON-LD). On a product page, a single Product entry hands the agent the name, price, and stock as unambiguous inputs for its decisions.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Daybreak Gradient Wallpaper",
  "image": "https://example.com/wallpapers/123.png",
  "sku": "WP-123",
  "offers": {
    "@type": "Offer",
    "price": "250",
    "priceCurrency": "JPY",
    "availability": "https://schema.org/InStock"
  }
}
</script>

One caution here: the price and stock you write in JSON-LD must match the values actually shown on screen. At first I hit a mismatch where cached static JSON-LD diverged from a price computed on the client, and the agent read the stale price. Treat structured data as a first-class rendering of the same data the screen shows, not a lagging shadow of it, and render both from a single data source; once I did, the drift vanished. In production I keep a small test that checks the JSON-LD and the DOM display reference the same values.
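A minimal sketch of rendering from one source, assuming a plain product object that also feeds the visible price: productJsonLd builds the entry, serializeJsonLd prepares it for the script tag, and jsonLdMatchesDisplay is the kind of drift check described above. All three function names are hypothetical, as is reading the displayed values from data attributes in a test.

```javascript
// Sketch: build the JSON-LD from the same product object the page renders,
// and check a rendered page for drift. The product fields, and reading the
// displayed values from data attributes, are assumptions.

const IN_STOCK = "https://schema.org/InStock";
const OUT_OF_STOCK = "https://schema.org/OutOfStock";

function productJsonLd(product) {
  return {
    "@context": "https://schema.org",
    "@type": "Product",
    name: product.name,
    sku: product.sku,
    offers: {
      "@type": "Offer",
      price: String(product.price),
      priceCurrency: product.currency,
      availability: product.inStock ? IN_STOCK : OUT_OF_STOCK,
    },
  };
}

// Serialize for a script tag of type application/ld+json. Escaping "<"
// keeps a product name containing a closing script tag from ending it early.
function serializeJsonLd(data) {
  return JSON.stringify(data).replace(/</g, "\\u003c");
}

// Drift check: compare parsed JSON-LD with the values the page displays
// (e.g. read in a test from hypothetical data-price / data-in-stock attributes).
function jsonLdMatchesDisplay(jsonLdText, displayed) {
  const offer = JSON.parse(jsonLdText).offers;
  return (
    Number(offer.price) === displayed.price &&
    offer.priceCurrency === displayed.currency &&
    (offer.availability === IN_STOCK) === displayed.inStock
  );
}

const wallpaper = {
  name: "Daybreak Gradient Wallpaper", sku: "WP-123", price: 250, currency: "JPY", inStock: true,
};
```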

The same idea applies beyond product pages. Marking an article page with Article, or an app showcase with SoftwareApplication, makes it harder for the agent to confuse "is this a page to read, or a page to operate?" After I added SoftwareApplication to the showcase page of an app I distribute on the App Store, the agent seemed to identify the download link in a single pass. It changed my assumption that structured data is only for SEO.
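A SoftwareApplication entry can be generated the same way. In this sketch the app name, category, and download URL are placeholders; name, operatingSystem, applicationCategory, downloadUrl, and offers are schema.org properties of SoftwareApplication.

```javascript
// Sketch: a SoftwareApplication entry for an app showcase page.
// The app details and URL below are placeholders, not real values.

function softwareApplicationJsonLd(app) {
  return {
    "@context": "https://schema.org",
    "@type": "SoftwareApplication",
    name: app.name,
    operatingSystem: app.operatingSystem,
    applicationCategory: app.category,
    // The link you want the agent to treat as "the download link".
    downloadUrl: app.downloadUrl,
    offers: { "@type": "Offer", price: String(app.price), priceCurrency: app.currency },
  };
}

const showcase = softwareApplicationJsonLd({
  name: "Example Wallpapers", // placeholder
  operatingSystem: "iOS",
  category: "UtilitiesApplication", // placeholder category
  downloadUrl: "https://example.com/download", // placeholder for the App Store link
  price: 0,
  currency: "JPY",
});
console.log(JSON.stringify(showcase, null, 2));
```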

Keep State and URLs Deterministic

Agents get lost when premises shift mid-operation. A design where "filter conditions never reach the URL and live only in hidden screen state" is especially risky. Before the refactor, my catalog kept sort and filter only in JavaScript variables, so once the agent re-read the page, the conditions were gone.

After, I reflected all filter state into query parameters, making URLs shareable and reproducible.

// Always reflect sort/filter changes into the URL
function applySort(value) {
  const url = new URL(location.href);
  url.searchParams.set("sort", value);
  // Putting state in the URL preserves conditions even if the agent re-reads the page
  history.replaceState(null, "", url);
  renderCatalog(readStateFromUrl());
}
 
function readStateFromUrl() {
  const p = new URL(location.href).searchParams;
  return {
    sort: p.get("sort") ?? "popular",
    category: p.get("category") ?? "all",
  };
}
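The same approach can be pushed a step further so that equal state always yields an identical URL. This sketch uses hypothetical parseCatalogState and writeCatalogState helpers, with the allowed sort values taken from the select options earlier; it whitelists values, omits defaults, and orders parameters with URLSearchParams.sort().

```javascript
// Sketch: parse and write catalog state so that the same state always
// produces the same URL. Helper names and allowed values are assumptions.

const DEFAULTS = { sort: "popular", category: "all" };
const ALLOWED_SORTS = new Set(["popular", "price-asc", "price-desc"]);

function parseCatalogState(href) {
  const params = new URL(href).searchParams;
  const sort = params.get("sort");
  return {
    // Unknown sort values fall back to the default instead of breaking rendering.
    sort: ALLOWED_SORTS.has(sort) ? sort : DEFAULTS.sort,
    category: params.get("category") || DEFAULTS.category,
  };
}

function writeCatalogState(href, state) {
  const url = new URL(href);
  for (const key of Object.keys(DEFAULTS)) {
    // Omit defaults so equivalent states share one URL.
    if (state[key] === undefined || state[key] === DEFAULTS[key]) {
      url.searchParams.delete(key);
    } else {
      url.searchParams.set(key, state[key]);
    }
  }
  url.searchParams.sort(); // stable parameter order
  return url.toString();
}
```

In the page, the returned string could be passed to history.replaceState in place of the URL object.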

I was also mindful of idempotency — running the same operation twice shouldn't break the result. Making "add to cart" idempotent per data-product-id, for instance, means the quantity won't double if the agent re-clicks to confirm. Auto browse hesitates less than a human, but occasionally repeats the same action, so idempotency is a practical safety net.
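A minimal sketch of idempotent cart updates, assuming a Map from product ID to quantity; addToCartOnce and setCartQuantity are hypothetical names, not the addToCart function from the markup earlier.

```javascript
// Sketch: cart updates that are safe to repeat. Adding an item already in
// the cart is a no-op, and quantity changes set an absolute value instead of
// incrementing. The function names and Map-based cart are hypothetical.

function addToCartOnce(cart, productId) {
  if (cart.has(productId)) return cart; // repeated click: nothing changes
  const next = new Map(cart);
  next.set(productId, 1);
  return next;
}

function setCartQuantity(cart, productId, quantity) {
  if (!Number.isInteger(quantity) || quantity < 0) {
    throw new RangeError(`invalid quantity: ${quantity}`);
  }
  const next = new Map(cart);
  // Setting an absolute value (not "+1") gives the same result however often it runs.
  if (quantity === 0) next.delete(productId);
  else next.set(productId, quantity);
  return next;
}
```

If the cart lives on a server, the same rule has to hold there as well, since a retried request can repeat an action just like a re-click.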

Don't Let Agents Auto-Execute Destructive Actions

This is the part I value most as a design principle. An agent being able to operate something is a separate matter from whether everything should run automatically. Irreversible operations — deletion, checkout confirmation, account closure — I design so they never complete in one click, with an explicit human confirmation in between.

<!-- Destructive operations require a confirmation step; aria-describedby conveys intent -->
<button
  type="button"
  data-action="delete-account"
  aria-describedby="delete-warning"
  onclick="openConfirmDialog('delete-account')">
  Delete account
</button>
<p id="delete-warning" role="note">
  This action cannot be undone. The confirmation screen requires identity verification.
</p>

Requiring a two-stage confirmation in the dialog means that even if the agent mistakenly walks into a delete flow, the decision always returns to the human before the final commit. I draw the line here to reconcile "agents can move fast" with "humans hold the final say." In implementation, I list the data-action values of destructive operations and guarantee with unit tests that they can only fire through the confirmation dialog.
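A minimal sketch of such a gate, assuming a hypothetical createActionGate API and a Map from destructive data-action values to confirmation phrases. The typed phrase only stands in for the identity verification mentioned in the warning text; a real flow would re-authenticate the user.

```javascript
// Sketch of a two-stage confirmation gate for destructive data-action values.
// The action list, phrases, and gate API are hypothetical; the typed phrase
// stands in for a real identity check such as re-authentication.

const CONFIRM_PHRASES = new Map([
  ["delete-account", "delete my account"],
  ["confirm-checkout", "place order"],
]);

function createActionGate(handlers) {
  const pending = new Map(); // dialog token -> destructive action
  let nextId = 0;

  function run(action) {
    const handler = handlers.get(action);
    if (typeof handler !== "function") throw new Error(`no handler for ${action}`);
    handler();
    return { status: "done" };
  }

  return {
    // Stage 1: safe actions run at once; destructive ones only open the dialog.
    request(action) {
      if (!CONFIRM_PHRASES.has(action)) return run(action);
      const token = `dialog-${++nextId}`;
      pending.set(token, action);
      return { status: "needs-confirmation", token };
    },
    // Stage 2: only the dialog's confirm button calls this, with the typed phrase.
    confirm(token, typedPhrase) {
      const action = pending.get(token);
      if (action === undefined || typedPhrase !== CONFIRM_PHRASES.get(action)) {
        return { status: "rejected" };
      }
      pending.delete(token); // tokens are single-use
      return run(action);
    },
  };
}
```

A unit test in this style iterates over every destructive action and asserts that request() alone never runs its handler.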

A Verification Harness Assuming an On-Device Agent

Finally, here is how I confirmed the refactor worked. I repeated the following checks:

1. Open the tree in the browser's accessibility inspector and check by eye that every element with data-action has an appropriate role and accessible name.
2. Run the main flow (search → filter → detail → cart) through the agent five times, each time with the same natural-language instruction, and record the completion count.
3. Cross-check with an automated test that the price and stock in JSON-LD match the screen display.
4. Confirm with tests that destructive data-action values never fire without the confirmation dialog.
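The first check can also be backed by an automated test. This sketch assumes you can collect, for every element carrying data-action, its role and accessible name from your test tooling; auditActionTargets and the descriptor shape are hypothetical, and "legacy-filter" is an invented example of a target that still needs fixing.

```javascript
// Sketch: an automated complement to eyeballing the accessibility inspector.
// Each entry describes one element carrying data-action, with the role and
// accessible name as your test tooling reports them (descriptor shape assumed).

const NON_INTERACTIVE_ROLES = new Set(["generic", "none", "presentation"]);

function auditActionTargets(targets) {
  const problems = [];
  for (const { action, role, name } of targets) {
    if (!role || NON_INTERACTIVE_ROLES.has(role)) {
      problems.push(`${action}: missing interactive role`);
    }
    if (typeof name !== "string" || name.trim() === "") {
      problems.push(`${action}: empty accessible name`);
    }
  }
  return problems;
}

const report = auditActionTargets([
  { action: "add-to-cart", role: "button", name: "Add this wallpaper to cart" },
  { action: "sort-order", role: "combobox", name: "Sort" },
  { action: "legacy-filter", role: "generic", name: "" }, // hypothetical leftover div
]);
console.log(report);
```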

Before the refactor, the instruction that ran the three steps filter → detail → cart completed only two times out of five. After pinning action targets, adding landmarks, and moving state into the URL, the same instruction completed five out of five. Completion rate is a crude metric, but it serves as a first approximation of "can a user's agent actually get things done on my site" once auto browse becomes common. On a store flow tied directly to AdMob revenue, I see that gap as the gap in missed conversions.

As for where to start, I recommend touching the flows tied directly to revenue or drop-off first. Trying to rebuild every page at once tends to entangle your production design in the change and cause breakage, so narrowing to a single flow such as purchase or sign-up is the safer path. I once nearly broke my layout by adding data-action to every screen at once, and I've since settled on widening the change one flow at a time, verifying as I go. A staged migration like this makes it easier to avoid clashes between agent-readiness and design changes, and keeps the production impact predictable.

As a next step, pick the single most important flow you have, and start by giving its operable targets a data-action and an accessible name. Tidy even one screen and you should immediately feel the difference in how stably an agent behaves.

Thank You for Reading

Gemini Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.