Vercel's agent-browser: Why a CLI Beats MCP for Browser Automation

10 min read
Bart Waardenburg

AI Agent Readiness Expert & Founder

Vercel's agent-browser hit 22,000 GitHub stars in two months. It's a browser automation tool built specifically for AI agents, written in Rust, connecting directly to Chrome via CDP. But the interesting part isn't the star count or the language choice. It's what they decided not to build: an MCP server.

In a world where every tool is shipping MCP integrations, Vercel went the other direction. They built a CLI. A plain, boring, Unix-philosophy CLI that takes commands and prints text to stdout. And the performance data makes it hard to argue with that decision.

The Context Window Tax

Browser automation via MCP has a token problem. According to community benchmarks, when Playwright MCP starts a session, it loads an estimated ~13,700 tokens of tool definitions into the agent's context window before a single page is visited. Chrome DevTools MCP is reportedly worse at ~17,000 tokens. That's roughly 9% of a 200K context window consumed before the agent does anything useful.

Then there's the per-action cost. In one empirical test by Engin Diri at Pulumi, a single Playwright MCP click response weighed in at 12,891 characters. The same click via agent-browser? Six characters: Done. Page complexity matters, but the order-of-magnitude difference is consistent.

- Playwright MCP click response: 12,891 chars
- agent-browser click response: 6 chars
- Playwright MCP startup tokens: ~13,700
- agent-browser startup tokens: 0

Over a real workflow this compounds fast. Third-party benchmarks have estimated the token cost of a 10-step automation flow across three tools:

| Tool | Tokens (10-step flow) | Reduction vs Playwright MCP |
| --- | --- | --- |
| Playwright MCP | ~114,000 | Baseline |
| Chrome DevTools MCP | ~50,000 | 56% |
| agent-browser (CLI) | ~7,000 | 94% |

94% token reduction. That's the difference between an agent that runs out of context after three pages and one that completes an entire multi-step workflow with headroom to spare.
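To see how fast the context tax compounds, here's a back-of-envelope sketch using the benchmark estimates above. The per-step averages are derived from the 10-step totals; treat them as rough estimates, not measurements of your workload.

```python
# Back-of-envelope: how many automation steps fit in a 200K-token window
# before tool traffic alone exhausts it. Numbers are the community
# benchmark estimates quoted above.
CONTEXT_WINDOW = 200_000

tools = {
    # name: (startup tokens, average tokens per step from the 10-step totals)
    "Playwright MCP":      (13_700, 114_000 / 10),
    "Chrome DevTools MCP": (17_000,  50_000 / 10),
    "agent-browser (CLI)": (0,        7_000 / 10),
}

def steps_before_exhaustion(startup: float, per_step: float) -> int:
    """Steps that fit in the window after paying the startup tax."""
    return int((CONTEXT_WINDOW - startup) // per_step)

for name, (startup, per_step) in tools.items():
    print(f"{name}: ~{steps_before_exhaustion(startup, per_step)} steps")
```

Even with generous rounding, the MCP servers exhaust the window within a few dozen steps, while the CLI leaves room for hundreds. That's what the 94% figure means in practice.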

How agent-browser Works

The architecture has two main layers. A Rust CLI handles argument parsing in sub-millisecond time. Behind it runs a native Rust daemon that communicates with Chrome directly via CDP (Chrome DevTools Protocol), no Node.js or Playwright required. The daemon stays warm between commands via Unix domain sockets, eliminating browser startup costs. You can also swap Chrome for Lightpanda, a Zig-based headless browser that claims roughly 10x faster performance and 10x less memory in its own benchmarks.

Rust CLI: Native binary. Argument parsing in sub-millisecond time. Communicates with the daemon via Unix domain sockets (TCP on Windows).

Native Rust Daemon: Long-running process talking to Chrome via CDP directly. First command takes ~500ms to spawn; every command after that completes in sub-100ms. No Node.js or Playwright needed.

Browser (CDP): Chrome via the Chrome DevTools Protocol. Also supports Lightpanda, remote Chrome instances, and cloud browsers like Browserbase.

The first command in a session takes about 500ms to start the daemon and launch the browser. After that, each command completes in sub-100ms. Compare that to Playwright MCP, where every action involves a full JSON-RPC round-trip with tool schemas, parameters, and verbose response objects.
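Because the interface is just a process writing to stdout, an agent harness needs nothing more than a subprocess call. A minimal sketch; the subcommand names come from this article, but the wrapper itself is hypothetical, not part of agent-browser:

```python
import subprocess

def agent_browser(*args: str, runner=subprocess.run) -> str:
    """Run an agent-browser subcommand and return its stdout.

    `runner` is injectable so this can be exercised without the binary
    installed. An illustrative harness, not an official API.
    """
    result = runner(
        ["agent-browser", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# A harness would then drive the browser with plain strings:
#   tree = agent_browser("snapshot", "-i")  # compact accessibility tree
#   agent_browser("click", "@e1")           # click by stable ref
```

Contrast this with MCP, where the same interaction requires a JSON-RPC session, tool schemas, and structured response objects before any text reaches the model.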

Snapshots and Refs: The Core Innovation

The design choice that makes everything work is the snapshot + refs system. Instead of CSS selectors, XPath, or full DOM dumps, agent-browser captures the page's accessibility tree, filters it to interactive elements, and assigns each one a stable reference.

```
$ agent-browser snapshot -i
button "Sign In" [ref=e1]
textbox "Email" [ref=e2]
textbox "Password" [ref=e3]
link "Forgot password?" [ref=e4]
link "Create account" [ref=e5]
```

Five elements, five refs. The agent runs agent-browser click @e1 to click Sign In, or agent-browser fill @e2 "user@example.com" to type an email address. No CSS selectors. No pixel coordinates. No vision model parsing a screenshot. Refs resolve by matching accessibility role and name via CDP's Accessibility API, conceptually the same approach as Playwright's getByRole.
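The snapshot format is also trivially machine-parseable, which is the point: an agent, or a plain script, can turn it into a ref table with one regex. A sketch assuming the line format shown above (real output may vary):

```python
import re

# Matches lines like:  button "Sign In" [ref=e1]
LINE = re.compile(r'^(?P<role>\w+) "(?P<name>[^"]*)" \[ref=(?P<ref>e\d+)\]$')

def parse_snapshot(text: str) -> dict:
    """Map each ref to its (role, accessible name) pair."""
    refs = {}
    for raw in text.strip().splitlines():
        m = LINE.match(raw.strip())
        if m:
            refs[m["ref"]] = (m["role"], m["name"])
    return refs

snapshot = '''button "Sign In" [ref=e1]
textbox "Email" [ref=e2]
link "Forgot password?" [ref=e4]'''

print(parse_snapshot(snapshot)["e1"])  # ('button', 'Sign In')
```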

For comparison, here's what Playwright MCP returns for a similar page:

```
# Playwright MCP browser_snapshot (abridged)
- heading "Welcome back" [level=2]
  - paragraph: "Sign in to your account"
- form:
  - textbox "Email" [required] (ref=s1e4)
  - textbox "Password" [required] (ref=s1e5)
  - button "Sign in" (ref=s1e6)
- navigation:
  - link "Forgot password?" (ref=s1e7)
  - link "Create account" (ref=s1e8)
  - link "Terms of Service" (ref=s1e9)
  - link "Privacy Policy" (ref=s1e10)
  ... (continues for entire page)
```

Playwright MCP dumps the full accessibility tree. Every heading, paragraph, landmark, and decoration on the page. agent-browser's -i flag (interactive only) strips it to just the elements you can actually click, type into, or toggle. In Engin Diri's testing, a homepage snapshot was ~280 characters with agent-browser versus ~8,247 with Playwright MCP. Your numbers will vary by page complexity, but the pattern holds.

- agent-browser homepage snapshot: ~280 chars
- Playwright MCP homepage snapshot: ~8,247 chars
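The filtering itself is conceptually simple. A sketch of the -i idea, with an illustrative (not exhaustive) set of interactive roles; the real tool's filter is richer:

```python
# Keep only nodes an agent can act on, and assign sequential refs.
INTERACTIVE = {"button", "textbox", "link", "checkbox", "combobox"}

def interactive_only(tree):
    """Reduce a full accessibility tree to compact ref lines."""
    lines, n = [], 0
    for node in tree:
        if node["role"] in INTERACTIVE:
            n += 1
            lines.append(f'{node["role"]} "{node["name"]}" [ref=e{n}]')
    return lines

full_tree = [
    {"role": "heading",   "name": "Welcome back"},
    {"role": "paragraph", "name": "Sign in to your account"},
    {"role": "textbox",   "name": "Email"},
    {"role": "textbox",   "name": "Password"},
    {"role": "button",    "name": "Sign in"},
]

for line in interactive_only(full_tree):
    print(line)
```

Five nodes in, three ref lines out: the headings and paragraphs that dominate a full-tree dump simply never reach the model.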

The Accessibility Tree, Again

If you've read our piece on how AI agents see your website, this should feel familiar. The accessibility tree is the browser-generated, simplified view of your page: roles, names, states, descriptions. Originally built for screen readers, it is now the primary interface for AI agents across every major framework.

agent-browser makes this connection explicit. The snapshot command is an accessibility tree query. When an agent calls agent-browser snapshot -i, it gets the same tree that screen readers consume and that Playwright tests query with getByRole. The only difference is the output format: compact text instead of YAML.

The implications for website owners haven't changed. If your button is a <div onclick> instead of a <button>, it won't appear in the snapshot. If your form fields lack labels, the agent sees textbox "" and has no idea what to type. If your navigation is built from non-semantic divs, the agent can't find your pages. Same rules as before, but now with a tool that has 22,000+ developers paying attention.

The annotated screenshot feature makes this even more tangible. Running agent-browser screenshot --annotate overlays numbered labels on every interactive element, each label mapped to a ref. It's a visual debugger for your accessibility tree. Multimodal AI models can reason about layout while still using the same deterministic refs for interaction. Text-based and visual approaches finally converge on the same element identifiers.

Less Is More: The Counterintuitive Data

The design philosophy behind agent-browser echoes findings from Vercel's own D0 text-to-SQL agent research. Both projects share the same core insight: less tooling, better reasoning. The D0 results are genuinely surprising. The team tested two architectures: one with 17 specialized tools, and one with just 2 general-purpose tools.

The 17-tool version: 80% success rate, 274.8 seconds execution time, ~102,000 tokens consumed. The 2-tool version: 100% success rate, 77.4 seconds, ~61,000 tokens. Fewer tools, higher success, faster execution, lower cost. Every single metric improved.

As Vercel's D0 team put it: "We were constraining reasoning because we didn't trust the model to reason." Giving agents 17 specific tools forced them to pick the "right" one at each step. Giving them 2 flexible tools let them figure out the best approach themselves. agent-browser applies the same principle to the browser: minimal output, minimal tool surface, maximum reasoning headroom.

| Metric | 17 Tools | 2 Tools | Improvement |
| --- | --- | --- | --- |
| Success rate | 80% | 100% | +25% |
| Execution time | 274.8s | 77.4s | 3.5x faster |
| Token consumption | ~102,000 | ~61,000 | ~40% fewer |
| Steps required | Baseline | 42% fewer | Simpler paths |

This aligns with Andrej Karpathy's argument that CLIs are the ideal agent interface. CLIs are inherently simple: flags, stdin, stdout. No tool schema negotiation, no capability discovery dance. The agent runs a command and reads the output. agent-browser took that philosophy and applied it to the browser: 50+ commands, but each one does exactly one thing and returns the minimum output needed.

What This Means for Your Website

agent-browser is gaining adoption fast. It works with Claude Code, Cursor, GitHub Copilot, OpenAI Codex, Google Gemini, and any other agent that can run shell commands. No MCP configuration needed. Just npm install -g agent-browser. That means more AI agents are going to browse your site via the accessibility tree, whether you optimized for it or not.

The same patterns that make your site work with Playwright MCP apply here. Semantic HTML, proper form labels, accessible names on buttons and links, a clean heading hierarchy. But agent-browser makes the feedback loop even tighter. When an agent runs snapshot -i on your homepage and gets two refs instead of twelve, you know exactly where the problem is.

Agents can see it

Semantic buttons, labeled form fields, proper nav landmarks, descriptive link text, ARIA where HTML falls short. All of this shows up in snapshot output and gets a ref.

Agents are blind to it

Div-soup with onclick handlers, icon buttons without labels, placeholder-only inputs, links that say 'click here', JavaScript-only rendering without SSR. None of this gets a ref.

Our scanner measures exactly these signals. Every checkpoint in the Content & Semantics category maps directly to what agent-browser can parse: semantic HTML (3.3), heading hierarchy (3.2), alt text (3.5), ARIA usage (3.4), form labels (4.6), descriptive links (3.7), and SSR detection (3.1). A high score on those checkpoints correlates directly with a richer accessibility tree, and therefore more refs for agent-browser to work with.

agent-browser ships with an AGENTS.md file and a skills/ directory that AI coding agents consume directly to learn its command set. It also supports multi-session isolation with separate sockets, cookies, and ref caches per session, so agents can operate in parallel without interfering with each other. With multiple releases per day and hundreds of open issues, this project is moving fast. And it's pulling the browser automation space toward a clear conclusion: the accessibility tree is the interface, and simplicity beats feature count.
