Browsers that AI agents can drive

Published: at 06:17 AM

The browser is the last mile of the internet. Nearly everything useful lives behind one: filing insurance claims, pulling reports from vendor portals, checking inventory on a supplier’s website that hasn’t been updated since 2009. For years we automated these tasks with brittle scripts built on Selenium, Puppeteer, or Playwright that broke the moment a site renamed a class or restructured its markup.

A new category of tools has emerged that takes a different approach. Instead of writing CSS selectors that break whenever a site changes, you give an AI agent a browser and tell it what to do in plain language. The agent reads the page, reasons about it, and acts. Some of these tools are infrastructure. Some are frameworks. One is an entirely new browser. Together they form an ecosystem worth understanding.

What you can build with them

Before diving into specific tools, here’s what this category makes possible.

Data extraction at scale. Plenty of useful data lives on websites without APIs. Real estate listings, government filings, product catalogs, academic citations. An agent with a browser can navigate pagination, handle login walls, and extract structured data from pages that were never designed to be machine-readable.
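
The pagination part of that job can be sketched in a few lines. This is a minimal illustration, not any tool's API: `fetch_page` is a hypothetical stand-in for whatever the agent does to load and parse one page, and the `items` and `next` keys are assumed names for the extracted records and the next-page link.

```python
from typing import Callable, Dict, List, Optional


def collect_all(
    fetch_page: Callable[[str], Dict],
    start_url: str,
    max_pages: int = 100,
) -> List[Dict]:
    """Follow 'next' links from start_url, gathering records from every page."""
    records: List[Dict] = []
    seen = set()
    url: Optional[str] = start_url
    # Stop when there is no next link, when a link loops back, or at the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch_page(url)  # hypothetical: the agent loads and parses the page
        records.extend(page["items"])
        url = page.get("next")
    return records
```

The `seen` set and the `max_pages` cap are there because real sites sometimes link pagination in a circle; an agent that follows "next" blindly can loop forever.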

Form-filling workflows. Insurance claims. Benefits applications. Procurement requests. Compliance filings. Today a human copies data from one system, opens a browser, navigates to a portal, and types it into a form. An agent can do this end-to-end.

Testing and QA. Traditional end-to-end tests break when the UI changes. An agent-driven test can be written as “log in, add an item to the cart, check out, and verify the confirmation page” — and it keeps working even when the checkout flow gets redesigned.

Competitive intelligence. Monitor competitor pricing, track product launches, watch for regulatory filings. These tasks happen on a schedule and span multiple websites. Pair a browser agent with a cron job and a database, and you have a monitoring system that would have taken a team of analysts to maintain.
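
As a rough sketch of the database half of such a monitor, assuming SQLite for storage: `fetch_price` is a hypothetical callable standing in for the browser agent, and the table and function names are illustrative. A cron job would call `run_once` on a schedule.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def record_price(conn: sqlite3.Connection, url: str, price: float) -> Optional[float]:
    """Store a price observation; return the previous price if it changed, else None."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT, price REAL)"
    )
    row = conn.execute(
        "SELECT price FROM prices WHERE url = ? ORDER BY id DESC LIMIT 1", (url,)
    ).fetchone()
    conn.execute("INSERT INTO prices (url, price) VALUES (?, ?)", (url, price))
    conn.commit()
    if row is not None and row[0] != price:
        return row[0]
    return None


def run_once(
    conn: sqlite3.Connection,
    urls: List[str],
    fetch_price: Callable[[str], float],
) -> List[Tuple[str, float, float]]:
    """One scheduled run: fetch each page's price and collect (url, old, new) changes."""
    changes = []
    for url in urls:
        new = fetch_price(url)  # hypothetical: the browser agent does this part
        old = record_price(conn, url, new)
        if old is not None:
            changes.append((url, old, new))
    return changes
```

Keeping every observation, rather than overwriting the last one, means the same table doubles as a price history.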

Keyword monitoring. Google Alerts tracks mentions across what Google indexes. Paid tools like Mention, Brand24, Brandwatch, Sprout Social, and Syften go deeper into social media and forums. But all of them have blind spots — pages they can’t reach, communities they don’t cover. A browser agent can fill those gaps, scanning Reddit threads, Hacker News, niche forums, review sites, and any other corner of the web on a schedule. No API or RSS feed required.

Lead and data enrichment. Sales teams use waterfall enrichment tools like Clay, Clearbit, and Apollo to run a lead through multiple data providers in sequence until they get a match. A browser agent can automate the manual version of this — pull a LinkedIn profile, check a company website, cross-reference a directory, fill in the gaps. Especially powerful for sources that don’t have APIs or charge heavily for access.
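
The waterfall pattern itself is simple: try each source in order and stop at the first match. A minimal sketch, with hypothetical providers; the function names and the returned fields are illustrative and do not reflect any vendor's API.

```python
from typing import Callable, Dict, Iterable, Optional

Provider = Callable[[str], Optional[Dict[str, str]]]


def waterfall_enrich(email: str, providers: Iterable[Provider]) -> Optional[Dict[str, str]]:
    """Try each provider in order and return the first non-empty match."""
    for lookup in providers:
        result = lookup(email)
        if result:  # stop at the first provider that returns data
            return result
    return None


# Hypothetical providers: cheap sources first, the browser agent as the last resort.
def directory_lookup(email: str) -> Optional[Dict[str, str]]:
    return None  # simulated miss


def browser_agent_lookup(email: str) -> Optional[Dict[str, str]]:
    return {"email": email, "company": "Example Corp", "source": "browser_agent"}
```

Ordering matters here: putting the slowest or most expensive source last means it only runs for leads that the cheaper sources could not resolve.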

Multi-site account management. Change your business hours and you’re logging into Google Business Profile, Yelp, Apple Maps, Bing Places, and a half-dozen industry directories to update each one. At month-end, you’re pulling invoices and receipts from Stripe, AWS, Google Workspace, Slack, and every other SaaS tool your company pays for. A browser agent can log into each platform, perform the task, and aggregate the results. Freelancers juggling profiles across Upwork, Fiverr, and LinkedIn have the same problem with fewer resources — exactly the people who benefit most from automation.

Research agents. Give an agent a question and let it search the web, follow links, read pages, and synthesize what it finds. This goes beyond simple search — the agent can navigate complex sites, drill into subpages, and cross-reference sources.

RFP automation. In 2023, Wired reported that Twilio, Google Cloud, IBM, and DataRobot had each built custom internal tools to automate responses to requests for proposals: AI that digests an RFP, searches internal documents, and drafts answers. Each required dedicated engineering teams and months of work. Today, a browser agent could monitor procurement portals and government bidding sites for new RFPs, download them, and feed them into an AI pipeline for response. What took enterprise teams months to build in 2023 can be assembled from off-the-shelf tools in 2026.

Automated job applications. Build a web app where a user uploads their resume, tailors it to a role, and watches an agent submit it to job boards in real time — all inside your interface. The agent navigates each site, fills out the application forms, and handles the variations between Indeed, LinkedIn, and company career pages. Tools like Browserbase, Browser Use, Steel, Kernel, Skyvern, and Notte let you embed the controlled browser directly in your app, so the user can see every step and take over if something needs human input.

If any of these problems sound familiar, read on.

The foundations

Before the ecosystem of specialized tools, there were the foundation models themselves. These deserve mention because everything else builds on them — or competes with them.

Anthropic Computer Use is the broadest capability: Claude can perceive screenshots, move a cursor, click, and type across any application — browsers, desktop apps, terminals. It’s available via the Anthropic API, Amazon Bedrock, and Google Vertex AI. Many of the tools on this list use Computer Use under the hood.

OpenAI Operator takes a similar approach through its Computer-Using Agent (CUA), which combines GPT-4o’s vision with reinforcement learning. It runs in a sandboxed remote browser and is integrated directly into ChatGPT.

Amazon Nova Act is AWS’s entry, built for reliability over autonomy. It takes a hybrid approach: AI handles understanding and decisions, Playwright handles deterministic actions like password entry. Full AWS integration means IAM for credentials, S3 for storage, and Bedrock for orchestration.

How do you compare these? The industry uses standardized tests. WebVoyager gives an agent real tasks on real websites — book a flight, find a restaurant, fill out a form — and measures how often it succeeds. WebArena does the same but in a controlled environment with self-hosted web apps. ScreenSpot tests whether an agent can accurately identify and click the right element on a page. These benchmarks matter because they measure what users actually care about: can the agent finish the job? Operator scores 87% on WebVoyager. Nova Act scores 94% on ScreenSpot. The numbers shift with every model release, but they give you a rough sense of how reliable each option is today.
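
For task-completion benchmarks like WebVoyager, a headline score boils down to the share of tasks the agent finished. A minimal sketch of that arithmetic, assuming each task is graded pass or fail (actual benchmark harnesses differ in how they judge success):

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of tasks the agent completed successfully."""
    results = list(outcomes)
    if not results:
        raise ValueError("no task outcomes to score")
    return 100.0 * sum(results) / len(results)
```

A score of 87% on this kind of measure means roughly one task in eight still fails, which matters when you chain many tasks into one workflow.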

These models decide what to do. The tools below give them a browser to do it in.

The landscape

Playwright MCP

Playwright MCP is Microsoft’s official MCP server for Playwright, and it has become the default way most coding agents interact with browsers. With roughly 30,000 GitHub stars, it’s the most widely integrated browser MCP server available.

It operates in two modes. The default reads pages through the browser’s accessibility tree — the same structure that screen readers use — which is fast, cheap, and needs no vision model. But it can also take screenshots, letting the agent see the page visually. Claude Code uses both: accessibility snapshots for structured interaction, screenshots when it needs to see layout, styling, or visual bugs.
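
To make the accessibility-tree idea concrete, here is a simplified sketch of turning a nested tree into a text snapshot with element references an agent can target. The node shape (`role`, `name`, `children`) and the `ref` numbering are assumptions for illustration, not Playwright MCP's exact output format.

```python
from itertools import count
from typing import Dict, Iterator, List, Optional


def snapshot(node: Dict, depth: int = 0, ids: Optional[Iterator[int]] = None) -> List[str]:
    """Flatten a nested accessibility tree into indented text lines with element refs."""
    ids = ids if ids is not None else count(1)
    indent = "  " * depth
    role = node["role"]
    name = node.get("name", "")
    ref = next(ids)  # each element gets a stable handle the agent can act on
    lines = [f'{indent}- {role} "{name}" [ref=e{ref}]']
    for child in node.get("children", []):
        lines.extend(snapshot(child, depth + 1, ids))
    return lines


# Hypothetical page: a form with a text box and a submit button.
page = {
    "role": "form",
    "name": "Login",
    "children": [
        {"role": "textbox", "name": "Email"},
        {"role": "button", "name": "Sign in"},
    ],
}
```

A few lines of text like this are far cheaper for a model to read than a screenshot, which is why the text mode is the default.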

If you use Claude Code for web projects, Playwright MCP is essential — pair it with Superpowers and Claude Code becomes a full-stack agent that can read, write, and see your application. Point it at a local dev server to debug rendering issues. Have it navigate a site and extract data. Use it to verify that a deploy looks right. It turns the agent from a code-only assistant into one that can see and interact with what the code produces.

It works with Claude Code, VS Code, Cursor, Windsurf, Gemini CLI, and most other MCP clients. You can run it headless (no visible window; the browser runs in the background) or headed (a real browser window opens on your screen). In headed mode you watch the agent click, scroll, and type in real time: cursor moving, pages loading, forms filling themselves out. It’s one of those moments where the abstraction drops away and you see what “AI agent” actually means. You can also connect it to a persistent browser profile, attach it to an already-running Chrome instance via a browser extension, or have it generate Playwright test code from the agent’s actions, which is useful for turning exploratory automation into repeatable scripts.

Browser Use

Browser Use is the most popular open-source option, with over 84,000 GitHub stars and a $17M seed round from Felicis. The open-source library is Python (pip install browser-use), but the cloud platform has a TypeScript SDK (npm install browser-use-sdk) and a REST API you can call with curl or any language. Hand it a task in natural language — something like “go to Hacker News, find the top 5 posts about AI, and give me the titles and URLs.” It browses, extracts data, and returns structured results.

What sets it apart: breadth. The open-source library handles simple automation. The cloud platform adds stealth browsers with CAPTCHA solving, residential proxies in 195+ countries, and fingerprint spoofing. They’ve also trained custom LLMs specifically for browser tasks, which tend to be cheaper and faster than general-purpose models. Their Skill API lets you turn any website interaction into a reusable API endpoint — define the workflow once, call it forever. The cloud platform also returns a live URL for each session — embed it as an iframe to let users watch the agent work inside your app, with theming and the option to hide browser chrome. Strong for data extraction and competitive monitoring at scale.

Browserbase

Browserbase doesn’t automate browsers. It is browsers — cloud-hosted, serverless, spun up by the thousand. Think of it as the infrastructure layer that other tools build on top of.

If you already use Playwright or Puppeteer, you connect to Browserbase through the same APIs you know. Each browser runs in an isolated instance with its own resources, and the platform is SOC 2 and HIPAA compliant. The Live View feature lets a human watch what the agent is doing in real time and take over if needed. It’s also embeddable as an iframe in your own web app, so you can build products where end users see the agent working inside your interface. Browser Use, Kernel, Steel, Skyvern, and Notte all offer similar live-view capabilities.

Stagehand

Stagehand is a developer SDK that Browserbase built on top of its own infrastructure. You write code that describes what you want, and the AI figures out how to do it in the browser at runtime. It boils browser automation down to three methods: act(), extract(), and agent.execute().

In traditional Playwright, you’d write something like page.locator('#price-element').textContent() — which breaks the moment that element ID changes. In Stagehand, you write:

await stagehand.extract("the price of the first item");

The AI reads the page, finds the price regardless of how the HTML is structured, and returns it. act() works the same way for interactions — await stagehand.act("add the first item to the cart") clicks the right button without you specifying which one. agent.execute() chains these into multi-step tasks: hand it a research question and it will search, follow links, and synthesize what it finds.

For developers who want to write code that’s smarter than brittle selectors but more controlled than handing everything to a fully autonomous agent, this is a good place to start.

Kernel

Kernel is browser infrastructure that emphasizes speed. Backed by Y Combinator, trusted by teams at Cash App and Framer, it handles over 880,000 browser sessions per month for some customers.

Two things distinguish it. First, Kernel works directly with Anthropic on evaluating computer-use models — which means their infrastructure is shaped by cutting-edge agent capabilities. Second, they’ve partnered with 1Password to build authentication for agents, solving one of the thorniest problems in browser automation: how does a bot log in safely? This makes Kernel a natural fit for multi-site account management, where an agent needs to log into dozens of platforms and pull invoices or reports. Like Browserbase, Kernel offers embeddable live view — drop the session URL into an iframe, optionally in read-only or kiosk mode — plus session replays for post-hoc review.

Skyvern

Skyvern uses computer vision instead of DOM selectors. Rather than parsing HTML to find a button, it looks at the page the way a person would — visually. This makes it resilient to redesigns. Vision-based agents keep working because they recognize the purpose of elements, not their position in the markup.

It ships with a no-code workflow builder, which means non-engineers can create automations by describing what they want. You can define a single workflow and apply it across different websites — write one “update store hours” automation and run it against Google Business Profile, Yelp, and Apple Maps even though each site looks completely different. The vision-based approach also makes it well suited for testing and QA, where UI changes routinely break traditional test suites. Skyvern also offers livestreaming and observability — watch the browser in real time, replay session recordings, and inspect step-by-step action logs with screenshots.

Lux

Lux is different from everything else on this list. It’s not a browser, not infrastructure, not a framework — it’s an AI model, built specifically for controlling computers. Where Claude, GPT, and Gemini are general-purpose models that can drive a browser, Lux was trained from the ground up to do nothing else. Developed by researchers from MIT, CMU, and UIUC, it claims the best score on the Online-Mind2Web benchmark.

Why does this matter? Cost and speed. Lux runs at roughly one second per step and costs about 10x less than using a frontier model for the same task. You’d pair it with one of the browser infrastructure tools on this list — say, Browserbase or Kernel — and use Lux as the model that decides what to click, what to type, and where to navigate. If you’re running thousands of browser sessions for data extraction or form-filling, the savings add up fast. Think of it as swapping out an expensive general contractor for a specialist who only does the one job you need.

CloudCruise

CloudCruise is the only vertical product on this list. It builds browser automation exclusively for healthcare — specifically for payer portals and electronic health record systems.

Anyone in healthcare IT knows the pain. Insurance portals are slow, arcane, and change without warning. CloudCruise’s automations are self-healing: when a portal redesigns, the agent adapts. For revenue cycle teams spending hours on form-filling and data entry across dozens of payer websites, it pays for itself quickly.

Agent Browser Protocol

Agent Browser Protocol (ABP) solves the same problem as Playwright MCP — giving an agent a browser to control — but takes a fundamentally different approach. Playwright MCP wraps a standard browser from the outside, communicating through Chrome DevTools Protocol. ABP is a fork of Chromium itself, with the agent protocol built directly into the browser engine. No translation layer, no external coordination.

The key insight is what they call the “step machine.” Each API call is one atomic step: inject input, wait for the page to settle, capture a screenshot, return an event log, then freeze JavaScript until the next step. This eliminates the race conditions that plague wrapper-based approaches. The agent never has to wonder whether the page has finished loading.
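
The step sequence described above can be sketched against a simulated page. Everything here is hypothetical (the `SimulatedPage` class, the string "screenshot", the event names); it only illustrates the ordering of one atomic step, not ABP's actual interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepResult:
    screenshot: str
    events: List[str]


@dataclass
class SimulatedPage:
    """Stand-in for a browser page whose scripts only run while unfrozen."""
    frozen: bool = True
    pending: List[str] = field(default_factory=list)
    state: str = "initial"

    def settle(self) -> List[str]:
        """Run queued work to completion and return what happened."""
        events, self.pending = self.pending, []
        if events:
            self.state = events[-1]
        return events


def step(page: SimulatedPage, action: str) -> StepResult:
    """One atomic step: unfreeze, inject input, settle, capture, freeze again."""
    page.frozen = False
    page.pending.append(action)            # inject input
    events = page.settle()                 # wait for the page to settle
    shot = f"screenshot-of:{page.state}"   # capture what the agent will see
    page.frozen = True                     # nothing changes until the next step
    return StepResult(screenshot=shot, events=events)
```

Because the page is frozen between calls, the screenshot the agent reasons about is still accurate when its next action arrives.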

Because the protocol lives inside the engine, ABP uses about half the tokens, runs about twice as fast, and needs about half the tool calls of Playwright MCP. The tradeoff: you’re running a custom browser instead of stock Chromium. Setup is a single command: npx -y agent-browser-protocol --mcp. It works out of the box with Claude Code, Codex CLI, and any MCP client.

Steel

Steel is the open-source alternative to Browserbase — cloud browser infrastructure you can self-host. It has over 6,400 GitHub stars and deploys with a single click on Railway or via Docker.

It handles the same problems as Browserbase: session management, anti-bot protection, CAPTCHA solving, proxy rotation. But you own the deployment. Sessions start in under a second and can persist for up to 24 hours with saved cookies, extensions, and credentials. Like Browserbase and Browser Use, Steel also supports embeddable live sessions — you can stream the browser via WebRTC and let users interact directly through clicks, scrolling, and form input. If you need cloud browser infrastructure but want full control over the deployment, Steel is the answer.

Notte

Notte is a YC-backed platform from Switzerland that takes a different angle on cost. Its core idea is a “perception layer” that converts web pages into structured natural-language descriptions before the AI ever sees them. This means the model processes clean, pre-digested text instead of raw DOM or screenshots.

The practical effect is that you can script the deterministic parts of a workflow and only invoke the AI when judgment is needed — which Notte claims cuts costs by 50% or more. It’s compatible with Playwright, Puppeteer, Selenium, Browser Use, and Stagehand, so it layers on top of tools you might already use. Notte also supports live view and session replays — watch sessions as they execute, share debug URLs with teammates, or review recordings after the fact.
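
The script-first, AI-when-needed split can be sketched as a small dispatcher. This is an illustration of the pattern only: the step kinds and the `script` and `ai` callables are hypothetical and are not Notte's API.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (kind, instruction) where kind is "script" or "ai"


def run_workflow(
    steps: List[Step],
    script: Callable[[str], str],
    ai: Callable[[str], str],
) -> Dict[str, int]:
    """Run scripted steps directly and call the model only for 'ai' steps."""
    calls = {"script": 0, "ai": 0}
    for kind, instruction in steps:
        if kind == "script":
            script(instruction)   # deterministic: no model call, no token cost
        elif kind == "ai":
            ai(instruction)       # judgment needed: hand the step to the model
        else:
            raise ValueError(f"unknown step kind: {kind}")
        calls[kind] += 1
    return calls
```

In a workflow like "open the portal, log in, pick the most recent invoice, download it", only the third step needs judgment, so only one of the four steps pays for a model call.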

Crawl4AI

Crawl4AI is not a browser agent in the traditional sense — it’s a web crawler optimized for feeding content to LLMs. But with 51,000+ GitHub stars and an agentic crawler mode, it belongs in this conversation.

Where browser agents excel at doing things on websites (clicking, filling forms, navigating), Crawl4AI excels at reading them. It converts web content into clean Markdown, flattens shadow DOMs, strips consent popups, and handles anti-bot detection with automatic proxy escalation. It’s 4x faster than comparable tools and completely free under the Apache 2.0 license. If your agent needs to understand the web rather than interact with it, start here.

Firecrawl occupies similar territory but as a commercial API. You send it a URL, it returns clean Markdown — no infrastructure to manage. It can crawl entire sites (sitemaps, pagination, 10,000+ pages) with a single API call. Where it edges beyond pure crawling is its /interact endpoint, which handles clicking, form-filling, and dynamic content extraction. It also ships an MCP server and an agent endpoint for autonomous multi-step research. Think of Crawl4AI as the self-hosted library and Firecrawl as the managed service.

The consumer browsers

A parallel trend worth noting: full browsers with agents built in. ChatGPT Atlas is OpenAI’s Chromium-based browser with an Agent Mode that can browse, click, and fill forms on your behalf using your logged-in sessions. Perplexity Comet takes a search-first approach, with an AI assistant that reads pages across tabs and can complete purchases. These are consumer products, not developer tools. Whether AI-in-the-browser sticks or turns out to be another feature shoved in to see what happens remains to be seen.

The email gap

One task that browser agents handle poorly is email. Automating a workflow that involves sending a confirmation email, waiting for a reply, and acting on the response requires a different kind of infrastructure.

AgentMail fills this gap. It gives AI agents their own email inboxes — complete with sending, receiving, threads, attachments, and webhooks. The free tier includes 3 inboxes and 3,000 emails per month. It supports DKIM, SPF, and DMARC for deliverability, and offers an MCP server so it plugs into the same agent tooling ecosystem as the browser products.

If you’re building a workflow that involves both browsing and email — scraping invoice data from a vendor portal and emailing a summary to accounting — you’d combine a browser tool with AgentMail.

How to choose

The right tool depends on your problem.

If you want a ready-made solution, Browser Use’s cloud platform or Skyvern’s no-code builder will get you running without writing code.

If you want a developer framework, Stagehand gives you the cleanest API. Three methods, natural language, built on Playwright.

If you need cloud browser infrastructure to run agents at scale, Browserbase and Kernel are the managed options — Browserbase for observability, Kernel for speed. Steel gives you the same capabilities self-hosted.

If you care about cost and speed at the model layer, Lux offers a purpose-built alternative to using frontier models for browser tasks.

If you want the standard MCP approach, Playwright MCP is the most widely supported and costs nothing to run.

If you want maximum control and don’t mind running your own browser, Agent Browser Protocol gives you a deterministic, race-condition-free execution environment that nothing else can match.

If you mostly need to read and extract web content rather than interact with it, Crawl4AI is fast, free, and purpose-built for the job.

And if you work in healthcare, CloudCruise has already solved your specific problem.

Where this is heading

The ecosystem is layering fast. Infrastructure first (Browserbase, Kernel, Steel). Then frameworks (Stagehand, Browser Use, Playwright MCP). Then specialized models (Lux). Then vertical applications (CloudCruise). Then consumer products (Atlas, Comet).

Google is already working on the next step. WebMCP is a proposed web standard, available in Chrome Canary behind a flag, that lets websites declare their capabilities directly to agents. Instead of an agent figuring out how to navigate a site, the site tells the agent what it can do. If adopted broadly, it would make much of the current scraping-and-clicking approach unnecessary — at least for participating sites.

Meanwhile, browser control is becoming a standard feature in general-purpose AI agents. OpenClaw, which crossed 100,000 GitHub stars in early 2026, is a personal AI assistant that can drive a browser as just one of its many skills — alongside shell commands, APIs, iMessage, and file management. It’s not a browser tool. It’s an everything tool that happens to need a browser. That’s telling.

The browser was built for humans. Now agents are learning to use it on our behalf — cutting through the cookie banners, the paginated results, the forms designed for patience we don’t have.

But there’s an uncomfortable question underneath all of this. The web runs on ads. Agents that bypass pages to extract answers also bypass the revenue that pays for those pages. This isn’t hypothetical — major tech publications have lost 58% of their organic traffic since 2024, with some like Digital Trends and ZDNet down over 90%. Agents that never load an ad take that further. If the content creators don’t get paid, the content dries up and the agents have nothing to read.

Two kinds of payment infrastructure are emerging to address this — and they solve different problems.

The first is agent commerce: agents buying things on behalf of humans. Google’s Agent Payments Protocol (AP2) launched with 60+ partners including Mastercard, PayPal, and Walmart. OpenAI and Stripe built ACP, which powers instant checkout in ChatGPT. These protocols let an agent book a flight or purchase a product with your authorization.

The second is machine-to-machine payments: an agent paying a small fee to access an API, a webpage, or a dataset. x402 — governed by Coinbase and Cloudflare — uses the HTTP 402 “Payment Required” status code to enable stablecoin micropayments baked into web requests. Cloudflare’s Agents SDK provides tooling for developers to integrate it. Skyfire goes further with agent identity verification — a “Know Your Agent” system that lets AI agents establish trust and make payments without human intermediaries.
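
The general shape of a pay-per-request flow built on HTTP 402 can be sketched with a simulated server. The header names (`price`, `payment`), the `pay` callable, and the receipt format are all hypothetical; the real x402 specification defines its own headers and payloads, which this sketch does not reproduce.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)


def fetch_with_payment(
    request: Callable[[str, Dict[str, str]], Response],
    pay: Callable[[str], str],
    url: str,
    max_price: float,
) -> str:
    """Request a resource; on HTTP 402, pay the quoted price and retry once."""
    status, headers, body = request(url, {})
    if status != 402:
        return body
    price = float(headers["price"])       # hypothetical header name
    if price > max_price:
        raise RuntimeError(f"quoted price {price} exceeds budget {max_price}")
    receipt = pay(headers["price"])       # hypothetical payment call
    status, headers, body = request(url, {"payment": receipt})
    if status != 200:
        raise RuntimeError(f"payment was not accepted (status {status})")
    return body


def fake_server(url: str, headers: Dict[str, str]) -> Response:
    """Simulated publisher: serves the page only when a valid receipt is attached."""
    if headers.get("payment") == "receipt-for-0.001":
        return 200, {}, "article text"
    return 402, {"price": "0.001"}, ""
```

The `max_price` check is the important design choice: an agent that pays whatever a server quotes needs a budget enforced on the client side.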

The machine payments side is more relevant to the content question. If an agent could pay a fraction of a cent per page instead of loading ads, publishers would have a new revenue stream. But micropayments on the web have been a hard sell for decades, and early signs suggest demand for x402 isn’t there yet. The honest answer is that nobody has figured this part out, and the tools are shipping faster than the business models.