Autonomous browser QA with browser-use

Install a skill or read the field note below to see how we apply this pattern in real Claude Code projects.

Verified 2 months ago · 28 min

Install the workflow

Manual QA catches issues, but it does not scale when UI changes land every day. Run this command to install a skill and start from a working baseline instead of rebuilding the setup from scratch.

$ npx frenxt-cables add browser-use-qa

Files this command writes (1 file)

.claude/skills/qa/SKILL.md ← artifact/SKILL.md

Set the Browser Use/model keys required by your QA harness, commonly BROWSER_USE_API_KEY and GOOGLE_API_KEY, before running this skill.

Before you run it: BROWSER_USE_API_KEY, GOOGLE_API_KEY

What we tried

We set up browser-use inside a Python virtualenv to keep it isolated from the project's Node dependencies:

python -m venv .venv && source .venv/bin/activate
pip install browser-use
playwright install chromium

Then we invoked the browser QA skill from Claude Code and passed it three user-journey descriptions rather than URLs: the skill takes natural-language flow descriptions as input, not a sitemap. Each description read like a short user story: "Sign in with test account, verify dashboard loads, sign out." We ran in headed mode first (BROWSER_USE_HEADLESS=false) so we could watch what the agent was actually doing and catch obvious prompt issues early. Once the flows were stable, we switched to headless for CI.
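As a rough illustration of what "flow descriptions, not URLs" can look like in code, here is a minimal sketch. The names `Flow`, `FLOWS`, `is_headless`, and `build_task` are hypothetical and not part of the skill or of browser-use. Only the sign-in wording comes from this note; the other two descriptions are placeholders. The way the sketch parses BROWSER_USE_HEADLESS (defaulting to headless when unset) is its own assumption and may differ from how your harness reads it.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    name: str
    description: str  # natural-language user story, not a URL


# Only the sign-in wording is from the field note; the other two
# descriptions are illustrative placeholders.
FLOWS = [
    Flow("sign-in", "Sign in with test account, verify dashboard loads, sign out."),
    Flow("checkout", "Add an item to the cart, proceed to checkout, confirm the order."),
    Flow("settings", "Open settings, change the display name, verify it is saved."),
]


def is_headless(env=None) -> bool:
    """Assumed convention: headless unless the variable is 'false', '0', or 'no'."""
    env = os.environ if env is None else env
    value = env.get("BROWSER_USE_HEADLESS", "true").strip().lower()
    return value not in {"false", "0", "no"}


def build_task(flow: Flow) -> str:
    """Turn a flow into the prompt text handed to whatever runs the agent."""
    return f"QA flow '{flow.name}': {flow.description} Report the first step that fails."
```

Each `build_task(flow)` string would then be passed as the task text to whichever agent runner your harness uses.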

Authentication was the trickiest part. We pre-seeded session cookies by logging in manually, exporting cookies to tests/fixtures/auth.json, and loading them at the start of each run. This meant the agent skipped the login wall and could focus on the flow itself without consuming tokens re-authenticating on every run.
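Pre-seeded cookies go stale, so it can help to check the fixture before spending tokens on a run. The sketch below assumes tests/fixtures/auth.json holds either a bare list of cookies or an object with a `cookies` list, and that each cookie carries an `expires` Unix timestamp in seconds (with a missing or non-positive value meaning a session cookie). That layout matches a Playwright-style storage-state export, but it is an assumption about your export, and `load_cookies` and `expired_cookies` are hypothetical helper names.

```python
import json
import time
from pathlib import Path

AUTH_FIXTURE = Path("tests/fixtures/auth.json")  # path used in this field note


def load_cookies(path=AUTH_FIXTURE):
    """Return the cookie list from the fixture (bare list or {'cookies': [...]})."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    cookies = data["cookies"] if isinstance(data, dict) else data
    if not isinstance(cookies, list):
        raise ValueError(f"{path} does not contain a cookie list")
    return cookies


def expired_cookies(cookies, now=None):
    """Names of cookies whose 'expires' timestamp has passed.

    Assumption: a missing or non-positive 'expires' marks a session cookie,
    which is never reported as expired.
    """
    now = time.time() if now is None else now
    return [c.get("name", "?") for c in cookies if 0 < c.get("expires", -1) <= now]
```

If `expired_cookies(load_cookies())` returns anything, re-export the session before the run rather than letting the agent hit the login wall.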

We ran it against three flows: sign-in, checkout, and settings update.

What happened

Each run produced a structured failure note when something went wrong. A typical note looked like this:

Flow: checkout
Step failed: "Click 'Proceed to payment'"
Reason: Element matched but returned pointer-events: none
Screenshot: runs/2026-04-15/checkout-step-4.png
Reproduction: Load /cart with seed fixture, add item, proceed to checkout

That format made triage fast. We knew the exact step and the CSS property behind the failure, and we had a screenshot to share.
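Because the note has a fixed set of labels, it is easy to turn into a record for dashboards or issue templates. The following is a minimal sketch of a parser for the format shown above; `FailureNote` and `parse_failure_note` are hypothetical names, and the parser assumes one `Label: value` pair per line.

```python
from dataclasses import dataclass

# Maps the labels used in the note to attribute names on the record.
FIELDS = {
    "Flow": "flow",
    "Step failed": "step",
    "Reason": "reason",
    "Screenshot": "screenshot",
    "Reproduction": "reproduction",
}


@dataclass
class FailureNote:
    flow: str
    step: str
    reason: str
    screenshot: str
    reproduction: str


def parse_failure_note(text: str) -> FailureNote:
    """Parse 'Label: value' lines; split on the first colon only."""
    values = {}
    for line in text.strip().splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() in FIELDS:
            # Drop the surrounding double quotes some values carry.
            values[FIELDS[label.strip()]] = value.strip().strip('"')
    missing = sorted(set(FIELDS.values()) - set(values))
    if missing:
        raise ValueError(f"failure note is missing: {', '.join(missing)}")
    return FailureNote(**values)


SAMPLE = """\
Flow: checkout
Step failed: "Click 'Proceed to payment'"
Reason: Element matched but returned pointer-events: none
Screenshot: runs/2026-04-15/checkout-step-4.png
Reproduction: Load /cart with seed fixture, add item, proceed to checkout
"""

note = parse_failure_note(SAMPLE)
```

Splitting on the first colon only keeps values such as "pointer-events: none" intact.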

False positives did appear. The most common cause was timing: the agent would attempt to click a button before a loading spinner had resolved. We saw this on the checkout confirmation screen, which was delayed by roughly 800 ms under test conditions. Retrying the failed step once eliminated most of these. We also saw a false positive on a modal that the agent expected to close but that was instead replaced by a second confirmation dialog; the agent interpreted this as a failure rather than a two-step flow.
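The single retry can be expressed as a small wrapper around whatever callable executes a step. This is a sketch under assumptions: `StepFailed` and `run_with_retry` are hypothetical names, the step is assumed to signal failure by raising, and the one-second pause is simply a value chosen to outlast the roughly 800 ms delay mentioned above.

```python
import time


class StepFailed(Exception):
    """Hypothetical error raised by a step callable when the agent reports a failure."""


def run_with_retry(step, retries=1, delay_s=1.0, sleep=time.sleep):
    """Call step(); on StepFailed, wait delay_s and try again.

    Makes at most retries + 1 attempts. Returns (result, attempts_used) and
    re-raises the last StepFailed if every attempt fails.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return step(), attempt
        except StepFailed:
            if attempt == attempts:
                raise
            sleep(delay_s)
```

Recording `attempts_used` alongside the result makes it possible to see which steps only pass on the second try, which is a useful signal that a wait condition is missing.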

Across three flows, a full run took roughly 4–6 minutes in headed mode and 2–3 minutes headless. Results were consistent run-to-run as long as the seed dataset and auth state were stable. With a flaky network or a cold server the variance was wider.

What we learned

Start with high-value user journeys, not full-site crawls. browser-use is slow relative to a unit test. Save it for the flows where a real user would feel the breakage.
Capture screenshots only on failure. Logging every step produces hundreds of images and slows the run. Failure-only screenshots stay actionable.
Keep one stable seed dataset. Non-deterministic test data (fresh UUIDs, random inventory) is the fastest way to introduce flakiness. One fixture file per flow, committed to the repo.
Run browser-use on staging before deploy, not on every commit. AI-driven sessions cost more (time and tokens) than a Playwright assertion. Use them for pre-deploy checks or scheduled overnight runs, not as a commit gate.
Handle auth state explicitly. Letting the agent log in fresh each run wastes tokens and introduces a second point of failure. Pre-seeded cookies or a test-account session token are both fine. Just make the auth step declarative rather than implicit.
Understand where browser-use ends and Playwright begins. browser-use is AI-driven and takes natural-language flow descriptions; it handles ambiguity well but is slower and less precise. Playwright is code-driven, with explicit selectors and assertions, which makes it faster, cheaper, and deterministic. Use browser-use to discover what to test; write Playwright tests to lock in what you've found.
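To make the last point concrete, here is a sketch of how a finding like the pointer-events failure might be locked in as a deterministic check. It expects a Playwright sync-API `Page` object and uses only `goto`, `wait_for_selector`, `eval_on_selector`, and `click`. The function name, the `text=Proceed to payment` selector, and the `base_url` parameter are assumptions for illustration; a real test would also load the seed fixture and add an item first.

```python
def check_proceed_to_payment(page, base_url: str) -> None:
    """Fail if the checkout button is present but not clickable.

    `page` is assumed to be a playwright.sync_api Page; the selector below is
    a guess at the button's visible text, not taken from a real app.
    """
    button = "text=Proceed to payment"
    page.goto(f"{base_url}/cart")
    page.wait_for_selector(button)
    # Read the computed CSS property that caused the failure in the note.
    pointer_events = page.eval_on_selector(
        button, "el => getComputedStyle(el).pointerEvents"
    )
    assert pointer_events != "none", "checkout button is blocked by pointer-events: none"
    page.click(button)
```

A check like this runs in seconds without a model call, so it can sit in the per-commit suite while the agent-driven run stays on the pre-deploy schedule.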

When this doesn't fit

Unit-testable logic: browser-use adds overhead for logic you can verify with a function call. Use it only for flows requiring real UI interaction.
High-frequency CI runs: AI-driven browser sessions are slower and more expensive than Playwright assertions. Use browser-use for scheduled or pre-deploy checks, not every commit.
Performance testing: browser-use measures functional correctness, not load time or Core Web Vitals. Use Lighthouse or k6 for those.
Deeply nested interaction sequences: flows with more than 8–10 steps tend to accumulate errors. Break complex journeys into smaller checkpoints.
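Breaking a long journey into checkpoints can be as simple as chunking its ordered steps. The sketch below uses a hypothetical helper, `split_into_checkpoints`, with a default limit of 8 taken from the lower end of the range above; it assumes each checkpoint can start from a known state, such as the seed fixture plus pre-seeded auth.

```python
def split_into_checkpoints(steps, max_steps=8):
    """Split an ordered list of step descriptions into consecutive chunks.

    Each chunk has at most max_steps entries and the original order is kept.
    """
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    return [steps[i:i + max_steps] for i in range(0, len(steps), max_steps)]


# Example: a 20-step journey becomes three smaller runs.
journey = [f"step {n}" for n in range(1, 21)]
checkpoints = split_into_checkpoints(journey)
```

Each chunk can then be run as its own flow, so an early error stops one checkpoint instead of compounding across the whole journey.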

Next: Your first subagent.

Quick answers

What do I get from this cable?

You get a skill plus a dated field note that explains how we use it in real Claude Code workflows.

How much time should I budget?

Typical effort is 28 min. The cable is marked intermediate.

How do I install the artifact?

Run npx frenxt-cables add browser-use-qa. The install block shows the files it writes and any prerequisites before you run it.

How fresh is the guidance?

The cable was last verified on 2026-04-15 and includes source links for traceability.
