Open-sourcing our browser QA agent after proving it across 480 production specs

We built QA Agent to run markdown browser tests like a human QA operator, capture evidence, and turn release QA into an operating system instead of a spreadsheet graveyard. After using it heavily in TheBlueOne and InterviewLM, we packaged the core into an open-source product.

At A Glance

Product

QA Agent Open Source

Core stack

Python, browser-use, Playwright, OpenRouter, Linear, Sentry

Best-fit runtime model

Gemini 3.1 Flash Lite

Operational proof

TheBlueOne and InterviewLM

Production Specs

480

338 in TheBlueOne plus 142 in InterviewLM as inspected on April 16, 2026.

Archived Runs

289

178 Juliet report directories and 111 InterviewLM result directories.

Largest Archived Sweep

107 tests

One Juliet release summary recorded 74 passes and surfaced 33 failures before release.

Best-Fit Model

Gemini 3.1 Flash Lite

Our operating sweet spot for browser QA: fast enough to sweep broad suites without losing task fidelity.

Manual QA was slowing releases exactly where the products were moving fastest

Both downstream products had fast-changing surfaces that are awkward for brittle scripted test suites: dynamic dashboards, auth-heavy flows, voice experiences, AI-generated UI states, pricing changes, onboarding branches, and release sweeps defined in spreadsheets. The real constraint was not whether we could automate one happy path. It was whether the team could keep up with change without losing evidence.

Markdown-first specs

Tests stay readable by product, QA, and engineering. The parser turns markdown into structured steps, expectations, personas, and optional mobile viewports.
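To make the markdown-first idea concrete, here is a minimal parser sketch. The spec layout (a title heading, `persona:`/`viewport:` metadata lines, and `## Steps` / `## Expectations` sections) is an illustrative assumption, not the shipped QA Agent schema.

```python
import re

# Hypothetical spec shape for illustration; the real QA Agent schema may differ.
SPEC = """\
# Checkout happy path
persona: buyer
viewport: mobile

## Steps
- Open the pricing page
- Click "Upgrade"

## Expectations
- The checkout form is visible
"""

def parse_spec(text: str) -> dict:
    """Split a markdown test spec into metadata, steps, and expectations."""
    spec = {"title": "", "persona": None, "viewport": None,
            "steps": [], "expectations": []}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# ") and not spec["title"]:
            spec["title"] = line[2:]
        elif m := re.match(r"(persona|viewport):\s*(\S+)", line):
            spec[m.group(1)] = m.group(2)
        elif line.lower().startswith("## steps"):
            section = "steps"
        elif line.lower().startswith("## expectations"):
            section = "expectations"
        elif line.startswith("- ") and section:
            spec[section].append(line[2:])
    return spec
```

The point of the structure is that a product manager can review the spec text while the runner consumes the parsed steps and expectations.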

Release matrix ingestion

CSV release plans can be synced into markdown test cases, tagged as automated, semi-automated, or manual, then executed as one release pipeline.
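A sketch of what that sync step could look like, assuming illustrative column names (`id`, `title`, `mode`); the real release-matrix schema may differ.

```python
import csv
import io
import pathlib

# Illustrative release-matrix columns; the shipped schema may differ.
MATRIX = """\
id,title,mode
REL-1,Login with SSO,automated
REL-2,Refund a charge,manual
"""

def sync_matrix(csv_text: str, out_dir: pathlib.Path) -> list[pathlib.Path]:
    """Write one tagged markdown test case per CSV row."""
    written = []
    out_dir.mkdir(parents=True, exist_ok=True)
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = out_dir / f"{row['id'].lower()}.md"
        path.write_text(
            f"# {row['title']}\n\n"
            f"mode: {row['mode']}\n"  # automated | semi-automated | manual
        )
        written.append(path)
    return written
```

Tagging each case with its execution mode is what lets automated, semi-automated, and manual checks flow through one release pipeline.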

Evidence-rich reports

Every run can emit screenshots, GIFs, video, conversations, HTML summaries, and machine-readable JSON for downstream reporting.
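The machine-readable side could look like the sketch below; the field names are assumptions for illustration, not the shipped report format.

```python
import json
import time

# Illustrative report schema; field names are assumptions, not the shipped format.
def summarize_run(results: list[dict]) -> str:
    """Roll per-test verdicts into one machine-readable run summary."""
    summary = {
        "run_id": time.strftime("%Y-%m-%dT%H-%M-%S"),
        "total": len(results),
        "passed": sum(r["verdict"] == "PASS" for r in results),
        "failed": sum(r["verdict"] == "FAIL" for r in results),
        "tests": results,  # each entry keeps its screenshot/GIF/video paths
    }
    return json.dumps(summary, indent=2)
```

A summary in this shape is what downstream reporting (dashboards, release gates) can consume without parsing HTML.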

Ops integrations

Linear issue creation, Sentry correlation, persona refresh, and optional Supabase uploads turn test runs into something the team can actually act on.
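For the Linear side, a failed run can be routed as an `issueCreate` GraphQL mutation against `https://api.linear.app/graphql`. This sketch only builds the payload; the team id, titles, and wiring are placeholders, not the exact QA Agent integration.

```python
# Sketch of a payload for Linear's issueCreate GraphQL mutation.
# team_id and the title/description wording are placeholders.
def linear_issue_payload(test_name: str, report_url: str, team_id: str) -> dict:
    mutation = """
    mutation IssueCreate($input: IssueCreateInput!) {
      issueCreate(input: $input) { success issue { identifier } }
    }
    """
    return {
        "query": mutation,
        "variables": {
            "input": {
                "teamId": team_id,
                "title": f"QA failure: {test_name}",
                "description": f"Agent run evidence: {report_url}",
            }
        },
    }
```

Linking the run's evidence directory in the issue body is what turns a red status into something an engineer can act on directly.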

Two products, two different QA problems, one shared testing substrate

This is the part that made open-sourcing credible. We were not packaging a demo. We were extracting the reusable kernel from systems already doing serious work inside two production codebases.

TheBlueOne

Embedded `apps/qa-agent` deployment supporting large release QA, UAT sweeps, auth checks, pricing and billing validation, and builder workflows.

338 markdown specs
9 suite folders
178 archived report directories

InterviewLM

Embedded `apps/qa-agent` deployment covering smoke, auth, session-flow, stage-types, results UI, PMF checks, voice flows, screenshots, and demo capture.

142 markdown specs
17 suite folders
111 archived result directories

The release-QA proof point

One archived Juliet release summary from March 15, 2026 recorded 107 total tests, 74 passes, and 33 failures. That is exactly the kind of evidence trail we wanted: large enough to matter, structured enough to route into engineering, and visual enough to debug quickly.

The product got stronger once we stopped treating QA as a script-writing problem

The core design move was simple: keep test authoring legible, keep execution agentic, and keep outputs operational. That combination is what let the same system scale from smoke tests to release sweeps.

The tests are written in product language instead of selector language, so they survive UI change better and can be reviewed by non-engineers.

Persona-aware auth avoids the usual dead end where AI browser tests spend their whole budget logging in or fighting anti-bot friction.

Artifact capture makes failures legible. Engineers do not just get a red status; they get visual evidence, agent reasoning, and run context.

Release mode accepts the reality that not every check should be fully automated. Manual and semi-automated cases still live in the same operating system.

CLI workflow

`run`, `report`, `auth`, and `release` cover the common QA loop without shipping a giant platform.

Model-agnostic core

The starter stays configurable, but our production operating guidance points hard at Gemini 3.1 Flash Lite for this job.
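In practice that can be as small as an environment override with a production default; the variable name here is an assumption, while the default model id matches the one both runtimes converged on.

```python
import os

# Assumed configuration shape: the env var name is hypothetical; the default
# matches the OpenRouter model id both production runtimes converged on.
DEFAULT_MODEL = "google/gemini-3.1-flash-lite-preview"

def runtime_model() -> str:
    return os.environ.get("QA_AGENT_MODEL", DEFAULT_MODEL)
```

Keeping the override trivial is what makes the core model-agnostic without diluting the operating guidance.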

Release-compatible

CSV import exists because real release managers still work from matrices, not just code-owned test folders.

The QA evidence ships with the narrative

We ran the vendored QA Agent directly against the live `frenxt` UI on April 16, 2026. The primary suite run `2026-04-16T21-27-46` passed 4 of 4 checks in 188 seconds across the homepage, case study, research index, and article detail pages. A second proof-focused run `2026-04-16T21-31-57` passed 3 of 3 checks in 107 seconds and produced the GIF artifacts shown below.

Primary UI Run

4 / 4 passed

Run `2026-04-16T21-27-46` validated the homepage, case study, research index, and the Gemini article detail page.

Proof Capture Run

3 / 3 passed

Run `2026-04-16T21-31-57` generated the motion-proof assets used for publication review.

Verified screenshot of the QA Agent open-source case study page in frenxt
Clean page snapshot from the verified case study route after the UI suite passed.
GIF proof from the QA Agent validation run of the case study page
Motion proof captured from `apps/qa-agent/reports/2026-04-16T21-31-57` during the proof-focused validation run.

Gemini 3.1 Flash Lite became the best fit for this workflow

That claim is based on operating experience, not generic benchmark chest-thumping. TheBlueOne config and InterviewLM runner both converged on Gemini 3.1 Flash Lite because browser QA values a very specific blend of speed, cost discipline, and instruction adherence.

Why it held up in production

Fast enough for parallel browser sessions without turning every regression sweep into an overnight job.

Cheap enough to justify broad smoke and release coverage instead of rationing runs to a tiny critical path.

Reliable enough to follow multi-step markdown instructions, use QA tools, and produce crisp PASS or FAIL verdicts.

Already proven in both TheBlueOne and InterviewLM configurations, where the runtime converged on `google/gemini-3.1-flash-lite-preview` for day-to-day execution.
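The "crisp PASS or FAIL" requirement can be enforced defensively on the harness side. This is a minimal sketch of one such guard, not the shipped verdict logic: anything ambiguous is counted as a failure so a sweep never passes by accident.

```python
import re

def extract_verdict(agent_output: str) -> str:
    """Pull a strict verdict from the agent's final message; anything
    ambiguous counts as FAIL so a sweep never passes by accident."""
    m = re.search(r"\b(PASS|FAIL)\b", agent_output.upper())
    return m.group(1) if m else "FAIL"
```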

QA Agent is now a marketable product, not just an internal convenience

The open-source package matters because it compresses a real operating pattern: author tests in markdown, execute them with a browser agent, capture enough evidence to debug, and keep release QA tied to the tools teams already use. That is a much stronger story than “we built another testing wrapper.”