Open Source Productization
Open-sourcing our browser QA agent after proving it across 480 production specs
We built QA Agent to run markdown browser tests like a human QA operator, capture evidence, and turn release QA into an operating system instead of a spreadsheet graveyard. After using it heavily in TheBlueOne and InterviewLM, we packaged the core into an open-source product.
At A Glance
Product
QA Agent Open Source
Core stack
Python, browser-use, Playwright, OpenRouter, Linear, Sentry
Best-fit runtime model
Gemini 3.1 Flash Lite
Operational proof
TheBlueOne and InterviewLM
Repository
github.com/frenxt/qa-agent
Production Specs
480
338 in TheBlueOne plus 142 in InterviewLM as inspected on April 16, 2026.
Archived Runs
289
178 Juliet report directories and 111 InterviewLM result directories.
Largest Archived Sweep
107 tests
One Juliet release summary recorded 74 passes and surfaced 33 failures before release.
Best-Fit Model
Gemini 3.1 Flash Lite
Our operating sweet spot for browser QA: fast enough to sweep broad suites without losing task fidelity.
The Problem
Manual QA was slowing releases exactly where the products were moving fastest
Both downstream products had fast-changing surfaces that are awkward for brittle scripted test suites: dynamic dashboards, auth-heavy flows, voice experiences, AI-generated UI states, pricing changes, onboarding branches, and release sweeps defined in spreadsheets. The real constraint was not whether we could automate one happy path. It was whether the team could keep up with change without losing evidence.
Markdown-first specs
Tests stay readable by product, QA, and engineering. The parser turns markdown into structured steps, expectations, personas, and optional mobile viewports.
Release matrix ingestion
CSV release plans can be synced into markdown test cases, tagged as automated, semi-automated, or manual, then executed as one release pipeline.
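A rough shape of that sync step, assuming a matrix with `title`, `mode`, `steps`, and `expected` columns (the real importer's column names and output layout may differ):

```python
import csv
import io

def matrix_to_markdown(csv_text: str) -> list[str]:
    """Convert release-matrix CSV rows into markdown test case stubs.

    Column names here are assumptions for illustration; `mode` carries the
    automated / semi-automated / manual tag from the release plan.
    """
    cases = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        mode = row.get("mode", "manual").strip().lower()
        cases.append(
            f"# Test: {row['title']}\n"
            f"Mode: {mode}\n\n"
            f"## Steps\n- {row['steps']}\n\n"
            f"## Expectations\n- {row['expected']}\n"
        )
    return cases

PLAN = """title,mode,steps,expected
Login smoke,automated,Sign in as the admin persona,Dashboard loads without errors
Refund flow,manual,Issue a refund from billing,Refund appears in the ledger
"""

cases = matrix_to_markdown(PLAN)
```

The useful property is that manual cases land in the same markdown corpus as automated ones, so one release pipeline can report on all of them.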
Evidence-rich reports
Every run can emit screenshots, GIFs, video, conversations, HTML summaries, and machine-readable JSON for downstream reporting.
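For the machine-readable side, a summary like the one below is the kind of artifact downstream reporting can consume. The field names and layout are an illustrative assumption, not the repository's exact report schema.

```python
import json
from datetime import datetime, timezone

def run_summary(results: list[dict], artifact_dir: str) -> str:
    """Serialize per-test verdicts into a JSON run report (illustrative schema)."""
    passed = sum(1 for r in results if r["verdict"] == "PASS")
    report = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "total": len(results),
        "passed": passed,
        "failed": len(results) - passed,
        "artifact_dir": artifact_dir,  # where screenshots/GIFs/video landed
        "tests": results,
    }
    return json.dumps(report, indent=2)

RESULTS = [
    {"name": "homepage", "verdict": "PASS", "screenshot": "runs/demo/homepage.png"},
    {"name": "pricing", "verdict": "FAIL", "screenshot": "runs/demo/pricing.png"},
]
summary = run_summary(RESULTS, "runs/demo")
```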
Ops integrations
Linear issue creation, Sentry correlation, persona refresh, and optional Supabase uploads turn test runs into something the team can actually act on.
Operational Evidence
Two products, two different QA problems, one shared testing substrate
This is the part that made open-sourcing credible. We were not packaging a demo. We were extracting the reusable kernel from systems already doing serious work inside two production codebases.
TheBlueOne
Embedded `apps/qa-agent` deployment supporting large release QA, UAT sweeps, auth checks, pricing and billing validation, and builder workflows.
InterviewLM
Embedded `apps/qa-agent` deployment covering smoke, auth, session-flow, stage-types, results UI, PMF checks, voice flows, screenshots, and demo capture.
The release-QA proof point
One archived Juliet release summary from March 15, 2026 recorded 107 total tests, 74 passes, and 33 failures. That is exactly the kind of evidence trail we wanted: large enough to matter, structured enough to route into engineering, and visual enough to debug quickly.
Why It Worked
The product got stronger once we stopped treating QA as a script-writing problem
The core design move was simple: keep test authoring legible, keep execution agentic, and keep outputs operational. That combination is what let the same system scale from smoke tests to release sweeps.
The tests are written in product language instead of selector language, so they survive UI change better and can be reviewed by non-engineers.
Persona-aware auth avoids the usual dead end where AI browser tests spend their whole budget logging in or fighting anti-bot friction.
Artifact capture makes failures legible. Engineers do not just get a red status; they get visual evidence, agent reasoning, and run context.
Release mode accepts the reality that not every check should be fully automated. Manual and semi-automated cases still live in the same operating system.
Open-Source Scope
CLI workflow
`run`, `report`, `auth`, and `release` cover the common QA loop without shipping a giant platform.
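In practice the loop looks something like the sketch below. Only the four subcommand names come from the project; the arguments and paths are hypothetical placeholders, not documented flags.

```shell
# Subcommands are real; arguments and paths here are hypothetical.
qa-agent auth                     # refresh persona sessions before a sweep
qa-agent run tests/smoke/*.md     # execute markdown specs with the browser agent
qa-agent report                   # render HTML + JSON artifacts for the last run
qa-agent release plan.csv         # sync a CSV release matrix and execute it
```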
Model-agnostic core
The starter stays configurable, but our production operating guidance points hard at Gemini 3.1 Flash Lite for this job.
Release-compatible
CSV import exists because real release managers still work from matrices, not just code-owned test folders.
Proof Of Execution
The QA evidence ships with the narrative
We ran the vendored QA Agent directly against the live `frenxt` UI on April 16, 2026. The primary suite run `2026-04-16T21-27-46` passed 4 of 4 checks in 188 seconds across the homepage, case study, research index, and article detail pages. A second proof-focused run `2026-04-16T21-31-57` passed 3 of 3 checks in 107 seconds and produced the GIF artifacts shown below.
Primary UI Run
4 / 4 passed
Run `2026-04-16T21-27-46` validated the homepage, case study, research index, and the Gemini article detail page.
Proof Capture Run
3 / 3 passed
Run `2026-04-16T21-31-57` generated the motion-proof assets used for publication review.
Model Choice
Gemini 3.1 Flash Lite became the best fit for this workflow
That claim is based on operating experience, not generic benchmark chest-thumping. The TheBlueOne config and the InterviewLM runner both converged on Gemini 3.1 Flash Lite because browser QA rewards a very specific blend of speed, cost discipline, and instruction adherence.
Why it held up in production
Fast enough for parallel browser sessions without turning every regression sweep into an overnight job.
Cheap enough to justify broad smoke and release coverage instead of rationing runs to a tiny critical path.
Reliable enough to follow multi-step markdown instructions, use QA tools, and produce crisp PASS or FAIL verdicts.
Already proven in both TheBlueOne and InterviewLM configurations, where the runtime converged on `google/gemini-3.1-flash-lite-preview` for day-to-day execution.
Outcome
QA Agent is now a marketable product, not just an internal convenience
The open-source package matters because it compresses a real operating pattern: author tests in markdown, execute them with a browser agent, capture enough evidence to debug, and keep release QA tied to the tools teams already use. That is a much stronger story than “we built another testing wrapper.”