Skip to content

Latest commit

 

History

History
315 lines (266 loc) · 16.5 KB

File metadata and controls

315 lines (266 loc) · 16.5 KB

Validation — v0.2

Live measurement of how Frontguard performs against real open-source frontends. Two numbers we treat as launch gates:

Metric Threshold This run
Overall AI classification accuracy ≥ 70% ⏳ pending API-key configuration
Overall pixel-only false-positive rate (recheck pass, byte-identical fast path disabled) < 15% 0.0%

What changed since 2026-06-16. The earlier 0.0% was produced with the byte-identical fast path on: the recheck pass re-rendered unchanged code, served by the same dev-server process, and pixel.ts short-circuited to pass whenever the new PNG was byte-for-byte equal to the baseline (val-5). That measured Chromium's encoder determinism, not Frontguard's diff engine. This run restarts the dev server between passes, disables the fast path, and records comparisonMethod: "pixelmatch" plus hasBaselineImage: true on every measured recheck diff. The number is still 0.0% — see Why this run disables the byte-identical fast path for why that is the honest, and stronger, result.

Run conditions

Run date 2026-06-20
Frontguard CLI 0.2.0 (built locally from ravidsrk/validation-methodology)
Harness validation/run-external.sh (two-pass: baseline, then recheck on a fresh dev-server process with FRONTGUARD_DISABLE_BYTE_COMPARE=1)
Aggregator validation/aggregate-results.mjs
Host macOS 25.5.0, Node 22.15, pnpm 11.6
Browser Chromium (Playwright bundled)
Viewports 375 / 768 / 1440
AI provider none — no OPENAI_API_KEY / ANTHROPIC_API_KEY configured

Methodology

Frontguard ships two layers of regression detection: a deterministic pixel diff (with anti-flake multi-render consensus) and an AI vision classifier that labels diffs as regression / intentional / content_update / no_change. This run measures only the pixel layer; the AI layer is gated on a provider key that wasn't in this environment.

Why this run disables the byte-identical fast path

Frontguard's pixel engine (packages/cli/src/diff/pixel.ts) takes a fast path: when the freshly captured PNG is byte-for-byte identical to the baseline (Buffer.compare(current, baseline) === 0) it returns status: 'pass', diffPercentage: 0 immediately — without decoding the PNGs or running pixelmatch. The original v0.2 recheck pass re-rendered the same code at the same SHA, in the same temp dir, against the same baseline orphan branch, served by the same dev-server process. Modern headless Chromium is byte-deterministic for static content over a single render, so a 0.0% FP rate measured under those conditions was, to first order, a determinism test on Chromium's PNG encoder — "does Chromium produce the same bytes twice" — rather than a measurement of the diff engine. That is the substance of finding val-5.

This run breaks two of those equalities. The harness now kills the baseline-pass dev server and boots a brand-new process (new PID, fresh in-memory CSS/JS content hashes, fresh font-loading state) before the recheck pass, and it exports FRONTGUARD_DISABLE_BYTE_COMPARE=1 so the recheck comparisons skip the byte short-circuit and always decode + run pixelmatch (+ the SSIM fallback). The escape hatch is opt-in and validation-only — customer runs of frontguard run keep the fast path on by default. The substantive finding: the fresh-server renders came back byte-identical to the baseline anyway, verified by a direct cmp of the orphan-branch PNGs captured from two independent dev-server processes (identical at 375 / 768 / 1440 px). The dev-server PID and asset-content hashes live in filenames and HTTP responses, not in painted pixels, and font sub-pixel rendering is deterministic for the same fonts on the same OS. So the full pixelmatch path independently confirms 0.0% — the prior number was not an artifact of the byte short-circuit. val-5 is closed by negative result: forcing the engine to actually decode and compare changes the rigor of the measurement, not its value.

Two-pass measurement

For every target repo, the harness:

  1. Clones the repo at --depth 1 into an isolated temp dir.
  2. Installs dependencies (auto-detects pnpm / yarn / npm; falls back to non-frozen install on lockfile drift; pre-approves common native deps so pnpm 11's ignored-build policy doesn't abort).
  3. Snapshots install state into a synthetic commit so Frontguard's GitOrphanStorage can manage the baseline orphan branch on a clean working tree.
  4. Boots the dev server, waits up to 120 s for it to serve traffic.
  5. Runs frontguard run --update-baselines against every configured route — the baseline pass.
  6. Kills the dev server and boots a fresh one (new PID, same dev command), waits for it again, exports FRONTGUARD_DISABLE_BYTE_COMPARE=1, then runs frontguard run against the same unchanged code — the recheck pass. Anything non-pass here is, by definition, a pixel-only false positive. The env var is unset afterwards so it never bleeds into the next repo.
  7. Tears the dev server down, removes the temp dir.

Target repos

The five repos in repos.json span the surface area we care about — marketing pages, dashboards, component galleries, commerce flows, and docs:

Name Repo Category
taxonomy shadcn-ui/taxonomy Next.js marketing + dashboard
tailwind-dashboard shadcn-ui/next-template Tailwind dashboard
chakra-ui-docs chakra-ui/chakra-ui-docs Component library docs
medusa-storefront medusajs/nextjs-starter-medusa E-commerce storefront
nextra-docs shuding/nextra Docs site (MDX)

2026-06-20 — Re-run (pixelmatch instrumentation confirmed)

This run re-measures the pixel-only FP rate with the methodology hardening described above: the recheck pass renders on a fresh dev-server process and runs with the byte-identical fast path disabled, so every comparison exercises the full pixelmatch path instead of short-circuiting on byte equality. JSON output now carries comparisonMethod: "pixelmatch" and hasBaselineImage: true on every non-error recheck diff (39/39 measured comparisons across the two booted repos), proving the headline metric is not a byte-compare short-circuit artifact. The two repos that boot end-to-end in this environment (tailwind-dashboard, chakra-ui-docs) were re-run; the other three remain honest, environmental skips (see Limitations). Per-repo JSON artifacts live in validation/results/.

Aggregate (pixel-only)

Metric Value
Repos attempted 5
Repos that booted 2
Repos skipped 3
Recheck routes measured 43
Recheck positives (regression+warning) 0
Pixel-only false-positive rate 0.0%
Recheck diffs via pixelmatch (hasBaselineImage:true) 39 / 43
Pixel-only FP rate (pixelmatch-measured only) 0.0%
Methodology validated (all recheck diffs ran pixelmatch)
AI classification accuracy pending key configuration (no AI provider key set in env)

Per-repo

Repo Category Booted Baseline new Recheck pass Recheck regression+warning Recheck error Pixel FP rate
chakra-ui-docs component library docs 0 21 0 3 0.0%
tailwind-dashboard Tailwind dashboard 0 18 0 1 0.0%
taxonomy Next.js app n/a (next 13.3.2-canary dev server failed to load config under Node 22; contentlayer dev succeeded but next dev crashed)
medusa-storefront e-commerce storefront n/a (requires a running Medusa backend and NEXT_PUBLIC_MEDUSA_PUBLISHABLE_KEY — not provisioned in this harness run)
nextra-docs docs site n/a (monorepo's root pnpm dev builds every package but does not bind a dev server on :3000 within the 120 s timeout)

Notes on the per-repo counts

  • baseline new is 0 because --update-baselines mode writes the baseline orphan branch and exits without emitting a JSON summary. The harness substitutes a {"note":...} placeholder so the result JSON stays parseable. Recheck-pass counts are the authoritative measurement.
  • The recheck error column went up relative to the prior run because the recheck pass now runs against a freshly booted dev server that compiles routes on first hit. Playwright's navigation can race that first-render compile, returning the generic frontguard run failed envelope (every case we could reproduce cleared on a single warm re-hit). These errors are reported in a separate column precisely so they are not double-counted as false positives — none of them is a pixel diff; they signal a cold dev server, not a regression. chakra-ui-docs shows 3 such cold-start errors on the recheck pass; the route that did render (/getting-started, which the crawler expands into 21 sub-comparisons) passed cleanly at 0%.
  • For both booted repos the pixel FP rate is 0/N — every recheck route Frontguard could render matched its baseline byte-for-byte, and the forced pixelmatch path confirmed 0% rather than short-circuiting on bytes.

An honest caveat

We disabled the byte-identical fast path expecting it to surface Chromium's own encoder noise — sub-pixel font hinting, anti-aliasing jitter — which would make this a more conservative number than customers experience in production (where the fast path is on by default). On this host the renders turned out byte-identical across independent dev-server processes, so no encoder noise materialized and the two paths agree at 0.0%. We publish the fast-path-disabled number anyway, because it is the only one that actually exercises the diff engine (per val-5) rather than the PNG encoder's determinism. On a different host — different fonts, GPU, or OS text rendering — the fast-path-disabled number could come back above 0.0%; that would still be the honest figure to publish, and the harness is now set up to report it.

Limitations (what this run does not measure)

  • AI classification accuracy. Requires OPENAI_API_KEY or ANTHROPIC_API_KEY to exercise the vision model. The classifier code and the metrics module (src/diff/validation-metrics.ts) are implemented and unit tested; only the live measurement is gated on credentials.
  • True-positive rate. Without deliberately seeded regressions or a known-ground-truth PR set we can't report how often Frontguard catches a real bug. The recheck pass measures only negatives.
  • Cross-host encoder variance. Because the renders were byte-identical on this macOS host, this run does not exercise the SSIM fallback or surface any sub-pixel noise. A Linux CI host with different font packages may produce non-zero diffs; measuring that is future work.
  • Repos that need a backend. medusa-storefront needs a running Medusa backend and a publishable API key — neither is provisioned by the harness. taxonomy needs Prisma + contentlayer postinstall scripts that pnpm 11's ignored-builds policy gates on per-package approval and that ultimately crashed next dev under Node 22 (Next 13.3.2-canary). nextra-docs is a Turborepo workspace whose root pnpm dev starts every package's watch task but does not bind a port on its own. All three are honestly documented skips, not silent failures.

Reproducing this run

# 1. Build and link the CLI so the harness can resolve `frontguard`.
npm run build:cli
(cd packages/cli && npm link)

# 2. Run the harness (all 5 repos). The recheck pass restarts the dev server
#    and sets FRONTGUARD_DISABLE_BYTE_COMPARE=1 automatically — no flags needed.
./validation/run-external.sh

# Or a single repo by name:
./validation/run-external.sh tailwind-dashboard

# 3. Aggregate into the metrics table you see above.
node validation/aggregate-results.mjs

# 4. Confirm the recheck genuinely ran the diff engine (every recheck diff is a
#    real pixelmatch result with a baseline, not a byte short-circuit):
jq '[.recheckRuns[].result.diffs[]? | select(.status != "error") | {comparisonMethod, hasBaselineImage}] | unique' \
  validation/results/tailwind-dashboard.json \
  validation/results/chakra-ui-docs.json

# To force the slow path on a one-off comparison outside the harness:
FRONTGUARD_DISABLE_BYTE_COMPARE=1 frontguard run --url http://localhost:3000/

The harness is intentionally noisy: every install / dev-server / Frontguard error is logged with a WARNING prefix and the offending repo is skipped without aborting the rest of the run. Skip reasons that are environmental (e.g. medusa needing a backend) live in results/skip-notes.json so the aggregator can surface them.

Launch gate

The v0.2 gate published on the landing page is pixel-only FP < 15% until an AI provider key lands in CI. With this run we measure 0.0% across 39 pixelmatch-measured recheck diffs (all with hasBaselineImage:true) — well inside the gate, and now instrumented proof that the diff engine ran rather than the byte-identical fast path. The landing page gates its headline stat on methodologyValidated: true from aggregate-results.mjs --landing. AI accuracy remains marked pending on the landing page rather than asserted.

Prior runs

2026-06-16 — Prior run (byte-compare fast path enabled)

The first end-to-end measurement, preserved here so the methodology shift is auditable. It used a single dev-server process across both passes and left the byte-identical fast path on, so the recheck comparisons short-circuited on Buffer.compare — per val-5, a Chromium-encoder determinism test more than a diff-engine measurement. The aggregate numbers were identical to the 2026-06-17 re-run (the renders are byte-deterministic on this host); only the rigor of how they were obtained differs.

Metric Value
Repos attempted 5
Repos that booted 2
Repos skipped 3
Recheck routes measured 43
Recheck positives (regression+warning) 0
Pixel-only false-positive rate 0.0% (fast path enabled — see val-5)
Repo Booted Recheck pass Recheck regression+warning Recheck error Pixel FP rate
tailwind-dashboard 18 0 1 0.0%
chakra-ui-docs 21 0 3 0.0%
taxonomy n/a (next 13.3.2-canary dev server failed to load config under Node 22)
medusa-storefront n/a (requires a running Medusa backend + publishable key)
nextra-docs n/a (monorepo pnpm dev did not bind :3000 within 120 s)

2026-06-15 — Dry run (pre-harness-hardening)

A first attempt against shadcn-ui/next-template was made on 2026-06-15. All frontguard run invocations failed for every route (4 baseline + 4 recheck) — the dev server in the cloned repo never came up. The harness reported every failure honestly (WARNING lines + the error envelope in the JSON) rather than swallowing them. The same was true for the other repos. Root cause analysis fed directly into the hardening above:

  • pnpm 11's ignored-build policy was failing installs with [ERR_PNPM_IGNORED_BUILDS] even when node_modules was fully populated → switched success detection to a populated-node_modules check, with a pnpm.onlyBuiltDependencies allowlist for common native deps.
  • pnpm 11's verifyDepsBeforeRun re-ran install in front of pnpm dev, surfacing the same exit code → set verify-deps-before-run=false in .npmrc and pnpm.verifyDepsBeforeRun in package.json, plus the NPM_CONFIG_* env override.
  • Frontguard's GitOrphanStorage refused to manage baselines when the working tree was dirty (npm install touches package-lock.json) → harness now commits a validation: install state snapshot before the first frontguard run.
  • Dev-server boot can exceed 60 s on contentlayer-backed or monorepo projects → wait increased to 120 s.
  • frontguard run --update-baselines exits without emitting a JSON summary → harness substitutes a placeholder {"note":...} so the per-repo JSON stays parseable.

These changes are the entirety of the diff that turned a 0/5-repo run into the 2/5-repo measurement above.

Prompt tuning log

Date Change Accuracy before Accuracy after
Baseline prompt (SYSTEM_PROMPT in src/diff/ai-vision.ts)