Investigate and propose a performance budget for axe-core #5176

Description

@Garbee

Summary

axe-core has no defined performance budget. We track functional correctness exhaustively in
CI, but nothing measures, baselines, or gates bundle size or runtime cost. The pieces to
build one mostly exist already — they're just report-only and never compared against a limit.

This is an investigation, not an implementation. The deliverable is a written, reviewable
performance-budget proposal produced by the assigned engineer — covering what to measure, the
reference page it's measured against, candidate thresholds, and a recommended path to enforce it.
Actually building the harness, metric export, and CI gates is follow-up work scoped out of this
issue. The sections below are starting material for that investigation, not a committed design.

The proposal should, at minimum, address: (1) a reference DOM profile — an explicit, documented
target page size that every runtime number is calculated against (the engineer chooses the number
deliberately; this issue doesn't prescribe it), (2) which performance dimensions to budget and
proposed threshold values, and (3) a recommendation for how enforcement would eventually plug into
the build/CI so regressions are caught at PR time instead of discovered by consumers.

Motivation

  • axe-core runs inside the consumer's page on every audit, so both payload size and execution
    time are felt directly by users. The README markets axe as "fast, secure, lightweight"
    (README.md:11) but attaches no numbers to that claim.
  • We already acknowledge a real ceiling: doc/API.md:962-992 documents that pages with >50K
    elements can take 10s+, and points consumers at resultTypes and skipping color-contrast as
    mitigations. That's a reactive workaround for a cost we don't currently measure or bound.
  • There have been 50+ perf: commits since 2014 (color-contrast v4.0; selector caching in #4611,
    "perf(selector): more caching for faster selector creation"; runPartial timing in #5103,
    "chore: add perf timing to runPartial"; …). perf: is a sanctioned commit type. We invest in
    performance but have no guardrail to keep those gains from silently eroding.

Current state (what we already have)

Runtime instrumentation — rich, but console-only.
lib/core/utils/performance-timer.js is a full User Timing wrapper, enabled per run via
performanceTimer: true (doc/API.md:436, wired in lib/core/public/run.js:36-37). It emits a
layered breakdown via window.performance.mark()/measure():

  • Total audit (axe) and per-frame audit (audit_start_to_end) — performance-timer.js:30-71,
    lib/core/public/run-rules.js:35-50
  • Per-rule total (rule_[ruleId]), gather (#gather), visibility filtering, matches (#matches),
    and checks (runchecks_[ruleId]) — lib/core/base/rule.js:130-166, 369-404, 445-474
  • after processing (audit.after) and reporter (run-rules.js:62-73, run.js:42-45,67-69)
  • Gather logging already includes node counts: gather for [ruleId] ([count] nodes): [ms]ms
    (rule.js:382-387)
  • Works under runPartial too (run-partial.js:31-32,38-39)

Covered by test/core/utils/performance-timer.js (15+ cases).
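
For readers less familiar with the User Timing pattern described above, here is a minimal sketch of it. This is not axe-core's implementation: the helper names timeSection and readMeasures and the demo_rule measure name are hypothetical, and the code relies only on the global performance object available in browsers and recent Node versions.

```javascript
// Hypothetical helper, not part of axe-core: wraps a synchronous function in
// a pair of User Timing marks and records a measure between them.
function timeSection(name, fn) {
  const start = `${name}_start`;
  const end = `${name}_end`;
  performance.mark(start);
  try {
    return fn();
  } finally {
    performance.mark(end);
    performance.measure(name, start, end);
  }
}

// Reads recorded measures back as plain data instead of console output.
function readMeasures() {
  return performance
    .getEntriesByType('measure')
    .map(({ name, duration }) => ({ name, duration }));
}

const result = timeSection('demo_rule', () => 21 * 2);
const measures = readMeasures();
```

Reading entries back as data, rather than logging them, is the property a budget harness would need; whether that belongs in the timer itself or in a harness is one of the open questions below.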

Build pipeline — a first-party Node script.
npm run build → node build/run-build.mjs (package.json:79) →
build/run-build/full-build.mjs:18-71, which runs clean → validate → metadata → esbuild →
configure → babel → concat → uglify (build/run-build/concat-uglify.mjs:59, via uglify-js) →
aria-docs → locale template → postbuild → bytesize. This matters for a budget: the size-reporting
step is first-party JS we control, so adding a threshold is just editing our own build code. The
pipeline even has its own unit tests, run via npm run test:build →
node --test build/run-build/*.test.mjs (package.json:124).

Bundle-size reporting — report-only.
The final build step is a homegrown runBytesize() (build/run-build/postbuild.mjs:20-32, invoked
at full-build.mjs:64-65). It iterates every locale variant × .js/.min.js, stats the file, and
just console.logs ${name}: ${bytes} bytes — no threshold, no gzip, no diff. Current sizes:

| Artifact | Raw | Gzipped |
|---|---|---|
| axe.js (published main, package.json:56) | ~1.2 MB (1,292,628 B) | |
| axe.min.js | ~561 KB (574,862 B) | ~151 KB (154,444 B) |

There's strong prior art for a hard size gate: assertEsbuildImportLimits()
(build/run-build/esbuild-core.mjs:26-40) already fails the build via assert when a module
exceeds a max import count or maxSize — but it's applied only to the gather-internals entry
({ max: 10, maxSize: 4000 }, esbuild-core.mjs:69-72), not the shipped bundle.
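
A comparable gate on the shipped bundle might look roughly like the sketch below. It is illustrative only: findSizeBreaches, assertSizeBudget, and the 600,000 B ceiling are hypothetical, and nothing here reflects the actual signature of assertEsbuildImportLimits() or runBytesize().

```javascript
// Hypothetical ceiling for illustration only; the proposal would choose real values.
const ceilings = { 'axe.min.js': 600_000 };

// Returns a list of breaches rather than throwing, so the caller decides the policy.
// Files listed in `limits` but missing from `sizes` are ignored in this sketch.
function findSizeBreaches(sizes, limits) {
  return Object.entries(limits)
    .filter(([file, max]) => file in sizes && sizes[file] > max)
    .map(([file, max]) => ({ file, bytes: sizes[file], max, over: sizes[file] - max }));
}

// Throws on any breach, similar in spirit to an assert-based build gate.
function assertSizeBudget(sizes, limits) {
  const breaches = findSizeBreaches(sizes, limits);
  if (breaches.length > 0) {
    const detail = breaches
      .map(b => `${b.file}: ${b.bytes} B exceeds ${b.max} B by ${b.over} B`)
      .join('; ');
    throw new Error(`Bundle size budget exceeded: ${detail}`);
  }
}

// 574,862 B is the axe.min.js size quoted in the table above.
assertSizeBudget({ 'axe.min.js': 574_862 }, ceilings);
```

Separating detection from the throw keeps a warn-only mode possible, which matters for the failure-policy question later in this issue.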

CI gating.
.github/workflows/test.yml runs ~15 jobs; deploy.yml:26-56,125-126 makes Test-workflow success
the sole merge blocker. The build job (test.yml:55-71) already runs npm run prepare && npm run build (so bytesize runs) and uploads axe.js as an artifact (test.yml:67-70) consumed by
downstream jobs via download-artifact (test.yml:147).

Gaps a budget must fill

  1. No machine-readable metrics. logMeasures() only does
    this._log('Measure ' + name + ' took ' + duration + 'ms') (performance-timer.js:111-144);
    timing is not in the results object, and marks are cleared after measurement by default
    (:92-104), so a PerformanceObserver can't reliably catch them. A budget needs a structured
    extraction path. (net-new)
  2. No benchmark suite / representative fixtures. No large-DOM stress pages exist
    (test/assets/ is media only). No baselines, no time-series, no cross-version comparison.
    (net-new)
  3. No thresholds on the bundle. runBytesize() only logs; nothing fails on regression. No
    gzip/brotli sizing. (Contrast with assertEsbuildImportLimits(), which does hard-fail — but
    only for the gather-internals module, not the shipped bundle.)
  4. No CI performance gate. .github/workflows/ has zero references to size/perf/budget;
    runBytesize() output is buried in the build logs — never a step summary, never an artifact,
    never diffed base-vs-head.
  5. No per-check granularity (checks collapse into runchecks_[ruleId]) and no
    frame-collection/iframe-overhead timing.
  6. No memory tracking — wall-clock only.
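
As a stopgap for Gap 1, a harness could in principle scrape the existing log output. The sketch below parses lines in the 'Measure ' + name + ' took ' + duration + 'ms' format quoted above; parseMeasureLines is a hypothetical helper, the sample lines and their timings are made up, and it assumes durations are printed as plain decimal numbers.

```javascript
// Matches the log format quoted in Gap 1: "Measure <name> took <duration>ms".
// Assumption: the duration is a plain decimal number (no exponent notation).
const MEASURE_LINE = /^Measure (.+) took (\d+(?:\.\d+)?)ms$/;

// Hypothetical helper: turns captured log lines into structured records and
// silently skips anything that is not a measure line.
function parseMeasureLines(lines) {
  const measures = [];
  for (const line of lines) {
    const match = MEASURE_LINE.exec(line.trim());
    if (match) {
      measures.push({ name: match[1], durationMs: Number(match[2]) });
    }
  }
  return measures;
}

// Made-up sample input for illustration.
const parsed = parseMeasureLines([
  'Measure rule_color-contrast took 12.5ms',
  'unrelated log output',
  'Measure axe took 40ms'
]);
```

Log scraping is brittle compared with a structured return value, so this is better read as a way to prototype a budget than as the recommended extraction path.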

Reference DOM profile — the anchor for every runtime number

A wall-clock or per-rule budget is meaningless without a defined page to run it against: "axe takes
40ms" only means something relative to how big the DOM was. So a core part of the proposal is for
the engineer to define a canonical reference DOM — a documented target page size (and shape)
that all runtime calculations are expressed against — rather than measuring against arbitrary,
undocumented fixtures.

This issue does not prescribe the number; that determination is the engineer's to make. What it
does require is that the determination be made deliberately. The following are offered purely as
example data points to feed that decision so the thinking happens — not as the answer:

  • Lighthouse's "Avoid an excessive DOM size" audit, a widely recognized public threshold for
    DOM weight: it warns once a page exceeds ~800 nodes and scores poorly around ~1,400 nodes,
    with companion guidance of DOM depth ≤ 32 levels and ≤ 60 child elements under any
    single parent.
  • axe's own documented pain point of >50K elements taking 10s+ (API.md:962-970), as the far end
    of the scale.

From inputs like these, the engineer should decide and document: the target size the budget's
headline number is anchored to, any secondary/warning band, whether larger sizes are tracked as a
scaling check (linear-growth, not pass/fail), and — importantly — the DOM shape, not just a
node count (realistic depth/breadth, plus variants for known-expensive cases like iframe-heavy or
color-contrast-heavy pages). Every ceiling in Dimensions C and D below is then stated per this
reference, with per-rule budgets normalized to a per-node cost (e.g. ms / 1,000 nodes) so they
stay fixture-independent.

Candidate budget dimensions to evaluate

A menu for the engineer to assess and recommend from — not a committed scope. Ordered roughly by
implementation realism, with the rough cost of eventual follow-up work flagged to inform
prioritization. Dimensions C and D would be expressed against the Reference DOM profile above.

| # | Dimension | Why it matters | How we'd measure it | New work |
|---|---|---|---|---|
| A | Minified bundle size (axe.min.js, English) | Lightest lift: runBytesize() already stats it in the build; a hard, consumer-facing payload number today (574,862 B). | Add a threshold check to runBytesize() (postbuild.mjs:20-32) that asserts against a committed ceiling, directly mirroring assertEsbuildImportLimits() (esbuild-core.mjs:26-40), which already fails the build the same way. Failing npm run build auto-gates the CI build job. | Threshold file + assert. Small. |
| B | Gzipped bundle size | Real-world delivery cost; raw bytes overstate network impact. | gzip the built axe.min.js inside runBytesize(), compare to ceiling. | gzip step (no gzip/brotli sizing exists today). Small. |
| C | Cold-run wall-clock on the reference DOM | The metric consumers actually feel. The headline budget is the total axe time on the chosen target page; any secondary/stress bands the engineer defines are tracked too (larger ones as a scaling check against the documented >50K pain, API.md:962-970). | Build the reference fixtures, run axe.run({ performanceTimer: true }), capture the total axe measure per band. Browser harness can reuse test/wtr.config.mjs; Node path can reuse the JSDOM job. | Fixtures + runnable harness + structured metric export. Medium to large; the flagship investment. |
| D | Per-rule time ceiling (per-node, anchored to the reference) | Catches a regression in one rule (historical hot spots: color-contrast, td-has-headers) before total time blows up. Granularity already exists. | Parse rule_[ruleId] measures from a reference run; assert each stays under a per-node ceiling (e.g. ms / 1,000 nodes) normalized to the reference so it's fixture-independent. Node counts already logged (rule.js:382-387). | Structured export (Gap 1) + per-rule baseline table. Medium; depends on C's harness. |
| E | Memory / peak heap | Catches allocation regressions invisible to wall-clock. | No existing hook. | Fully net-new. Defer; out of scope for v1. |

A plausible phasing the proposal might recommend: A first (near-free, immediate value), B
alongside it, then C as the flagship runtime budget, with D layered on once a harness emits
structured metrics. E deferred.
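
For Dimension C's "build the reference fixtures" step, one possible starting point is a synthetic generator with an exact node count. The sketch below is an assumption-laden illustration: buildFixtureHtml is hypothetical, its default limits simply reuse the Lighthouse guidance cited earlier (≤ 60 children per parent, depth ≤ 32), and a breadth-first tree of bare div elements is not a realistic page shape.

```javascript
// Hypothetical generator: builds a tree of <div> elements breadth-first,
// filling each parent up to maxChildren before moving to the next one.
// The root counts as one node, so the result always has at least one element.
function buildFixtureHtml(totalNodes, maxChildren = 60, maxDepth = 32) {
  const root = { children: [], depth: 1 };
  let count = 1;
  const queue = [root];
  while (count < totalNodes && queue.length > 0) {
    const parent = queue.shift();
    if (parent.depth >= maxDepth) continue; // depth limit reached, no children here
    while (parent.children.length < maxChildren && count < totalNodes) {
      const child = { children: [], depth: parent.depth + 1 };
      parent.children.push(child);
      queue.push(child);
      count++;
    }
  }
  const render = node => `<div>${node.children.map(render).join('')}</div>`;
  // count may be lower than totalNodes if the shape limits cap the tree.
  return { html: render(root), count };
}

// Example: an 800-element fixture, matching Lighthouse's ~800-node warning level.
const fixture = buildFixtureHtml(800);
```

A real reference fixture would also need representative content such as text, images, tables, and form controls, since many rules do no meaningful work on empty div elements.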

Enforcement options to consider

Background on where enforcement could eventually live, to inform the proposal's recommendation —
implementing it is follow-up work, not part of this issue.

  • Single merge gate already exists. deploy.yml:26-56 treats the whole Test workflow as the
    sole merge blocker, so any check added inside .github/workflows/test.yml inherits blocking status
    with no branch-protection change — likely cheaper than a separate workflow.
  • A/B (bundle): one surfaced option is to enforce in the build itself — have runBytesize()
    (postbuild.mjs:20-32) assert against committed ceilings the same way assertEsbuildImportLimits()
    already does, so a breach fails npm run build (and thus the existing build job, test.yml:55-71)
    with no new CI wiring. The build pipeline is unit-tested (npm run test:build), giving a second
    possible home for size assertions.
  • C/D (runtime): likely a dedicated job that download-artifacts the already-uploaded axe.js
    (test.yml:147) and runs a benchmark harness, kept separate from functional jobs so a timing flake
    is diagnosable in isolation.
  • Base-vs-head delta detection (regression %, not just an absolute ceiling) is the highest-value
    and least-obvious piece, since absolute ceilings drift as features land — worth a specific
    recommendation in the proposal.
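
The base-vs-head comparison in the last bullet reduces to a percent-change check. A minimal sketch, with hypothetical function names and a made-up head size and tolerance, could look like this:

```javascript
// Percent change from a base measurement to a head measurement.
function percentChange(base, head) {
  if (!(base > 0)) throw new RangeError('base must be positive');
  return ((head - base) * 100) / base;
}

// Flags a regression only when growth exceeds the allowed tolerance.
function isRegression(base, head, tolerancePercent) {
  return percentChange(base, head) > tolerancePercent;
}

// Illustrative: base is the axe.min.js size quoted above; the head size and
// the 2% tolerance are invented for this example.
const sizeRegressed = isRegression(574_862, 580_000, 2);
```

The same check applies to bytes or milliseconds; the harder part, as noted above, is obtaining a trustworthy base measurement at CI time.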

Questions the proposal should resolve

The investigation should land a recommendation on each of these (or explicitly defer it):

  1. Absolute ceilings, relative deltas, or both? Absolute is trivial with bytesize; relative
    catches creep but needs a baseline-fetch mechanism that doesn't exist yet.
  2. Where do baselines live? A committed JSON (analogous to sri-history.json) is the path of
    least resistance, but couples every legit perf change to a baseline-bump commit. Or fetch the
    base branch's build at CI time?
  3. Runtime environment for the budget. GitHub runners are noisy — wall-clock thresholds need
    generous tolerance (the timer's own tests use a 17ms ANIMATION_FRAME_TOLERANCE_MS). Browser
    (WTR) vs. Node/JSDOM (where color-contrast doesn't run, per README.md:83) yield materially
    different numbers; pick one as the gating environment.
  4. Structured-metrics API. Does performanceTimer gain a structured/returnable output (vs.
    console-only logMeasures), or does the harness scrape
    window.performance.getEntriesByType('measure') before marks are cleared? This is the
    prerequisite for C/D and the one change touching shipped code (performance-timer.js).
  5. Reference DOM anchor. What target size (and shape) does the headline budget anchor to? This
    is the engineer's call to make and document, using inputs like Lighthouse's ~800/~1,400-node
    thresholds and axe's >50K pain point as guidance. Also decide which variants (iframe-heavy,
    color-contrast-heavy) become official fixtures — these are a maintenance surface.
  6. Failure policy. Hard-fail the PR, or warn + label override? Which dimensions should block vs.
    report-only to start?
  7. Per-rule normalization. Budget on per-node cost rather than absolute ms so Dimension D stays
    fixture-independent?
  8. Bundle scope. English-only axe.min.js, or all locale variants (runBytesize() already
    loops over every langs suffix, postbuild.mjs:21-23)? Does axe.js (the published main)
    also get a budget, or only the minified artifact?
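
On question 3, one common way to dampen runner noise is to gate on the median of several runs plus an explicit tolerance. The sketch below illustrates that idea only; the function names, sample timings, budget, and tolerance are all hypothetical.

```javascript
// Median of a non-empty list of numbers; does not mutate the input.
function median(values) {
  if (values.length === 0) throw new RangeError('need at least one sample');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Passes when the median run time is within the budget plus a noise tolerance.
function withinBudget(samplesMs, budgetMs, toleranceMs) {
  return median(samplesMs) <= budgetMs + toleranceMs;
}

// Made-up samples: one noisy outlier (95 ms) does not move the median.
const samples = [40, 38, 95, 41, 39];
const passes = withinBudget(samples, 35, 5);
```

Using the median rather than the mean keeps a single slow run from failing the gate, at the cost of running the benchmark several times per job.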

Done = a budget proposal the team can act on

This issue is complete when the engineer has presented a written performance-budget proposal for
the team to review. That proposal should:

  • State a reference DOM target (size + shape), with the rationale and the data points
    considered — the central decision this work exists to force.
  • Recommend which dimensions to budget and propose concrete starting threshold values (or
    a clear method for deriving them).
  • Recommend how enforcement would eventually work — where it would plug into the build/CI and
    the failure policy — at enough depth to scope the follow-up, without building it.
  • Resolve (or explicitly defer, with a recommendation) the open questions above.
  • Break the implementation into follow-up tickets (e.g. metric export, benchmark harness,
    bundle-size gate, CI wiring) so the build work can be planned separately.

Out of scope for this issue

  • All implementation — harness, structured metric export, gzip sizing, threshold files,
    base-vs-head diffing, and CI gates are follow-up work once the proposal is accepted.
  • Memory / heap budgeting (Dimension E).
  • Per-check timing granularity and iframe-overhead timing.
  • External monitoring dashboards (SpeedCurve, Datadog, bundle-analyzer, etc.).

Everything cited above exists in-repo today and is provided as starting material for the
investigation. Everything marked net-new (benchmark harness, fixtures, structured metric export,
gzip sizing, base-vs-head diffing, threshold files) is follow-up work, not part of this issue.

Metadata

Assignees

No one assigned

    Labels

    performance (Performance related issues), ungroomed (Ticket needs a maintainer to prioritize and label)
