Name	Name	Last commit message	Last commit date
parent directory ..
candidates	candidates
data	data
lib	lib
schema	schema
tasks	tasks
.gitignore	.gitignore
README.md	README.md
check_benchmark.py	check_benchmark.py
reference_pipeline.py	reference_pipeline.py

AERS Benchmark

A small, reproducible, dependency-free benchmark of empirical-research agent behavior. Where eval-harness/ checks properties of an agent's prose, the benchmark checks numbers: given a real dataset with a known answer, does the pipeline recover it — and does it avoid the trap a naive pipeline falls into?

Task: `lalonde-recovery`

The LaLonde (1986) / Dehejia–Wahba (1999) data is the canonical observational causal-inference stress test, already vendored in this repo at demo-notebooks/_lalonde_data.csv (185 treated from the NSW experiment, 429 non-experimental CPS controls).

The lesson the benchmark encodes, all reproducible from the data:

Quantity	Value	Meaning
Naive ATT on `re78`	−$635	Treated earn less than the mismatched CPS controls — the naive comparison misleads.
Covariates with \|SMD\| > 0.25	5 of 8	Severe pre-treatment imbalance (black, married, re74, re75, hispan).
Regression-adjusted ATT	+$1,548	Conditioning on pre-period earnings flips the sign to positive.
Experimental benchmark	≈ +$1,794	Dehejia–Wahba (1999), from the original randomized NSW experiment.

A pipeline that handles this correctly surfaces the imbalance, does not report −$635 as the causal effect, and after adjustment recovers a positive estimate near the experimental benchmark.

Task: `card-iv-recovery`

Card (1995) instruments years of schooling with proximity to a 4-year college (nearc4), data vendored at demo-StatsPAI-skill/data/card.csv (3,010 observations). All values reproducible from the data:

Quantity	Value	Meaning
OLS return to schooling	0.075	Naive OLS estimate of the wage return.
IV return to schooling	0.131	2SLS with `nearc4` — exceeds OLS, the canonical surprise.
First-stage F (`nearc4`)	13.3	Real but only moderately strong instrument; must be reported.

A pipeline that handles IV correctly recovers a positive OLS return, an IV return that exceeds it, and reports the first-stage strength instead of assuming the instrument is strong.

Task: `did-staggered-recovery`

A deterministic simulated panel with 60 units over 10 periods, two treated cohorts, and never-treated controls. Untreated potential outcomes satisfy parallel trends, but treatment effects are heterogeneous and grow with event time, so plain TWFE is biased downward.

Quantity	Value	Meaning
True ATT on treated post observations	2.909	Recomputed from the shipped `y0` counterfactual column.
Plain TWFE coefficient	1.455	Biased downward under heterogeneous dynamic effects.
Group-time DID ATT	2.909	Uses not-yet-treated controls and recovers the true ATT.
Identifiable group-time cells	11	All post-treatment cells with valid not-yet-treated controls.

A pipeline that handles staggered DID correctly reports TWFE as a diagnostic and uses a group-time / not-yet-treated comparison as the main estimate.

Task: `rdd-recovery`

A deterministic, noiseless sharp regression-discontinuity design: 101 points, running variable x on [-1, 1], cutoff at 0, sharp treatment D = 1[x >= 0]. The untreated outcome is linear in x with different slopes on either side of the cutoff, and treatment adds a constant jump. The data ships a y0 counterfactual column the estimators never read, so the true jump is recomputed by the checker as mean(y - y0) over treated rows.

Quantity	Value	Meaning
True effect at the cutoff	3.000	The jump in the conditional mean at `x = 0`, by construction.
Naive across-cutoff mean difference	5.510	Confounds the jump with the running-variable trend — badly biased.
Global common-slope OLS (`y ~ 1 + D + x`)	2.940	A mild specification bias from forcing one slope on two.
Local-linear at the cutoff	3.000	Recovers the true jump; bandwidth-robust on exactly-linear sides.

A pipeline that handles RD correctly recognizes the treatment effect is the jump at the cutoff (not a difference in side means), so it controls for the running-variable trend with a local-linear fit instead of comparing averages across the threshold. The reference uses local linear rather than a global high-order polynomial, following Gelman & Imbens (2019); see also Imbens & Lemieux (2008) and Lee & Lemieux (2010).

Task: `bad-control-recovery`

A deterministic, noiseless cross-section of 120 units. Treatment d is assigned orthogonally to a pre-treatment covariate x (no omitted-variable confounding), and a mediator m sits on the path d -> m -> y. The data ships y0/y1 potential-outcome columns the estimators never read, so the checker recomputes the true total effect as mean(y1 - y0).

Quantity	Value	Meaning
True total effect of `d`	2.500	`mean(y1 - y0)`, by construction (0.5 direct + 2.0 through `m`).
`y ~ d` (naive)	2.500	Unbiased here — treatment is unconfounded.
`y ~ d + x` (good control)	2.500	Adjusting for the pre-treatment covariate is harmless.
`y ~ d + x + m` (bad control)	0.500	Conditioning on the mediator collapses to the direct effect — biased.

A pipeline that handles controls correctly recovers the total effect when adjusting only for pre-treatment covariates, and recognizes that adding the post-treatment mediator biases the estimate (it does not headline the 0.5 as the treatment effect). This is the good/bad-controls lesson of Cinelli, Forney & Pearl (2022) and Angrist & Pischke's Mostly Harmless Econometrics: the issue is not that all post-treatment variables are bad, but that conditioning on a descendant of the treatment changes the estimand.

What makes the golds trustworthy

The checker recomputes the data-derived golds (imbalance count, the true naive ATT, the true SMD table, the IV/TWFE coefficients, and the simulated staggered-DID, sharp-RD, and bad-control estimands) every run, then compares the candidate's reported numbers against them. A candidate cannot pass by fabricating a clean balance table or a flattering effect — the honest-reported-numbers gold cross-checks reported values against the data or deterministic DGP. Only the experimental LaLonde benchmark (~$1,794) is a literature constant, and its gold is marked non-required and generously toleranced, because observational methods are genuinely not guaranteed to nail it.

Run it

# 1. Check committed reference candidates (pure-stdlib pipelines, one per task)
python3 benchmark/reference_pipeline.py --check

# 2. Validate task specs and committed reference candidate metadata
python3 benchmark/check_benchmark.py --lint

# 3. Grade all tasks against the golds
python3 benchmark/check_benchmark.py
#    -> lalonde-recovery 15/15, card-iv-recovery 14/14, did-staggered-recovery 12/12,
#       rdd-recovery 12/12, bad-control-recovery 12/12, no required failures

# CI/reference gate: fail on required misses and optional-gold drift
python3 benchmark/check_benchmark.py --strict --fail-on-partial --fail-on-orphan-results

# 4. Regenerate committed references after intentional benchmark logic changes
python3 benchmark/reference_pipeline.py

# 5. Grade one task / a real agent run (drop its results.json in a candidate dir)
python3 benchmark/check_benchmark.py --task card-iv-recovery
python3 benchmark/check_benchmark.py --candidate <run-name>

Candidate directory names are single path segments under benchmark/candidates/ and must match [A-Za-z0-9][A-Za-z0-9._-]*; the checker rejects path separators or absolute paths before opening results.json.

Candidate `results.json` contract

The machine-readable schema is schema/candidate.schema.json. The Python checker remains authoritative because it also compares reported numbers against recomputed data golds.

{
  "task": "lalonde-recovery",
  "method": "OLS regression adjustment (full controls incl. re74, re75)",
  "n_treated": 185, "n_control": 429,
  "naive_att": -635.0,
  "adjusted_att": 1548.2,
  "balance": {"age": -0.242, "black": 1.668, "married": -0.719, "...": 0.0}
}

Any pipeline (the StatsPAI/Python/R/Stata skills, or an agent run) can emit this shape; the checker is pipeline-agnostic. The top-level "task" field is required and must match the benchmark task id being graded, so a stale results.json from another task cannot accidentally pass. Reported estimates and SMDs must be JSON numbers, not strings; malformed numeric fields are rejected before scoring so type errors cannot masquerade as benchmark failures. Strict gates also use --fail-on-orphan-results so ignored JSON scorecards left behind by renamed or deleted tasks cannot be mistaken for current coverage.

Files

benchmark/
  tasks/lalonde-recovery.toml   # observational DiD/matching recovery task
  tasks/card-iv-recovery.toml   # IV (returns-to-schooling) recovery task
  tasks/did-staggered-recovery.toml # staggered-DID TWFE-trap task
  tasks/rdd-recovery.toml       # sharp-RDD running-variable-trend trap task
  tasks/bad-control-recovery.toml   # post-treatment / bad-control bias task
  schema/task.schema.json       # JSON Schema documenting task TOML shape
  schema/candidate.schema.json  # JSON Schema documenting candidate results.json
  lib/lalonde.py                # pure-stdlib loaders, SMD, naive ATT, OLS
  lib/card.py                   # pure-stdlib OLS+SE, first-stage F, 2SLS
  lib/simdid.py                 # deterministic staggered-DID DGP and estimators
  lib/rdd.py                    # deterministic sharp-RD DGP and local-linear fit
  lib/badcontrol.py             # deterministic mediator DGP, total vs direct effect
  reference_pipeline.py         # writes committed reference candidates
  check_benchmark.py            # grades all tasks, recomputing data golds
  candidates/reference-*/       # the committed reference candidates
  results/                      # generated scorecards

Anti-fabrication, demonstrated

Tampering with a candidate to claim perfect balance and a positive naive effect fails four required golds, including the cross-check that recomputes from data:

[FAIL]* surfaces-imbalance        0 covariates with |SMD|>0.25 (need >= 3)
[FAIL]* naive-is-negative         naive_att = 2000.0 (want negative)
[FAIL]* adjusted-flips-positive   swing -452, need +1000 & positive
[FAIL]* honest-reported-numbers   naive_att 2000.0 vs true -635.0; SMD[black] 0.01 vs true 1.668
Score: 2/15

Extending

Add a task by dropping a new tasks/<id>.toml, a compute_truth branch, and any new gold-check handlers in check_benchmark.py. The checker validates task ids, repo-relative data-file paths, gold ids, known check names, required fields, and candidate task metadata before scoring. Keep schema/task.schema.json and schema/candidate.schema.json in sync with any new task/check/result fields so editors and reviewers see the same contract as the Python validator. Keep tasks deterministic and small enough to run in CI without third-party packages.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

AERS Benchmark

Task: `lalonde-recovery`

Task: `card-iv-recovery`

Task: `did-staggered-recovery`

Task: `rdd-recovery`

Task: `bad-control-recovery`

What makes the golds trustworthy

Run it

Candidate `results.json` contract

Files

Anti-fabrication, demonstrated

Extending

FilesExpand file tree

benchmark

Directory actions

More options

Directory actions

More options

Latest commit

History

benchmark

Folders and files

parent directory

README.md

AERS Benchmark

Task: lalonde-recovery

Task: card-iv-recovery

Task: did-staggered-recovery

Task: rdd-recovery

Task: bad-control-recovery

What makes the golds trustworthy

Run it

Candidate results.json contract

Files

Anti-fabrication, demonstrated

Extending

Task: `lalonde-recovery`

Task: `card-iv-recovery`

Task: `did-staggered-recovery`

Task: `rdd-recovery`

Task: `bad-control-recovery`

Candidate `results.json` contract