studio-11-co/falsify-cookbook

PRML Cookbook

Short, opinionated patterns for using PRML in real ML evaluation pipelines.

This is the field manual for the PRML specification. The spec tells you what a manifest is. The cookbook tells you how to use it without shooting yourself in the foot.

Every pattern is:

  • One page — read in under three minutes
  • Self-contained — the example runs end-to-end with the snippets shown
  • Failure-mode-first — what goes wrong is named before what goes right

Patterns

| # | Pattern | When to use |
|---|---------|-------------|
| 1 | Single-shot eval claim | One model, one benchmark, one number: the 90% case. |
| 2 | Multi-seed eval claim | When you report mean ± std over N seeds. |
| 3 | Streaming Elo / arena eval | Live leaderboards. (Uses v0.2 streaming variant.) |
| 4 | Dataset version pinning | Benchmarks evolve; how to commit to a specific revision. |
| 5 | CI gate via prml-verify-action | Block PRs that ship a model with a tampered eval claim. |
| 6 | Public registry anchoring | When and when not to publish your hash publicly. |
| 7 | Revocation | Withdrawing a manifest after publication. (v0.2 feature.) |
| 8 | Pre-registration without infrastructure | The minimum-viable workflow: a YAML file and sha256sum. |
| 9 | RLHF win-rate evaluations | Judge-model comparisons (AlpacaEval, MT-Bench, Arena-Hard). |
| 10 | Federated evaluation | Multi-org replication: shared hash, distinct producers, regulator-grade audit trail. |
| 11 | PRML + Sigstore for execution integrity | Closes the §8.1 gap: who ran the eval, when, against which exact artefacts. |
| 12 | PRML in Hugging Face model cards | Make the accuracy number on a published HF model card verifiable, not trust-me prose. |
| 13 | PRML + commit-reveal validation for independence attestation (runnable) | Closes the other §8.1 gap: structural proof that independent evaluators couldn't coordinate verdicts. Co-authored with ValiChord. |
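As a rough illustration of the "YAML file and sha256sum" workflow named in Pattern 8, the sketch below hashes the raw bytes of a manifest before any evaluation runs. The manifest fields and the `commit` helper are hypothetical and are not taken from the PRML schema. Hashing raw file bytes is also an assumption here; the spec may define its own canonical form.

```python
import hashlib

# Hypothetical manifest content. The field names are illustrative
# assumptions, not the PRML schema.
MANIFEST = b"""\
model: example-model
benchmark: example-benchmark
metric: accuracy
threshold: 0.90
direction: ">="
"""


def commit(manifest_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the raw manifest bytes.

    For a file with identical bytes, this is the same digest that
    `sha256sum manifest.yaml` prints.
    """
    return hashlib.sha256(manifest_bytes).hexdigest()


# Compute and record the digest BEFORE the evaluation runs.
digest = commit(MANIFEST)
print(digest)
```

Record the printed digest somewhere you cannot quietly rewrite (a commit, a signed message, a registry) and only then start the run.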

Anti-patterns

| # | Anti-pattern | Why it bites |
|---|--------------|--------------|
| A1 | Computing the hash after the run | The whole point is committing before. |
| A2 | Editing the manifest "to fix a typo" | Any edit breaks the hash. Use revocation. |
| A3 | Storing private data in the manifest | The hash is published; the manifest content might be too. |
| A4 | Treating the hash as proof of truth | The hash proves commitment, not correctness. |
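A2 and A4 are easy to see in a few lines. The sketch below uses a made-up two-line manifest and a made-up observed accuracy. It shows that a cosmetic edit changes the digest, and that a matching digest says nothing about whether the claimed threshold was actually met.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical manifest bytes, committed before the run.
original = b"metric: accuracy\nthreshold: 0.90\n"
committed = digest(original)

# A2: a "harmless" edit (0.90 -> 0.9) changes the bytes, so the
# digest no longer matches the committed one.
edited = original.replace(b"0.90", b"0.9")
hash_matches_after_edit = digest(edited) == committed

# A4: an untouched manifest still matches, but that only proves the
# claim was not changed. Whether the claim holds is a separate check.
hash_matches = digest(original) == committed
observed_accuracy = 0.85  # hypothetical result of the run
claim_holds = observed_accuracy >= 0.90

print(hash_matches_after_edit, hash_matches, claim_holds)
```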

Reference

  • Identity levels (0–4) — a non-normative ladder for the binding strength between producer and the real-world authoring entity. Used by Pattern 11 and the v0.3 RFC.

Audit & compliance crosswalks

Subcategory-by-subcategory maps from major AI governance frameworks to PRML fields, each tagged FULL, PARTIAL, or NONE.

Examples

Working code in examples/:

  • pytorch-imagenet/: Full example. PRML manifest before a PyTorch ImageNet eval, hash committed, post-run verification
  • stable-baselines3-rl/: RL agent on LunarLander-v2, mean episode reward claim, threshold direction >=
  • inspect-ai-refusal/: Refusal-rate eval via Inspect AI, PRML pre-registration via falsify-inspect
  • huggingface-eval/: lm-eval-harness integration, multi-task pre-registration
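To make "mean episode reward claim, threshold direction >=" concrete, here is a minimal sketch of how a post-run check might compare an aggregate metric against a pre-registered threshold. The `check_claim` function, its return shape, and the reward values are hypothetical. None of it is code from the examples directory or the PRML tooling.

```python
import operator
from statistics import mean, stdev

# Supported comparison directions (an assumed set for illustration).
_DIRECTIONS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def check_claim(values, threshold, direction):
    """Compare the mean of `values` to `threshold` using `direction`.

    Returns the mean, the sample standard deviation (0.0 for a single
    value), and whether the claim passed.
    """
    if direction not in _DIRECTIONS:
        raise ValueError(f"unsupported direction: {direction!r}")
    m = mean(values)
    s = stdev(values) if len(values) > 1 else 0.0
    return {"mean": m, "std": s, "passed": _DIRECTIONS[direction](m, threshold)}


# Hypothetical per-seed mean episode rewards.
result = check_claim([210.0, 190.0, 200.0], threshold=200.0, direction=">=")
print(result)
```

Fixing the direction in the manifest up front matters because it removes the option to choose "higher is better" or "lower is better" after seeing the numbers.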

License

  • Documentation, patterns, examples: CC0 1.0 — public domain dedication. Mirror, fork, modify without attribution.
  • Any tooling: MIT.

Contributing

Pattern proposals welcome via PR. Each new pattern must:

  1. Solve a real problem someone hit while implementing PRML
  2. Be reproducible — name the tools and their versions
  3. Include a "what doesn't work" section (we are not selling)
  4. Be under 800 words

Open an issue first if you're unsure whether your pattern fits.

Authors

Cüneyt Öztürk
Contact: hello@falsify.dev · falsify.dev


Status

  • v0.1 stable. v0.2 RFC open through 2026-05-22 — spec.falsify.dev/v0.2-rfc.
  • The PRML JSON Schema is in the SchemaStore catalog (merged 2026-05-11), so *.prml.yaml files autocomplete in VS Code, JetBrains, Helix, Zed, and Cursor out of the box.

Contributing

See CONTRIBUTING.md and the good first issue label for scoped work.

Cite the spec: Öztürk, C. (2026). PRML v0.1. Zenodo. https://doi.org/10.5281/zenodo.20177839