xtask/ is the home for Provenant's Rust-based maintainer workflows that are
intentionally coupled to Provenant internals or to the repo-built provenant
binary. Small, self-contained hot-path tools that benefit from package-boundary
isolation live as separate workspace crates under tools/; the current
example is tools/license-headers/.
Run these commands directly with:
cargo run --manifest-path xtask/Cargo.toml --bin <command> -- ...| Command | Purpose |
|---|---|
benchmark-target |
Measure Provenant against an explicit local or remote benchmark target. |
compare-outputs |
Compare Provenant and ScanCode raw outputs, either by running both scanners on one target or by comparing two existing JSON files directly. |
perf-ab |
A/B-time a scan across two Provenant git refs (self before/after) and report the head-vs-base speedup. |
update-parser-golden |
Regenerate parser .expected.json fixtures from current Rust parser output. |
update-copyright-golden |
Maintain copyright golden YAML fixtures with parity-gated or Rust-owned update modes. |
update-license-golden |
Maintain license golden YAML fixtures with parity-gated or Rust-owned update modes. |
validate-urls |
Validate URLs in production docs and Rust docstrings. |
generate-output-field-reference |
Regenerate docs/OUTPUT_FIELD_REFERENCE.md from output-schema documentation metadata. |
generate-serve-openapi |
Regenerate the checked-in OpenAPI document for provenant serve. |
generate-supported-formats |
Regenerate docs/SUPPORTED_FORMATS.md from parser metadata. |
generate-benchmark-chart |
Regenerate the benchmark duration-vs-files SVG from timing rows in docs/BENCHMARKS.md. |
generate-index-artifact |
Regenerate the embedded license index artifact from ScanCode rules and licenses. |
classify-rule-overmatch |
Classify upstream license rules by overmatch-risk class and rank un-covered overlay candidates. |
benchmark-target measures Provenant against an explicitly supplied benchmark
target and reports a repeated-run matrix for:
- uncached runs
- incremental runs
This makes it useful for checking repeated-run speedups on unchanged input.
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --help
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --repo-url https://github.com/org/repo.git --repo-ref main --profile common
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --target-path /path/to/local/directory --profile common-with-compiled
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --repo-url https://github.com/org/repo.git --repo-ref v1.2.3 --profile licenses
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --target-path /path/to/local/directory --profile packages
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --repo-url https://github.com/org/repo.git --repo-ref <sha> --profile common
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --target-path /path/to/local/directory --profile common
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --repo-url https://github.com/org/repo.git --repo-ref <sha> -- -clupe
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --target-path /path/to/local/directory -- --timeout 300 --license-text
cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- --target-path /path/to/local/directory -- --license --packageCLI arguments:
- Exactly one of
--repo-urlor one-or-more--target-pathvalues is required. --repo-url URL: benchmark the given repository URL via the shared repo cache.--repo-ref REF: required with--repo-url; commit SHA, tag, or branch to resolve and benchmark.--target-path PATH: benchmark an existing local directory in place.--profile common: convenience shorthand for-clupe --system-package --strip-root --processes 4.--profile common-with-compiled: convenience shorthand for-clupe --system-package --package-in-compiled --strip-root.--profile licenses: convenience shorthand for-l --strip-root.--profile packages: convenience shorthand for-p --strip-root.- Pass either a supported
--profileor explicit benchmark scan flags after--. - A common explicit profile is
-clupe(--copyright --license --url --package --email).
- Either scans a local directory passed via
--target-pathor resolves--repo-url+--repo-refthrough a shared repo cache. - Builds Provenant in release mode.
- Updates or creates a shared shallow repo cache under
.provenant/repo-cache/, fetches only the requested ref at depth 1, resolves it to a full commit SHA, and materializes a detached checkout for the run. - Runs cold/warm scenarios with isolated cache roots while forwarding the requested Provenant scan flags unchanged.
- Writes a run manifest plus benchmark results under
.provenant/benchmarks/. - Prints a summary table with wall time, key phase timings, peak RSS, and incremental reuse signals.
For each scenario, the command writes:
results/<scenario>/scan-output.jsonresults/<scenario>/provenant-stdout.txtrun-manifest.json
It also writes a tab-separated summary file at:
results/summary.tsv
Besides wall, engine, scan, and total times, the summary table and
results/summary.tsv record cumulative worker time for each per-detection
scan phase reported by the Provenant progress summary: scan:packages,
scan:licenses, scan:copyrights, scan:emails, scan:urls, and
scan:info. Phases for detections the scan flags did not enable stay empty.
- The command uses an explicit per-scenario
--cache-dirso incremental manifest results do not leak across scenarios. - The command also adds
--no-license-index-cacheto every Provenant invocation so repeated benchmark runs do not inherit warmed license-index state from earlier repositories. --target-pathmode scans the directory in place; it does not reset, stash, or otherwise mutate that path.--repo-urlmode requires--repo-ref; the command resolves that ref to a full commit SHA and records the exact SHA inrun-manifest.json.- Warm-run comparisons are meaningful only within one invocation because the command recreates
.provenant/benchmarkson every run. - Benchmark artifacts are kept in the repo-local
.provenant/developer artifact directory rather than/tmp, so they stay near future comparison runs and are easier to inspect before cleanup. - Repo URL runs reuse cached git objects from
.provenant/repo-cache/instead of recloning the upstream repository on every invocation. run-manifest.jsonrecords the Provenant binary version plus the current Provenant repository revision, dirty state, and diff hash for the run, so benchmark snapshots stay attributable as the scanner evolves.- On macOS, the command falls back to
/usr/bin/time -l; on systems with GNUtime, it uses verbose memory reporting automatically.
compare-outputs compares Provenant and ScanCode raw outputs and produces
reduced comparison artifacts for later manual or agent review.
It supports two input modes:
- scan-target mode: run both scanners on the same repository or local target
- direct-json mode: compare one existing ScanCode JSON file and one existing Provenant JSON file without rerunning either scanner
- Scan-target mode requires Docker on ScanCode cache misses.
- Scan-target mode builds a local ScanCode Docker image from the bundled
reference/scancode-toolkitsubmodule automatically when the matching image is missing and a ScanCode run is required. - Direct-json mode only requires readable input JSON files.
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --help
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --repo-url https://github.com/org/repo.git --repo-ref main --profile common
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --repo-url https://github.com/org/repo.git --repo-ref <sha> --profile common --build-mode fast-iteration
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --target-path /path/to/local/directory --profile common-with-compiled
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --repo-url https://github.com/org/repo.git --repo-ref v1.2.3 --profile licenses
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --target-path /path/to/local/directory --profile packages
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --repo-url https://github.com/org/repo.git --repo-ref <sha> --profile common
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --target-path /path/to/local/directory --profile common
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --target-path /path/to/local/directory -- --license --package --strip-root
cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --scancode-json /path/to/scancode.json --provenant-json /path/to/provenant.jsonCLI arguments:
- Choose exactly one input mode: either
--repo-url, one-or-more--target-pathvalues, or the pair--scancode-json+--provenant-json. --repo-url URL: compare the given repository URL via the shared repo cache.--repo-ref REF: required with--repo-url; commit SHA, tag, or branch to resolve and compare.--target-path PATH: compare an existing local directory in place, or repeat the flag to stage multiple local files into one compare run.--scancode-json PATH: compare an existing ScanCode output JSON directly, without rerunning ScanCode.--provenant-json PATH: compare an existing Provenant output JSON directly, without rerunning Provenant.--scancode-cache-identity ID: optional with--target-path; opt in to shared ScanCode cache reuse for a caller-asserted local snapshot identity.--build-mode optimized|fast-iteration: choose how scan-target mode builds the local Provenant binary. The defaultoptimizedmode usescargo build --releasefor benchmark-grade timing.fast-iterationusescargo build --profile ci-releasefor quicker output-focused reruns while you are still triaging changes.--profile common: convenience shorthand for-clupe --system-package --strip-root --processes 4.--profile common-with-compiled: convenience shorthand for-clupe --system-package --package-in-compiled --strip-root.--profile licenses: convenience shorthand for-l --strip-root.--profile packages: convenience shorthand for-p --strip-root.- In scan-target mode, pass either a supported
--profileor explicit shared scan flags after--. - Direct-json mode compares the provided JSON files as-is and does not accept
--profileor explicit scan flags after--. - Build mode only matters in scan-target mode because direct-json mode compares already-produced raw outputs without rebuilding Provenant.
- Compare reduction still applies compatibility-aware in-memory normalization for known intentional differences such as Provenant's raw-default file-level copyright rendering, so parity review does not fail noisily just because Provenant preserved
All rights reservedor punctuation in the saved JSON.
- Creates a per-run artifact directory under
.provenant/compare-runs/. - In scan-target mode, either scans the local directory in place or resolves
--repo-url+--repo-refthrough a shared repo cache. - In scan-target mode, builds Provenant in the selected build mode (
optimizedby default viacargo build --release;fast-iterationviacargo build --profile ci-release). - In scan-target mode, updates or creates a shared shallow repo cache under
.provenant/repo-cache/, fetches only the requested ref at depth 1, resolves it to a full commit SHA, and materializes a detached checkout for the run. - In scan-target mode, resolves the ScanCode runtime identity and, on cache misses, ensures a local Docker-backed ScanCode runtime exists by building the image from
reference/scancode-toolkitif needed. - In scan-target mode, reuses cached ScanCode raw artifacts when available, otherwise runs ScanCode alongside Provenant with the same shared scan profile and ephemeral license-cache directories.
- In direct-json mode, copies the provided raw JSON files into the run artifact directory without rerunning either scanner.
- Saves raw outputs and logs under
raw/. - Produces reduced comparison artifacts under
comparison/and prints the absolute artifact paths at the end.
Each run writes artifacts under:
.provenant/compare-runs/<run-id>/
Core files:
run-manifest.jsonraw/scancode.jsonraw/provenant.jsoncomparison/summary.jsoncomparison/summary.tsvcomparison/samples/*.json
comparison/summary.json includes a neutral comparison_status. review_required
means the reduced compare found deltas that still need reviewer or agent
classification; it is not itself a regression verdict. The summary also exposes
directional signal buckets under comparison_signal_summary plus top-level
directional maps for the ScanCode-favored and Provenant-favored sides.
Alongside the identity-only package_data bucket, the summary also reports a
package_field_content axis (mirrored in package_field_content_summary and
the package_field_content_value_differences.json sample). For packages that
both outputs agree exist (matched by purl, else the
type|name|version|datasource_id fallback, across both top-level packages[]
and file-level files[].package_data[]), it diffs the content of
declared_license_expression, declared_license_expression_spdx, and holder.
This catches a change that drops or corrupts declared-license/holder content
even when every package identity still matches. Each difference is counted in
exactly one of three reconcilable buckets: missing_in_provenant (content only
on the ScanCode side), extra_in_provenant (content only on the Provenant
side), and value_vs_value_mismatch (both sides carry content but the values
differ), so missing + extra + value_vs_value_mismatch always equals the total
number of differences and sum(by_field.values()). Because it compares content
for the first time, enabling it can surface pre-existing ScanCode-vs-Provenant
declared-license deltas that have nothing to do with any recent change; that is
expected, valuable parity signal to classify, not automatically a regression.
comparison/samples/field_value_frequency.json is a cross-file frequency
rollup of the per-file value diffs in file_metric_value_differences.json. For
each field (authors, holders, copyrights, license expressions/clues, etc.) it
aggregates the value-level differences across all common-path files and
ranks them by total occurrence count, so systematic patterns become visible
(for example one author value appearing hundreds of times instead of as
hundreds of scattered one-count entries). It is framed by direction
(extra_in_provenant = PV-only, extra_in_scancode = SC-only); PV-extra is
not inherently junk, since much of it is legitimate source-faithful or
richer output, so the rollup is a triage aid and the reviewer decides
junk-vs-advantage. It is a pure diagnostic: it never feeds comparison_status,
the signal counts, or any pass/fail gate. summary.json also surfaces a short
per-field top-N preview under field_value_frequency_top.
Optional diagnostic logs when available:
raw/scancode-stdout.txtraw/provenant-stdout.txt
- The command keeps the full raw scanner outputs; it does not stream giant machine-readable payloads to stdout.
- Stdout is reserved for progress, a reduced summary table, and the saved artifact paths.
- ScanCode currently runs via Docker on all platforms for this workflow because that is the reproducible runtime path verified in this repository.
- Direct-json mode skips Docker, Provenant binary builds, cache preparation, and scanner execution. It compares the provided raw JSON files exactly as supplied.
- Use the default
--build-mode optimizedwhenever the run's timing may be recorded indocs/BENCHMARKS.mdor compared seriously against ScanCode wall-clock numbers. - Use
--build-mode fast-iterationwhen you only need a quicker compare rerun to see whether a code change affected output. Before recording benchmark timing or refreshing a benchmark entry, rerun the target in the default optimized mode. - When ScanCode caching is enabled,
compare-outputsprewrites the cache manifest before the Docker run starts. If an outer wrapper times out after ScanCode has begun but the container keeps running, wait for that container first and only salvage the finishedraw/scancode.json(andraw/scancode-stdout.txtwhen present) into the printed ScanCode cache directory whendocker waitexits0. If the container exits non-zero, do not populate the cache from that run. - When
compare-outputsactually executes either scanner, it disables persistent license-cache reuse for fairness: Provenant runs with--no-license-index-cache, and ScanCode uses container-local ephemeral cache directories. compare-outputspasses the same shared scan args to both scanners. Thecommonprofile includes installed package database coverage and fixes the shared worker count at--processes 4, which is usually a no-op on ordinary source repositories but matters for extracted rootfs/container trees and other artifact targets. Usecommon-with-compiledwhen you also want Go/Rust compiled-binary package extraction in the shared scan profile.- The shared worker count is a fixed, single-sourced constant (
BENCHMARK_SCAN_PROCESSESinxtask/src/common.rs), not a host-derived value. This matters for fairness and cross-row comparability: ScanCode is markedly process-count sensitive, so an implicit or per-host worker count would silently make rows recorded on different hosts or at different times non-comparable. Keep both scanners on the same fixed count, and treat any change to that count as a deliberate full re-measurement of the benchmark matrix (serial, many hours) rather than a casual edit. Override it ad hoc for a one-off run by passing explicit scan flags (including your own--processes <n>) after--instead of--profile common; that does not change the recorded standard. - Direct-json mode infers whether info/classify/summary-style sections should be compared from the JSON contents themselves, so it does not require replaying the original scan flags just to unlock those diffs.
- For
--profile common, the ScanCode Docker invocation also adds--memory 12g --memory-swap 12g, and that runtime cap is part of ScanCode cache validation. --repo-urlmode requires--repo-ref; the command records both the requested ref and the resolved full commit SHA inrun-manifest.json.- In scan-target mode,
run-manifest.jsonalso records the Provenant binary version plus the current Provenant repository revision, dirty state, and diff hash, alongside the ScanCode runtime identity. Direct-json mode records placeholder command/runtime fields because it compares the supplied raw JSON files without executing either scanner. - Repo URL runs reuse cached git objects from
.provenant/repo-cache/, fetch only the requested ref shallowly, and remove the temporary detached checkout after the run so compare artifacts do not retain duplicate full repository trees. - Repo URL runs also reuse cached raw ScanCode artifacts from
.provenant/scancode-cache/when the resolved target commit, ScanCode runtime identity, and effective ScanCode scan args are unchanged. - Local
--target-pathruns rerun ScanCode by default. Pass--scancode-cache-identity <id>to opt into shared ScanCode raw-artifact reuse for a local snapshot you have identified explicitly. - For local target-path cache hits, the path itself is not the cache identity. The cache key is derived from your explicit
--scancode-cache-identityplus the effective ScanCode runtime/args, so reusing the same identity across different local paths will intentionally hit the same cache entry when the staged snapshot is meant to be the same. - For local target-path cache hits, keep all of these stable between runs: the explicit
--scancode-cache-identity, the effective ScanCode scan args/profile, the ScanCode runtime identity (image, runtime revision/dirty state/diff hash, Docker platform), and the ScanCode memory limits thatcompare-outputsapplies. - Treat
--scancode-cache-identityas a caller-owned snapshot label, not a convenience name. If the local artifact contents changed in any meaningful way, bump the identity yourself; xtask validates the manifest/runtime/args, but it does not hash local target contents for you. - Repeated
--target-pathvalues currently support files only and are mainly intended for multi-input--from-jsonreplay compares. The harness stages those files under one temporary input directory and passes them explicitly to both scanners in the same order. - For repeated local
--target-pathfile inputs, cache hits also depend on keeping the same file list order, because xtask stages them as ordered numbered filenames under one temporary input tree before invoking ScanCode. - When shared scan args reference a local auxiliary file for both scanners (for example
--license-policy <path>,--license-rules-path <path>, or--custom-template <path>),compare-outputsstages that file separately and rewrites the ScanCode Docker path automatically so both scanners see the same auxiliary input. - Cache hits now require a cached
scancode.jsonplus cachemanifest.json;scancode-stdout.txtis reused when available but is no longer required for cache completeness. - A quick target-path rerun checklist for expected ScanCode cache hits is: same
--scancode-cache-identity, same--profileor explicit scan args, same auxiliary inputs, same local file order when repeating--target-path, and no ScanCode runtime change since the cache was written. scancode-stdout.txtandprovenant-stdout.txtare best-effort diagnostic logs. The compare pipeline only requires the JSON outputs, so a log-write failure no longer makes the command fail.- Path-only buckets in
comparison/summary.json,comparison/summary.tsv, andcomparison/samples/*.jsondescribe final output membership, not proven scan coverage. When the compared outputs used--only-findings, a path shown only on one side can simply mean the other side scanned it but filtered it away because no findings remained. - The command adds Git control-path ignore rules (
.git, nested.git, and their contents) on the ScanCode side so repository metadata does not dominate the comparison artifacts without hiding package-adjacent files such as.gitmodules. Provenant already excludes.gitdirectories during path collection by default, so xtask does not need to restate those ignores for Provenant. The harness no longer injects a blankettarget/*ignore because some upstream repositories legitimately usetarget/as source content.
perf-ab answers "did my code change actually make Provenant faster?". It
builds two Provenant git refs in release mode and interleaved-A/B-times a
scan against the same target, reporting per-side medians and the head-vs-base
speedup.
This is a self before/after comparison and is intentionally different from
benchmark-target and compare-outputs, which measure Provenant against
ScanCode. perf-ab never runs ScanCode and its timings do not belong in
docs/BENCHMARKS.md; it is for attributing a wall-clock change to your own
edit. The companion benchmark-perf-change skill describes the full
profile-first, regression-guarded workflow this command supports.
cargo run --manifest-path xtask/Cargo.toml --bin perf-ab -- --help
cargo run --manifest-path xtask/Cargo.toml --bin perf-ab -- --base origin/main --head HEAD --target-path /path/to/repo -- -c -n 1
cargo run --manifest-path xtask/Cargo.toml --bin perf-ab -- --base origin/main --head HEAD --repo-url https://github.com/org/repo.git --repo-ref <sha> --rounds 7 --profile licenses
cargo run --manifest-path xtask/Cargo.toml --bin perf-ab -- --base <ref> --head <ref> --target-path /path/to/repo --check-output -- -clupe -n 1
cargo run --manifest-path xtask/Cargo.toml --bin perf-ab -- --base-bin ./base/provenant --head-bin ./head/provenant --target-path /path/to/repo -- -c -n 1CLI arguments:
--base REF: base git ref to build and time (the "before" side). Defaultorigin/main.--head REF: head git ref to build and time (the "after" side). DefaultHEAD.- Choose exactly one target: either
--repo-url(with--repo-ref) via the shared repo cache, or--target-path PATHfor a local directory scanned in place. --repo-ref REF: required with--repo-url; commit SHA, tag, or branch of the target repo.--rounds N: number of interleaved timed rounds per side after the discarded warmup. Default5.--base-bin PATH/--head-bin PATH: use a prebuilt binary for that side and skip building the ref.--check-output: scan with both binaries to JSON and assert byte-identical output after normalizing the volatile header and randomly-generated package UIDs. The correctness gate for pure-refactor perf changes.--profile common|common-with-compiled|licenses|packages: scan-flag shorthand shared by both sides. Mutually exclusive with explicit flags after--.- Pass either a supported
--profileor explicit scan flags after--(for example-- -c -n 1). The same flags are forwarded to both binaries.
- Resolves the scan target: either a local
--target-pathscanned in place, or--repo-url+--repo-refmaterialized through the shared repo cache under.provenant/repo-cache/and removed after the run. - Builds each ref in release (
cargo build --release --bin provenant) into a stable per-ref worktree under.provenant/perf-ab-build/, backed by a dedicated bare clone of the current repo so the developer's working tree is untouched.--base-bin/--head-binskip the build for that side. - Optionally runs the
--check-outputcorrectness gate before timing. - Runs one discarded warmup scan per binary, then
--roundsinterleaved timed scans (base, head, base, head, …) so machine-wide drift is shared evenly. - Reports per-round real times, the per-side median, and the speedup as both a percent reduction and a factor (a factor below
1.0is flagged as a regression). - Writes a run manifest under
.provenant/perf-ab/<run-id>/run-manifest.jsonrecording both refs, their resolved build revisions, the binaries used, the per-round timings, and the computed speedup.
Each run writes artifacts under:
.provenant/perf-ab/<run-id>/run-manifest.json.provenant/perf-ab/<run-id>/<side>-<phase>-scan.json(scratch scan outputs from timed runs)
- Both sides always run with
--no-license-index-cacheso a warmed license index from one round does not skew the next, matchingbenchmark-targetfairness. - The warmup run is discarded; only the
--roundsinterleaved timed runs feed the median. - Build each ref in release on a stable checkout; do not point
--base-bin/--head-binat debug or differently-profiled binaries, which would make the comparison meaningless. --check-outputnormalizes the volatile top-levelheadersblock (version, timestamps, duration) and randomly-generated packageuuid=UIDs (which differ on every run, so without this--package/-clupeprofiles could never pass) before comparing; any remaining diff fails the run, because a behavior-preserving perf change must be byte-identical.- Building two release binaries is slow. Use
--base-bin/--head-binwhen you already have both binaries, or--repo-ref/refs that share most compiled artifacts.
update-parser-golden regenerates parser .expected.json fixtures directly from current Rust parser output.
Show CLI help:
cargo run --manifest-path xtask/Cargo.toml --bin update-parser-golden -- --helpCLI arguments:
<ParserType>: parser struct name (for exampleNpmParser)<input_file>: fixture input file to parse<output_file>:.expected.jsonfile to write--list: list all registered parser types
Example:
cargo run --manifest-path xtask/Cargo.toml --bin update-parser-golden -- <ParserType> <input_file> <output_file>update-parser-golden rewrites a fixture's whole package object and only emits a
single package, so it cannot refresh multi-package goldens or goldens that wrap
the package in a richer document shape. For those, and to refresh only the
fields the central post-extraction step owns (declared_license_expression,
declared_license_expression_spdx, license_detections, holder) while
leaving the rest of the fixture byte-stable, run the parser golden suite with
the in-place refresh switch:
PROVENANT_UPDATE_PARSER_GOLDEN=1 cargo test --features golden-tests --test parsers_goldenThe switch makes the golden comparators surgically patch those fields in place and report success without comparing; unset, the comparators behave normally. Always review the resulting diff before committing.
update-copyright-golden syncs and updates copyright golden YAML fixtures (authors / ics / copyrights).
Show CLI help:
cargo run --manifest-path xtask/Cargo.toml --bin update-copyright-golden -- --helpCLI arguments:
<authors|ics|copyrights>: fixture suite to process--list-mismatches: print files where Python reference expectations differ from current Rust detector output (parity precheck)--show-diff: print missing/extra summary for those Python-reference parity mismatches (plus samples with--filter)--filter PATTERN: limit processing to paths containingPATTERN--sync-actual: write expected values from current Rust detector output--write: apply file updates (without it, command is dry-run)
ics here refers to the Android Ice Cream Sandwich (Android 4.0) fixture corpus from ScanCode reference tests.
Important distinction: this command is a maintenance/sync tool. Golden tests compare Rust detector output to local Rust-owned fixture YAMLs; --list-mismatches compares Rust detector output to Python reference expectations to decide whether a sync is parity-safe. This remains detector-level parity work; the newer file-level output rendering difference is handled separately in output and compare tests.
Expected workflow:
-
Check Python-reference parity impact first:
cargo run --manifest-path xtask/Cargo.toml --bin update-copyright-golden -- copyrights --list-mismatches --show-diff
-
If parity is acceptable for a fixture, sync from Python reference:
cargo run --manifest-path xtask/Cargo.toml --bin update-copyright-golden -- copyrights --filter <pattern> --write
-
If divergence is intentional or Rust-specific, update to Rust actuals:
cargo run --manifest-path xtask/Cargo.toml --bin update-copyright-golden -- copyrights --sync-actual --filter <pattern> --write
update-license-golden syncs and updates license golden YAML fixtures.
Show CLI help:
cargo run --manifest-path xtask/Cargo.toml --bin update-license-golden -- --helpCLI arguments:
--list-mismatches(--list-diffsalias): print files where Python reference expectations differ from current Rust detector output (parity precheck)--show-diff: print detailed diff for those mismatches--filter PATTERN: limit processing to paths containingPATTERN--suite SUITE: process only one suite (lic1, lic2, lic3, lic4, external, unknown)--sync-actual: write expected values from current Rust detector output--write: apply file updates (without it, command is dry-run)
Expected workflow:
-
Check Python-reference parity impact first:
cargo run --manifest-path xtask/Cargo.toml --bin update-license-golden -- --list-mismatches --show-diff
-
If parity is acceptable for a fixture, sync from Python reference:
cargo run --manifest-path xtask/Cargo.toml --bin update-license-golden -- --suite lic1 --filter <pattern> --write
-
If divergence is intentional or Rust-specific, update to Rust actuals:
cargo run --manifest-path xtask/Cargo.toml --bin update-license-golden -- --sync-actual --suite unknown --filter <pattern> --write
validate-urls systematically validates all URLs in production documentation and Rust docstrings.
Manual run:
cargo run --quiet --manifest-path xtask/Cargo.toml --bin validate-urls -- --root .Exit codes:
0: all URLs valid1: some URLs failed validation
This command is informational in CI and does not block PRs.
generate-supported-formats regenerates docs/SUPPORTED_FORMATS.md from parser metadata.
Examples:
cargo run --manifest-path xtask/Cargo.toml --bin generate-supported-formats
cargo run --manifest-path xtask/Cargo.toml --bin generate-supported-formats -- --checkgenerate-output-field-reference regenerates the checked-in public output field reference from semantic metadata stored under src/output_schema/.
Examples:
cargo run --manifest-path xtask/Cargo.toml --bin generate-output-field-reference
cargo run --manifest-path xtask/Cargo.toml --bin generate-output-field-reference -- --checkgenerate-serve-openapi regenerates the checked-in OpenAPI document for the current provenant serve API surface.
Examples:
cargo run --manifest-path xtask/Cargo.toml --bin generate-serve-openapi
cargo run --manifest-path xtask/Cargo.toml --bin generate-serve-openapi -- --checkgenerate-benchmark-chart regenerates docs/scan-duration-vs-files.svg and refreshes the headline benchmark summary stats in docs/BENCHMARKS.md from the benchmark timing rows in that document.
The command computes and prints:
- number of recorded runs
- number of runs where Provenant is faster
- median speedup
- geometric-mean speedup
- median speedup on sub-100-file targets
- median speedup on 10k+-file targets
--check verifies both the checked-in SVG and the BENCHMARKS headline summary line.
Examples:
cargo run --manifest-path xtask/Cargo.toml --bin generate-benchmark-chart
cargo run --manifest-path xtask/Cargo.toml --bin generate-benchmark-chart -- --checkgenerate-index-artifact regenerates the embedded license index artifact from ScanCode rules and licenses.
The generated artifact reflects the checked-in build policy at
resources/license_detection/index_build_policy.toml, so policy changes should
be committed alongside the regenerated license_index.zst artifact.
Downstream add/replace overlays live as regular .RULE / .LICENSE files under
resources/license_detection/overlay/, and the generated artifact embeds
provenance that surfaces in structured scan output headers.
If upstream absorbs one of these local curations, artifact generation now fails
fast so maintainers remove the redundant ignore/overlay instead of silently
shipping dead policy.
Examples:
cargo run --manifest-path xtask/Cargo.toml --bin generate-index-artifact
cargo run --manifest-path xtask/Cargo.toml --bin generate-index-artifact -- --checkclassify-rule-overmatch scores every upstream ScanCode rule against the same
overmatch-risk signals the bundled overlays in
resources/license_detection/index_build_policy.toml already encode, then ranks
the rules that share a systematic root cause but are not yet covered by an
overlay. It is a read-only reporting tool: it never edits the dataset or policy.
The risk classes mirror the existing overlay clusters:
BareWeakWord: short bare GPL-family shorthand mapped to an unversioned bucket (GPL →gpl-1.0-plus, LGPL →lgpl-2.0-plus, AGPL →agpl-3.0-plus); too weak to assert a concrete license, so a clue-only treatment fits.VersionMismatch: a short notice whose expression asserts a specific elevated GPL/LGPL/AGPL version but whose text carries no matching version anchor, so it can overmatch a neighbouring version.BareReferencedFilename: a short notice that asserts a license via a bare COPYING/LICENSE reference with no independent version anchor.BsdEndorsement: a shortbsd-newtext rule that contains the endorsement clause but is neither continuous nor full coverage.
Rules already guarded by an inline required phrase ({{...}}), and false
positive, clue, and deprecated rules, are skipped.
Use it to triage candidates before curating overlays; the highest-signal,
systematic class in practice is BareWeakWord. Treat the other classes as
review candidates rather than blanket-apply targets: many of the flagged rules
encode ScanCode's deliberate unversioned-bucket mappings, and demoting them
would trade false positives for recall.
Examples:
# Text report ranked by risk score (defaults to the bundled ScanCode corpus).
cargo run --manifest-path xtask/Cargo.toml --bin classify-rule-overmatch
# Machine-readable output for further analysis.
cargo run --manifest-path xtask/Cargo.toml --bin classify-rule-overmatch -- --json