Thanks for contributing to Provenant.
This document is the main entry point for contributor workflow. Keep README.md focused on project overview and user-facing setup, and use this file for contributor-specific guidance.
- Read README.md for installation, usage, and the high-level project overview.
- Use docs/DOCUMENTATION_INDEX.md as the map for the rest of the docs set.
- For security issues, follow SECURITY.md instead of opening a public issue or pull request first.
Before running the bootstrap flow, install:
- Git
- A Rust toolchain with cargo available on your PATH (see rust-toolchain.toml)
- Node.js >=24.0.0 with npm available on your PATH (see package.json)
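A quick way to confirm these prerequisites before bootstrapping is a small shell check like the one below. It is only a sketch: the function names `node_major_ok` and `check_prereqs` are illustrative and not part of the repository, and the check only compares the Node.js major version against 24.

```shell
#!/usr/bin/env bash
# Sketch: report which bootstrap prerequisites are missing from PATH.
# The function names are illustrative and not part of the repository.

# Succeeds when the major component of a version string (e.g. "v24.1.0")
# is at least the required major version passed as the second argument.
node_major_ok() {
  local version="${1#v}"        # strip the leading "v" that `node --version` prints
  local major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge "$2" ] 2>/dev/null
}

# Prints one line per tool and returns the number of problems found.
check_prereqs() {
  local problems=0 tool
  for tool in git cargo node npm; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      problems=$((problems + 1))
    fi
  done
  if command -v node >/dev/null 2>&1 && ! node_major_ok "$(node --version)" 24; then
    echo "too old: node $(node --version), need >=24.0.0"
    problems=$((problems + 1))
  fi
  return "$problems"
}

check_prereqs || echo "install the missing tools before running npm run setup"
```

The Rust toolchain version itself is pinned by rust-toolchain.toml, so this sketch only checks that cargo is reachable.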
A typical local setup on Linux, macOS, or WSL is:
git clone https://github.com/mstykow/provenant.git
cd provenant
npm run setup
That command runs npm install, installs the Rust CLI helper tools used by local hooks and checks, and then runs ./setup.sh to initialize submodules and hooks.
Useful follow-up commands:
- ./setup.sh to re-run submodule and hook setup
- npm run hooks:install to re-install hooks manually
- npm run hooks:run to run the pre-commit hook suite on all files
- npm run check:docs to validate Markdown formatting and linting for documentation changes
These setup and helper commands are currently shell-oriented, so Windows contributors should prefer WSL2.
Before making non-trivial changes, read the docs that own the part of the system you are touching:
- docs/ARCHITECTURE.md for overall system design
- docs/LICENSE_DETECTION_ARCHITECTURE.md for license-index and detection internals
- docs/TESTING_STRATEGY.md for test layers and command guidance
- docs/HOW_TO_ADD_A_PARSER.md for parser-specific implementation rules
- xtask/README.md for maintainer commands such as compare runs, golden maintenance, benchmarks, and generated artifacts
Repo-specific expectations:
- Preserve behavior and parity, especially when using the ScanCode reference as a behavioral spec.
- When a surface intentionally diverges from ScanCode (for example raw-default file-level copyright rendering), preserve and test both the default Provenant contract and the explicit --compat-mode scancode lane.
- Keep parsing static and bounded. Do not execute package-manager code, project code, or shell commands to recover metadata.
- Use cargo add, cargo remove, and targeted cargo update instead of editing Rust dependencies by hand.
- Do not add dependencies lightly; make sure they clearly justify their maintenance cost.
- Keep PR scope disciplined. For ecosystem or parser work, prefer one ecosystem family per PR.
Provenant uses the Developer Certificate of Origin (DCO) 1.1 for inbound
contributions. The DCO text is stored in DCO.
Unless a path says otherwise, contributions are accepted under the same license terms that already apply to the material you are changing in this repository. For the main project code that is Apache-2.0. Third-party and reference material kept in-tree continues to use its existing notices and licenses.
Every commit you author must include a Signed-off-by: trailer that matches
your commit author identity. The easiest way to do that is to use:
git commit -s
If you forgot to sign off the latest commit, fix it with:
git commit --amend -s --no-edit
If you rewrite or squash commits before merge, make sure the resulting commits still carry the sign-off. The local git hook can catch missing sign-offs early, and the GitHub DCO app enforces PR-level compliance.
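To make the "trailer matches your author identity" rule concrete, here is a minimal sketch of such a check. It is not the logic used by the local hook or the GitHub DCO app; the function name `has_matching_signoff` and the sample identity are made up for illustration.

```shell
# Sketch only: not the repository's hook or the DCO app's implementation.
# Succeeds when the commit message contains a Signed-off-by trailer line
# that exactly matches the given "Name <email>" identity.
has_matching_signoff() {
  local message="$1" identity="$2"
  # -F: fixed string, -x: match the whole line, -q: no output
  printf '%s\n' "$message" | grep -Fxq "Signed-off-by: $identity"
}

# Hypothetical commit message and author used for the demonstration.
message='fix(parser): handle empty manifest

Signed-off-by: Jane Doe <jane@example.com>'

if has_matching_signoff "$message" 'Jane Doe <jane@example.com>'; then
  echo "sign-off present"
fi
```

Inside a checkout, the same function could be fed the latest commit with `has_matching_signoff "$(git log -1 --format=%B)" "$(git log -1 --format='%an <%ae>')"`.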
Provenant's first-party code and automation files carry SPDX-style license
headers using the Apache-2.0 expression and a Provenant contributors
copyright line.
The header intentionally omits a year so maintainers do not have to rewrite the entire tree every calendar year just to keep boilerplate current.
The current rollout intentionally covers repo-owned, comment-friendly files such
as Rust sources, selected shell scripts, and GitHub workflow/action YAML. It
intentionally does not rewrite paths where prepended text would be
misleading or risky, including reference/**, testdata/**,
resources/license_detection/**, and generated docs such as
docs/SUPPORTED_FORMATS.md.
The scope lives in .license-headers.toml using explicit include and
exclude lists so the checker and hooks share one source of truth.
Source files whose Rust code is derived from the ScanCode Toolkit (ported and
modified) carry dual attribution: an upstream nexB Inc. and others copyright
line and a "Derived from ScanCode Toolkit" change notice, in addition to the
Provenant contributors line (Apache-2.0 section 4). These files are enumerated
in the derived list in .license-headers.toml, which is the authoritative
record and is enforced by the header check.
If you add a new file that ports or adapts ScanCode code, or convert existing
code into such a port, add its path to the derived list and run --fix to
apply the four-line header. Removing a path and rerunning --fix reverts it to
the two-line header. See
tools/license-headers/README.md for the
exact header format and the root NOTICE for the project-level
attribution.
To backfill or repair headers manually, run:
cargo run --quiet --locked --manifest-path tools/license-headers/Cargo.toml -- --fix
Lefthook checks staged in-scope files without rewriting them, and CI verifies the full allowlist with:
cargo run --quiet --locked --manifest-path tools/license-headers/Cargo.toml -- --check
If the hook reports a missing header, repair it explicitly with:
cargo run --quiet --locked --manifest-path tools/license-headers/Cargo.toml -- --fix
xtask/ remains the default home for Rust-based maintainer workflows that are
intentionally coupled to Provenant internals or to the repo-built provenant
binary.
Small, self-contained tools may live as separate workspace crates under
tools/ when package-boundary isolation materially improves a hot path such as
pre-commit hooks or fast CI checks. The current example is
tools/license-headers/, which stays independent from the heavier
xtask -> provenant-cli dependency graph so routine header checks do not pay
for unrelated release-version rebuilds.
Do not split tools out of xtask/ just for symmetry. Treat a standalone tool
crate as the exception for fast-path, self-contained workflows; keep maintainer
workflows tied to scanner internals, parser metadata, golden maintenance, or
the built CLI in xtask/.
Keep local validation tightly scoped. This repository has many slow and specialized tests, so the default is the smallest command that proves your change.
Prefer:
- cargo test --doc for doctests
- cargo test --test <suite_name> for a top-level integration suite
- cargo test --lib <filter> for a focused library/parser target
- cargo test --features golden-tests <filter> only when the change directly affects golden-covered behavior
Avoid broad local commands such as cargo test, cargo test --all, cargo test --lib, or unfiltered golden suites unless there is no narrower way to validate the change.
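As an illustration of picking the narrowest command, the sketch below maps a kind of change to the matching invocation from the list above and prints it rather than running it. The helper name `narrow_test_command`, the kind labels, and the example filter are hypothetical; substitute real suite names or filters from the repository.

```shell
# Sketch: print the narrowest cargo invocation for a given kind of change.
# The kind labels and the example filter are illustrative only.
narrow_test_command() {
  case "$1" in
    doc)    echo "cargo test --doc" ;;
    suite)  echo "cargo test --test $2" ;;
    lib)    echo "cargo test --lib $2" ;;
    golden) echo "cargo test --features golden-tests $2" ;;
    *)      echo "unknown kind: $1" >&2; return 1 ;;
  esac
}

# Dry run with a hypothetical filter; copy the printed command to execute it.
narrow_test_command lib my_parser::tests
```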
Important testing rules:
- Prefer exact test paths or narrowly owned suites over broad substring filters.
- Do not update golden expected files just to make a failing test pass; fix the implementation unless the output change is intentional and documented.
- If you touch fixture-maintenance workflows or generated docs, use the canonical command reference in xtask/README.md.
- All checks must pass before merging, even if CI is the place that runs the full matrix.
For package parser work, treat docs/HOW_TO_ADD_A_PARSER.md as the canonical guide.
That guide covers parser invariants, registration, datasource wiring, assembly integration, test expectations, and validation against the Python reference or an authoritative format spec. It also links back to the project-wide setup and testing docs instead of duplicating them.
- Write commit messages in Conventional Commits format: type(scope): short summary when a scope helps, or type: short summary otherwise.
- Use the same Conventional Commits format for pull request titles.
- Sign off every commit with git commit -s so the branch satisfies the DCO policy.
- Follow the structure in .github/pull_request_template.md and omit sections that do not apply.
- Keep summaries focused on why the change exists, not just what changed.
- Keep evergreen contributor and architecture docs under docs/.
- Do not edit generated files such as docs/SUPPORTED_FORMATS.md by hand; use the owning generation command.
- For release, benchmark, compare-output, and artifact-generation workflows, use xtask/README.md.
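The Conventional Commits title shape from the list above can be checked locally with a small pattern match. This is a rough sketch, not a tool shipped with the repository: the function name `is_conventional_title` is illustrative, the pattern accepts any lowercase type, and the optional `!` marker comes from the Conventional Commits specification rather than from this guide.

```shell
# Sketch: succeed when the first line of the input looks like
# "type(scope): summary" or "type: summary".
# The regex is an approximation, not an official validator.
is_conventional_title() {
  printf '%s\n' "$1" | head -n 1 |
    grep -Eq '^[a-z]+(\([^()]+\))?!?: [^ ].*$'
}

# Hypothetical titles used for the demonstration.
for title in 'feat(parser): add lockfile support' 'Update stuff'; do
  if is_conventional_title "$title"; then
    echo "ok:  $title"
  else
    echo "bad: $title"
  fi
done
```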
If you are unsure which document owns a topic, start with docs/DOCUMENTATION_INDEX.md.
If you believe you have found a security issue, follow SECURITY.md and do not disclose it publicly first.