Phase 3a: TLS fingerprint allowlist (zero-trust device auth via JA4) #27

Default-deny gate based on the TLS Client Hello fingerprint. Lets a deployment whose clients are custom-built / parameterised at compile time (one TLS fingerprint per worker / per device class) reject anything that does not match the allowlist before a TLS handshake even completes.

This is **mTLS++**: it complements client-certificate auth by binding to the *TLS stack used by the client*, not just to possession of a key. An attacker who steals a client cert must also replicate the TLS handshake exactly (cipher list, extension order, sigalgs, ALPN order, GREASE position, ...), which is strictly harder than copying a `.key` file.

## Use case (the one that motivated this issue)

> "My customer wants only their own clients. I build them a client with a unique fingerprint, different for every worker. Their services accept only those authorised browsers, wherever they are; anyone else who shows up, even with valid auth credentials, cannot get any information served by Zion."

Allowlist of known fingerprints + default deny. Out of scope for this phase: bot-detection by fingerprint *mismatch* against a public browser table — that's Phase 3b's territory and uses a different mechanism (ML scoring, no hard gate).

## Architecture

Hot path:

```text
TCP accept
  ├── peek N bytes from socket (MSG_PEEK via socket2)
  ├── parse TLS record header + ClientHello (`tls-parser` crate)
  ├── compute JA4   (`ja4` crate, FoxIO-LLC standard, ~500 ns)
  ├── early ban check: HashSet<JA4> ∋ fp ?
  │     hit:  socket.close() — RST, NO TLS handshake started
  │     miss: continue
  └── pass stream to rustls (the peeked bytes are still in kernel buffer)

after rustls handshake → HTTP request arrives
  └── log: "tls_fp ja4=t13d1516h2_... worker=acme-042 path=/api/..."
  └── forward to upstream:
        X-Client-Fingerprint:        sha256:HEX                  (mTLS leaf, already shipped in v0.1.7)
        X-Client-TLS-JA4:            t13d1516h2_8daaf6152771_b186095e22b6
        X-Client-TLS-Allowlisted:    acme-worker-042              (the configured `name` of the matched entry)
```
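The first two steps of the hot path (peek, then recognise a ClientHello) can be sketched with the standard library alone. This is a minimal illustration, not the proposed `src/tls_fp.rs`: the names `PeekVerdict`, `classify` and `peek_client_hello` are hypothetical, `std::net::TcpStream::peek` stands in for the `socket2` call, and full ClientHello parsing plus JA4 computation are left to `tls-parser` and `ja4`, whose APIs are not shown here.

```rust
use std::io;
use std::net::TcpStream;

/// Outcome of inspecting the first bytes of a freshly accepted connection.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq)]
pub enum PeekVerdict {
    /// The bytes look like a TLS handshake record carrying a ClientHello.
    ClientHello { record_len: usize },
    /// Fewer than 6 bytes have arrived so far; the caller may peek again.
    NeedMoreData,
    /// Not a TLS ClientHello (plain HTTP, port scan, garbage).
    NotTls,
}

/// Classifies a buffer that starts at the first byte of the TCP stream.
pub fn classify(buf: &[u8]) -> PeekVerdict {
    if buf.len() < 6 {
        return PeekVerdict::NeedMoreData;
    }
    // TLS record header: content type (1 byte), legacy version (2), length (2).
    // 0x16 = handshake record; 0x03 = major version byte used by TLS 1.0 to 1.3.
    let is_handshake_record = buf[0] == 0x16 && buf[1] == 0x03;
    // First byte of the handshake message: 0x01 = ClientHello.
    let is_client_hello = buf[5] == 0x01;
    if is_handshake_record && is_client_hello {
        let record_len = u16::from_be_bytes([buf[3], buf[4]]) as usize;
        PeekVerdict::ClientHello { record_len }
    } else {
        PeekVerdict::NotTls
    }
}

/// Peeks without consuming, so the same bytes stay queued in the kernel
/// buffer for the TLS acceptor that reads the stream afterwards.
pub fn peek_client_hello(stream: &TcpStream, buf: &mut [u8]) -> io::Result<PeekVerdict> {
    let n = stream.peek(buf)?;
    Ok(classify(&buf[..n]))
}
```

A real implementation would also have to handle a ClientHello that spans several TCP segments or TLS records, which is why `NeedMoreData` is a distinct outcome rather than a rejection.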

Cost per legit handshake: peek (~100 ns) + JA4 compute (~500 ns) + HashSet lookup (~30 ns) ≈ **~700 ns**. TLS handshake itself is 200–800 µs — added overhead is sub-1%.

Cost per attacker handshake: peek + compute + lookup miss + close ≈ **~700 ns**. Crucially, no TLS handshake is started: a rejected connection costs Zion the same ~700 ns pre-check that a legitimate one pays, and none of the 200–800 µs of handshake CPU.
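The overhead figure can be checked with a few lines of arithmetic. The constants below are the estimates quoted above, not measurements, and the function names are illustrative.

```rust
/// Estimated cost of each pre-check step, in nanoseconds (figures from the text).
pub const PEEK_NS: f64 = 100.0;
pub const JA4_NS: f64 = 500.0;
pub const LOOKUP_NS: f64 = 30.0;

/// Total pre-check cost per connection: 100 + 500 + 30 = 630 ns.
pub fn precheck_ns() -> f64 {
    PEEK_NS + JA4_NS + LOOKUP_NS
}

/// Added cost as a percentage of a TLS handshake lasting `handshake_us` microseconds.
pub fn overhead_percent(added_ns: f64, handshake_us: f64) -> f64 {
    100.0 * added_ns / (handshake_us * 1_000.0)
}
```

With the unrounded 630 ns sum, `overhead_percent(precheck_ns(), 200.0)` is about 0.32% for a fast handshake and `overhead_percent(precheck_ns(), 800.0)` about 0.08% for a slow one; with the rounded 700 ns the fast case is 0.35%. Either way the result stays below 1%.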

## Config shape

```toml
[tls.fingerprint]
mode = "allowlist"          # off | shadow | allowlist
on_unknown = "drop"         # drop | log_only

[[tls.fingerprint.allowed]]
name = "acme-worker-042"
ja4  = "t13d1516h2_8daaf6152771_b186095e22b6"
allowed_routes  = ["/api/*", "/admin/*"]   # optional: route scoping
rate_limit_rps  = 100                       # optional: per-fingerprint override

[[tls.fingerprint.allowed]]
name = "acme-worker-043"
ja4  = "t13d1517h2_..."
```
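One possible in-memory shape for this config is sketched below. The type and function names (`Mode`, `AllowedEntry`, `index_by_ja4`) are hypothetical, TOML deserialisation is left out, and rejecting duplicate `ja4` values is an assumption of this sketch rather than something the proposal specifies.

```rust
use std::collections::HashMap;
use std::str::FromStr;

/// The three values accepted by `mode` in `[tls.fingerprint]`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Off,
    Shadow,
    Allowlist,
}

impl FromStr for Mode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "off" => Ok(Mode::Off),
            "shadow" => Ok(Mode::Shadow),
            "allowlist" => Ok(Mode::Allowlist),
            other => Err(format!("unknown tls.fingerprint mode: {}", other)),
        }
    }
}

/// One `[[tls.fingerprint.allowed]]` table.
#[derive(Debug, Clone, PartialEq)]
pub struct AllowedEntry {
    pub name: String,
    pub ja4: String,
    /// Empty means "no route scoping" (the key is optional in the TOML).
    pub allowed_routes: Vec<String>,
    /// `None` means "use the global rate limit".
    pub rate_limit_rps: Option<u32>,
}

/// Indexes entries by JA4 so the accept path needs a single lookup.
/// Assumption: two entries with the same JA4 are a config error.
pub fn index_by_ja4(entries: Vec<AllowedEntry>) -> Result<HashMap<String, AllowedEntry>, String> {
    let mut map = HashMap::with_capacity(entries.len());
    for entry in entries {
        if let Some(previous) = map.insert(entry.ja4.clone(), entry) {
            return Err(format!(
                "duplicate ja4 {} (already used by entry {})",
                previous.ja4, previous.name
            ));
        }
    }
    Ok(map)
}
```

Keying the map by JA4 also gives the matched entry's `name` for free, which is the value forwarded in `X-Client-TLS-Allowlisted`.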

Hot-reload: the allowlist hot-swaps via the Phase 1 `ResolvedAppConfig` watcher. Adding a worker is "edit `zion.toml`, save, wait 2 s for the debounce".
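How the swap itself could look is sketched below using only the standard library. `SharedAllowlist` is a hypothetical name, and this is not how the Phase 1 `ResolvedAppConfig` watcher is known to work; it only illustrates the "build the new set, then publish it with one pointer swap" idea.

```rust
use std::collections::HashSet;
use std::sync::{Arc, RwLock};

/// Allowlist shared between the accept loop and the config watcher.
pub struct SharedAllowlist {
    current: RwLock<Arc<HashSet<String>>>,
}

impl SharedAllowlist {
    pub fn new(initial: HashSet<String>) -> Self {
        Self {
            current: RwLock::new(Arc::new(initial)),
        }
    }

    /// Accept path: take a cheap snapshot (one `Arc` clone under a short read lock).
    /// The snapshot stays valid even if a reload happens right afterwards.
    pub fn snapshot(&self) -> Arc<HashSet<String>> {
        let guard = self.current.read().expect("allowlist lock poisoned");
        Arc::clone(&*guard)
    }

    /// Watcher path: the new set is built by the caller, outside the lock;
    /// publishing it is a single pointer replacement.
    pub fn replace(&self, next: HashSet<String>) {
        let mut guard = self.current.write().expect("allowlist lock poisoned");
        *guard = Arc::new(next);
    }
}
```

Connections that took a snapshot before the reload keep evaluating against the old set; only connections accepted after `replace` returns see the new worker.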

## Why JA4 (not JA3, not custom)

* JA3 is end-of-life: Chrome 110+ randomises extension order; the same browser version produces multiple JA3 hashes.
* JA4 (FoxIO-LLC, 2023) is deterministic by construction and explicitly handles the randomisation.
* Crate `ja4` is mature, ~500 LoC, well-maintained.
* JA4 string format is stable, human-readable, and trivial to grep in logs.
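To illustrate the "human-readable" point, the sketch below splits a JA4 string into its documented fields. `Ja4Parts` and `parse_ja4` are hypothetical helpers written for this example, not part of the `ja4` crate, and the validation is deliberately loose.

```rust
/// The fields encoded in a JA4 string such as
/// `t13d1516h2_8daaf6152771_b186095e22b6`.
#[derive(Debug, PartialEq, Eq)]
pub struct Ja4Parts<'a> {
    pub transport: char,         // 't' = TLS over TCP, 'q' = QUIC
    pub tls_version: &'a str,    // "13" = TLS 1.3
    pub sni: char,               // 'd' = SNI with a domain, 'i' = no SNI
    pub cipher_count: u8,        // number of cipher suites offered
    pub extension_count: u8,     // number of extensions offered
    pub alpn: &'a str,           // first and last character of the first ALPN value
    pub cipher_hash: &'a str,    // truncated SHA-256 over the sorted cipher list
    pub extension_hash: &'a str, // truncated SHA-256 over sorted extensions + sigalgs
}

pub fn parse_ja4<'a>(s: &'a str) -> Option<Ja4Parts<'a>> {
    if !s.is_ascii() {
        return None;
    }
    let mut sections = s.split('_');
    let (head, ciphers, exts) = (sections.next()?, sections.next()?, sections.next()?);
    if sections.next().is_some() || head.len() != 10 || ciphers.len() != 12 || exts.len() != 12 {
        return None;
    }
    // Both hash sections are 12 lowercase hex characters.
    let is_hex = |h: &str| {
        h.bytes()
            .all(|c| c.is_ascii_digit() || (b'a'..=b'f').contains(&c))
    };
    if !is_hex(ciphers) || !is_hex(exts) {
        return None;
    }
    Some(Ja4Parts {
        transport: head.chars().next()?,
        tls_version: &head[1..3],
        sni: head.chars().nth(3)?,
        cipher_count: head[4..6].parse().ok()?,
        extension_count: head[6..8].parse().ok()?,
        alpn: &head[8..10],
        cipher_hash: ciphers,
        extension_hash: exts,
    })
}
```

For the example fingerprint this yields TLS 1.3 over TCP, SNI present, 15 ciphers, 16 extensions and ALPN `h2`. A check like this could also reject malformed `ja4` values at config-load time.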

## Trade-offs an operator must accept

1. **Distribute an agent, not a stock browser.** A stock Chrome/Firefox/Safari can't be parameterised per worker: its TLS fingerprint is determined by the libraries it links. The custom-fingerprint model requires the customer to ship a small native agent / tool (think Tailscale agent, Cloudflare Access agent) on each worker laptop. Documented up-front so no one expects "Chrome installed on a laptop = unique fingerprint per employee".
2. **Fingerprint stability vs TLS-stack updates.** Worker agent must pin its TLS library; auto-update of the OS TLS stack would silently change the fingerprint and lock the worker out. Mitigated by static linking on the agent side, not by Zion.
3. **mTLS coexistence.** mTLS continues to work; `[tls.client_ca_path]` and `[tls.client_auth]` are unaffected. Combined, the two checks are independent hurdles: an attacker needs the cert key AND a TLS stack with the right fingerprint.
4. **io_uring (`--features io-uring-accept`).** The peek-before-rustls pattern needs to land between accept and the TLS acceptor. The uring multishot accept hands over a bare TCP stream, so the peek can run on that stream before it reaches rustls. This should be compatible, but it is worth verifying with a soak test before claiming support.

## Roadmap (incremental commits)

1. `src/tls_fp.rs` module: peek wrapper around `TcpStream`, `tls-parser` integration, JA4 compute. No config integration yet — pure library code with unit tests against canned ClientHello samples.
2. `[tls.fingerprint]` config + `mode = "shadow"`: log + counters (`tls_fp_unknown_total`, `tls_fp_known_total{name="X"}`), no blocking. **Default-off in this commit.** Lets operators run 7–30 days of shadow mode in production before flipping to enforce.
3. `mode = "allowlist"`: socket close on miss, optional in-memory ban map (`DashMap<JA4, banned_until>`) so the second connection from a banned fingerprint short-circuits at peek time.
4. Header forwarding (`X-Client-TLS-JA4`, `X-Client-TLS-Allowlisted`), inbound-header strip on these (same hardening rule as `X-Real-IP` / `X-Client-Cert-Fingerprint` since v0.1.7).
5. Docs (`docs/security/tls-fingerprint.md`) + integration tests.

Total estimated diff: ~700 lines, **+1 dependency** (`ja4` crate). `tls-parser` is already a transitive dep through `rustls`.
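The ban map from step 3 can be sketched as follows. This is a single-threaded stand-in that uses `std::collections::HashMap` instead of `DashMap`, and `BanMap` is a hypothetical name. Time is passed in explicitly so the expiry logic is testable.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Remembers recently rejected fingerprints for a fixed TTL.
pub struct BanMap {
    ttl: Duration,
    banned_until: HashMap<String, Instant>,
}

impl BanMap {
    pub fn new(ttl: Duration) -> Self {
        Self {
            ttl,
            banned_until: HashMap::new(),
        }
    }

    /// Records a rejection; a repeated rejection extends the ban.
    pub fn ban(&mut self, ja4: &str, now: Instant) {
        self.banned_until.insert(ja4.to_owned(), now + self.ttl);
    }

    /// Returns true while the ban is active and drops expired entries lazily.
    pub fn is_banned(&mut self, ja4: &str, now: Instant) -> bool {
        match self.banned_until.get(ja4).copied() {
            Some(until) if now < until => true,
            Some(_) => {
                // The TTL has elapsed: forget the entry so the map does not grow forever.
                self.banned_until.remove(ja4);
                false
            }
            None => false,
        }
    }
}
```

Lazy expiry only removes entries that are looked up again, so a production version would likely also need a periodic sweep or a size cap to bound memory under a flood of distinct fingerprints.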

## Acceptance criteria

* [ ] `[tls.fingerprint]` config absent or `mode = "off"` → zero overhead, zero peek calls.
* [ ] `mode = "shadow"` → JA4 computed for every handshake, counters `tls_fp_known_total{name="X"}` / `tls_fp_unknown_total` exposed; NO blocking; structured log line per unknown.
* [ ] `mode = "allowlist"` → unknown JA4 → `socket.close()` BEFORE rustls handshake started; `tls_fp_unknown_drops_total` counter bumped.
* [ ] Banned-set short-circuit: a connection whose JA4 was recently rejected exits at peek time (no JA4 recompute, just hash lookup). Configurable TTL (default 10 min).
* [ ] Hot-reload of `[[tls.fingerprint.allowed]]` works via the Phase 1 watcher: add worker, save, ≤ 3 s later that JA4 is accepted.
* [ ] Allowed routes per fingerprint enforced (regex / glob match, same path matcher as `[[route]]`).
* [ ] Per-fingerprint rate-limit override applied in dispatch, isolated from the global per-IP rate-limit map.
* [ ] Inbound `X-Client-TLS-JA4` / `X-Client-TLS-Allowlisted` headers sent by the client are stripped before the request is forwarded upstream.
* [ ] Compatible with `--features io-uring-accept` (the peek runs on each stream produced by uring's `accept_multishot`, before rustls consumes it), verified by a soak test.
* [ ] `cargo build --release --no-default-features` is unaffected (the entire feature is opt-in via cargo feature `tls-fingerprint`).
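The header-stripping criterion above is the easiest one to get subtly wrong, because header names are case-insensitive. A minimal sketch follows, assuming headers are available as name/value pairs; `sanitize_headers` is a hypothetical helper, not an existing Zion function.

```rust
/// Headers that only Zion may set; compared case-insensitively.
const TRUSTED_HEADERS: [&str; 2] = ["x-client-tls-ja4", "x-client-tls-allowlisted"];

/// Drops any client-supplied copy of the trusted headers, then appends
/// the values Zion derived itself from the TLS handshake.
pub fn sanitize_headers(
    inbound: Vec<(String, String)>,
    ja4: &str,
    entry_name: &str,
) -> Vec<(String, String)> {
    let mut out: Vec<(String, String)> = inbound
        .into_iter()
        .filter(|(name, _)| {
            !TRUSTED_HEADERS
                .iter()
                .any(|trusted| name.eq_ignore_ascii_case(trusted))
        })
        .collect();
    out.push(("X-Client-TLS-JA4".to_owned(), ja4.to_owned()));
    out.push(("X-Client-TLS-Allowlisted".to_owned(), entry_name.to_owned()));
    out
}
```

Stripping first and appending afterwards means the upstream sees exactly one value per trusted header, whatever the client sent.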

## Out of scope

* Bot detection by *mismatch* against a curated browser table (UA ↔ JA4) — see #28 (Phase 3b: ML score). The two features can coexist on the same `[tls.fingerprint]` section in the future, but Phase 3a is strictly allowlist-only.
* Distributing the worker-side TLS agent. Zion validates fingerprints; the agent is the customer's deliverable.
* TLS 1.3 ECH (Encrypted Client Hello) handling. JA4 is reasonably stable across ECH variants but the testing matrix is large; track as a separate sub-issue once 3a lands.

## Questions to settle in code review

1. Should the allowlist storage be a `HashSet<String>` (current proposal) or a more compact `HashSet<[u8; 36]>` (raw JA4 bytes)? The string is debugger-friendly, the byte form is faster to compare. Microbenchmark before deciding.
2. Default value of `on_unknown` when `mode = "shadow"`: hard-coded to `log_only` (we are in shadow mode by definition), or configurable? Lean toward hard-coded for fewer foot-guns.
3. Should the boot-time validation refuse to start if `mode = "allowlist"` and the allowlist is empty? Otherwise the daemon silently drops every connection on first boot. Lean YES with a `mode = "allowlist-empty-ok"` opt-out for migration scenarios.
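To make questions 1 and 3 concrete, here is a small sketch of both options. The names `ja4_key` and `validate_at_boot` are hypothetical, and the behaviour shown for question 3 is the "lean YES" variant, which is still open for review.

```rust
use std::convert::TryFrom;

/// Question 1: the compact key form. A JA4 string is 10 + 1 + 12 + 1 + 12 = 36
/// ASCII bytes, so it fits a fixed-size array; any other length is rejected.
pub fn ja4_key(ja4: &str) -> Option<[u8; 36]> {
    <[u8; 36]>::try_from(ja4.as_bytes()).ok()
}

/// Question 3: refuse to start in enforcing mode with an empty allowlist,
/// unless the operator opted out with `mode = "allowlist-empty-ok"`.
pub fn validate_at_boot(mode: &str, allowlist_len: usize) -> Result<(), String> {
    if mode == "allowlist" && allowlist_len == 0 {
        return Err(
            "mode = \"allowlist\" with no [[tls.fingerprint.allowed]] entries would drop \
             every connection; add an entry or use mode = \"allowlist-empty-ok\""
                .to_owned(),
        );
    }
    Ok(())
}
```

The array key also doubles as a cheap length check at config-load time, which may matter more in practice than the comparison speed-up the microbenchmark is meant to measure.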

## References

* JA4 spec: https://github.com/FoxIO-LLC/ja4
* `ja4` crate: https://crates.io/crates/ja4
* `tls-parser`: https://crates.io/crates/tls-parser
* mTLS implementation in Zion (v0.1.7): `src/main.rs::spawn_https_handler` — the place where peek would slot in
