Default-deny gate based on the TLS Client Hello fingerprint. Lets a deployment whose clients are custom-built / parameterised at compile time (one TLS fingerprint per worker / per device class) reject anything that does not match the allowlist before a TLS handshake even completes.
This is mTLS++: it complements client-certificate auth by binding to the TLS stack used by the client, not just to the possession of a key. An attacker who steals a client cert still has to replicate the ClientHello exactly (cipher list, extension order, sigalgs, ALPN order, GREASE position, ...), which is strictly harder than copying a .key file.
Use case (the one that motivated this issue)
"My customer wants only its own clients. I build the customer a client with a unique fingerprint, different for every worker. The customer's services accept only those browser-auth clients, wherever they are; anyone else who shows up, even with valid auth credentials, cannot get Zion to serve them any information."
Allowlist of known fingerprints + default deny. Out of scope for this phase: bot-detection by fingerprint mismatch against a public browser table — that's Phase 3b's territory and uses a different mechanism (ML scoring, no hard gate).
Architecture
Hot path:
TCP accept
  ├── peek N bytes from socket (MSG_PEEK via socket2)
  ├── parse TLS record header + ClientHello (`tls-parser` crate)
  ├── compute JA4 (`ja4` crate, FoxIO-LLC standard, ~500 ns)
  ├── early ban check: HashSet<JA4> ∋ fp ?
  │     hit:  socket.close() (RST, NO TLS handshake started)
  │     miss: continue
  └── pass stream to rustls (the peeked bytes are still in kernel buffer)

after rustls handshake → HTTP request arrives
  ├── log: "tls_fp ja4=t13d1516h2_... worker=acme-042 path=/api/..."
  └── forward to upstream:
        X-Client-Fingerprint: sha256:HEX (mTLS leaf, already shipped in v0.1.7)
        X-Client-TLS-JA4: t13d1516h2_8daaf6152771_b186095e22b6
        X-Client-TLS-Allowlisted: acme-worker-042 (the configured `name` of the matched entry)
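The first two steps of the hot path can be sketched with the standard library alone. This is a minimal illustration, not the proposed `src/tls_fp.rs`: it uses `std::net::TcpStream::peek` in place of socket2, and it inspects only the 5-byte TLS record header plus the first handshake byte instead of running a full ClientHello parse. The names `Peeked`, `classify` and `peek_classify` are hypothetical.

```rust
use std::io;
use std::net::TcpStream;

/// Outcome of inspecting the first peeked bytes of a connection.
#[derive(Debug, PartialEq, Eq)]
pub enum Peeked {
    /// Fewer bytes than needed to decide; the caller may peek again.
    NeedMore,
    /// Not a TLS handshake record carrying a ClientHello.
    NotClientHello,
    /// Looks like a ClientHello; `record_len` is header + payload size in bytes.
    ClientHello { record_len: usize },
}

/// Classify a peeked prefix from the record header and first handshake byte.
pub fn classify(buf: &[u8]) -> Peeked {
    if buf.len() < 6 {
        return Peeked::NeedMore;
    }
    // Record header layout: content_type(1) | legacy_version(2) | length(2).
    let is_handshake = buf[0] == 0x16;
    let is_tls_major = buf[1] == 0x03;
    // First byte of the handshake message: msg_type, 0x01 = ClientHello.
    let is_client_hello = buf[5] == 0x01;
    if !(is_handshake && is_tls_major && is_client_hello) {
        return Peeked::NotClientHello;
    }
    let payload_len = u16::from_be_bytes([buf[3], buf[4]]) as usize;
    Peeked::ClientHello { record_len: 5 + payload_len }
}

/// Peek without consuming, so the bytes stay queued for the TLS acceptor.
pub fn peek_classify(stream: &TcpStream) -> io::Result<Peeked> {
    let mut buf = [0u8; 6];
    let n = stream.peek(&mut buf)?;
    Ok(classify(&buf[..n]))
}
```

A real implementation would use `record_len` to decide how many bytes to peek before handing the buffer to the ClientHello parser; a ClientHello that spans several TCP segments needs a retry loop, which this sketch leaves to the caller.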
Cost per legit handshake: peek (~100 ns) + JA4 compute (~500 ns) + HashSet lookup (~30 ns) ≈ 630 ns, rounded up to ~700 ns. The TLS handshake itself is 200–800 µs, so the added overhead is well under 1%.
Cost per attacker handshake: peek + compute + lookup miss + close ≈ ~700 ns. Crucially, NO TLS handshake is started: Zion spends about 700 ns on the rejected connection instead of the 200–800 µs a full handshake would cost.
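The overhead claim can be checked with a few lines of arithmetic. The constants below are the estimates quoted above, not measurements; `overhead_percent` is a hypothetical helper.

```rust
/// Per-connection cost estimates quoted in the text, in nanoseconds.
pub const PEEK_NS: f64 = 100.0;
pub const JA4_NS: f64 = 500.0;
pub const LOOKUP_NS: f64 = 30.0;

/// Gate overhead as a percentage of one TLS handshake of `handshake_us` microseconds.
pub fn overhead_percent(handshake_us: f64) -> f64 {
    let added_ns = PEEK_NS + JA4_NS + LOOKUP_NS; // 630 ns
    100.0 * added_ns / (handshake_us * 1_000.0)
}
```

With these inputs the gate adds roughly 0.3% to a fast 200 µs handshake and under 0.1% to a slow 800 µs one.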
Config shape
[tls.fingerprint]
mode = "allowlist"      # off | shadow | allowlist
on_unknown = "drop"     # drop | log_only

[[tls.fingerprint.allowed]]
name = "acme-worker-042"
ja4 = "t13d1516h2_8daaf6152771_b186095e22b6"
allowed_routes = ["/api/*", "/admin/*"]  # optional: route scoping
rate_limit_rps = 100                     # optional: per-fingerprint override

[[tls.fingerprint.allowed]]
name = "acme-worker-043"
ja4 = "t13d1517h2_..."
Hot-reload: the allowlist hot-swaps via the Phase 1 ResolvedAppConfig watcher. Adding a worker is "edit zion.toml, save, wait 2 s for the debounce".
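One way the hot-swap could work is an immutable snapshot behind an `Arc`, replaced wholesale when the watcher fires. The sketch below is an assumption about the mechanism, not the Phase 1 `ResolvedAppConfig` code: `AllowedEntry`, `Snapshot` and `Allowlist` are hypothetical names, and treating an empty `allowed_routes` as "no scoping" is an illustrative choice.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// One `[[tls.fingerprint.allowed]]` entry; field names follow the config above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AllowedEntry {
    pub name: String,
    pub allowed_routes: Vec<String>, // assumption: empty means "no route scoping"
    pub rate_limit_rps: Option<u32>, // None means "use the global limit"
}

/// Immutable allowlist snapshot, keyed by JA4 string.
pub type Snapshot = HashMap<String, AllowedEntry>;

pub struct Allowlist {
    current: RwLock<Arc<Snapshot>>,
}

impl Allowlist {
    pub fn new(initial: Snapshot) -> Self {
        Allowlist { current: RwLock::new(Arc::new(initial)) }
    }

    /// Per-connection read: clone the Arc and release the lock immediately.
    pub fn snapshot(&self) -> Arc<Snapshot> {
        let guard = self.current.read().expect("allowlist lock poisoned");
        Arc::clone(&*guard)
    }

    /// Called by the (hypothetical) config watcher after the debounce fires.
    pub fn swap(&self, next: Snapshot) {
        let mut guard = self.current.write().expect("allowlist lock poisoned");
        *guard = Arc::new(next);
    }
}
```

Connections that took a snapshot before the swap keep using the old map until they drop it, so a reload never mutates a set that an in-flight lookup is reading.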
Why JA4 (not JA3, not custom)
JA3 is end-of-life: Chrome 110+ randomises extension order; the same browser version produces multiple JA3 hashes.
JA4 (FoxIO-LLC, 2023) is deterministic by construction: it sorts the cipher and extension lists before hashing, which neutralises the randomisation.
The ja4 crate is mature, ~500 LoC, well-maintained.
JA4 string format is stable, human-readable, and trivial to grep in logs.
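The difference between the two schemes can be shown without any hashing. The sketch below compares a wire-order canonical string (JA3-style) with a sorted one (JA4-style) for two permutations of the same extension set. It is an illustration of the idea only: real JA4 additionally hashes the result with SHA-256 and truncates it to 12 hex characters, and the function names here are hypothetical.

```rust
/// GREASE values (RFC 8701) have two identical bytes whose low nibble is 0xA.
pub fn is_grease(v: u16) -> bool {
    let [hi, lo] = v.to_be_bytes();
    hi == lo && (lo & 0x0f) == 0x0a
}

/// JA3-style: drop GREASE but keep wire order, so a permuted list changes the string.
pub fn wire_order_string(exts: &[u16]) -> String {
    exts.iter()
        .filter(|&&e| !is_grease(e))
        .map(|e| e.to_string())
        .collect::<Vec<_>>()
        .join("-")
}

/// JA4-style: drop GREASE, SNI (0x0000) and ALPN (0x0010), sort, join as 4-digit hex.
pub fn sorted_string(exts: &[u16]) -> String {
    let mut v: Vec<u16> = exts
        .iter()
        .copied()
        .filter(|&e| !is_grease(e) && e != 0x0000 && e != 0x0010)
        .collect();
    v.sort_unstable();
    v.iter()
        .map(|e| format!("{:04x}", e))
        .collect::<Vec<_>>()
        .join(",")
}
```

Two ClientHellos from the same client that differ only in extension order and GREASE placement produce different wire-order strings but the same sorted string, which is why an allowlist keyed on JA4 stays stable.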
Trade-offs an operator must accept
Distribute an agent, not a stock browser. A stock Chrome/Firefox/Safari can't be parameterised per worker; its TLS fingerprint is determined by the libraries it links. The custom-fingerprint model requires the customer to ship a small native agent / tool (think Tailscale agent, Cloudflare Access agent) on each worker laptop. Documented up-front so no one expects "Chrome installed on a laptop = unique fingerprint per employee".
Fingerprint stability vs TLS-stack updates. Worker agent must pin its TLS library; auto-update of the OS TLS stack would silently change the fingerprint and lock the worker out. Mitigated by static linking on the agent side, not by Zion.
mTLS coexistence. mTLS continues to work; [tls.client_ca_path] and [tls.client_auth] are unaffected. Combined, the two checks compound: an attacker needs the cert key AND a TLS stack that reproduces the right fingerprint.
io_uring (--features io-uring-accept). The peek-before-rustls pattern needs to land between accept and the TLS acceptor. The uring multishot accept hands over a bare TCP stream, so the peek can run on it before rustls sees it. This should be compatible, but it is worth verifying with a soak test before claiming support.
Roadmap (incremental commits)
src/tls_fp.rs module: peek wrapper around TcpStream, tls-parser integration, JA4 compute. No config integration yet — pure library code with unit tests against canned ClientHello samples.
[tls.fingerprint] config + mode = "shadow" mode: log + counter (tls_fp_unknown_total, tls_fp_known_total{name="X"}), no blocking. Default-off in this commit. Lets operators run a 7–30 day shadow mode in production before flipping to enforce.
mode = "allowlist": socket close on miss, optional in-memory ban map (DashMap<JA4, banned_until>) so the second connection from a banned fingerprint short-circuits at peek time.
Header forwarding (X-Client-TLS-JA4, X-Client-TLS-Allowlisted), inbound-header strip on these (same hardening rule as X-Real-IP / X-Client-Cert-Fingerprint since v0.1.7).
Documentation (docs/security/tls-fingerprint.md) + integration tests.
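Total estimated diff: ~700 lines, +1 dependency (ja4 crate). tls-parser is assumed to already be in the tree as a transitive dependency; confirm with `cargo tree -i tls-parser`, because rustls parses TLS with its own code and does not pull tls-parser in by itself.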
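The optional ban map from the third commit could look like the sketch below. It swaps in a `Mutex<HashMap>` from the standard library for the `DashMap` named in the roadmap so the example has no dependencies, and it takes `now` as a parameter so expiry is testable. `BanMap` and its methods are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Fingerprint -> banned_until, with lazy expiry on lookup.
pub struct BanMap {
    ttl: Duration,
    inner: Mutex<HashMap<String, Instant>>,
}

impl BanMap {
    pub fn new(ttl: Duration) -> Self {
        BanMap { ttl, inner: Mutex::new(HashMap::new()) }
    }

    /// Record a rejection; the ban lasts `ttl` from `now`.
    pub fn ban(&self, ja4: &str, now: Instant) {
        let mut map = self.inner.lock().expect("ban map lock poisoned");
        map.insert(ja4.to_string(), now + self.ttl);
    }

    /// True while the ban is active; an expired entry is removed on the way out.
    pub fn is_banned(&self, ja4: &str, now: Instant) -> bool {
        let mut map = self.inner.lock().expect("ban map lock poisoned");
        match map.get(ja4).copied() {
            Some(until) if now < until => true,
            Some(_) => {
                map.remove(ja4);
                false
            }
            None => false,
        }
    }

    /// Number of entries currently stored (active or not yet expired lazily).
    pub fn len(&self) -> usize {
        self.inner.lock().expect("ban map lock poisoned").len()
    }
}
```

Lazy expiry keeps the hot path to one lookup, but entries for fingerprints that never reconnect are never removed, so a production version would also need a periodic sweep or a size cap.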
Acceptance criteria
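[tls.fingerprint] config absent or mode = "off" → zero overhead, zero peek calls.
mode = "shadow" → JA4 computed for every handshake, counters tls_fp_known_total{name="X"} / tls_fp_unknown_total exposed; NO blocking; structured log line per unknown.
mode = "allowlist" → unknown JA4 → socket.close() BEFORE rustls handshake started; tls_fp_unknown_drops_total counter bumped.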
Banned-set short-circuit: a connection whose JA4 was recently rejected exits at peek time (no JA4 recompute, just hash lookup). Configurable TTL (default 10 min).
Hot-reload of [[tls.fingerprint.allowed]] works via the Phase 1 watcher: add worker, save, ≤ 3 s later that JA4 is accepted.
Allowed routes per fingerprint enforced (regex / glob match, same path matcher as [[route]]).
Per-fingerprint rate-limit override applied in dispatch, isolated from the global per-IP rate-limit map.
Inbound X-Client-TLS-JA4 / X-Client-TLS-Allowlisted headers sent by the client are stripped before the request is forwarded upstream.
Compatible with --features io-uring-accept (the peek runs on each stream yielded by uring's accept_multishot, before the TLS acceptor consumes it), verified by a soak test.
cargo build --release --no-default-features is unaffected (the entire feature is opt-in via cargo feature tls-fingerprint).
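The three mode criteria above reduce to a small decision function. The sketch below is one possible reading, with hypothetical names (`Mode`, `Action`, `Counters`, `decide`). Whether an unknown fingerprint in allowlist mode bumps `tls_fp_unknown_total` in addition to `tls_fp_unknown_drops_total` is not specified in the criteria; bumping both is an assumption made here.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Off,
    Shadow,
    Allowlist,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Accept,
    Drop,
}

#[derive(Debug, Default)]
pub struct Counters {
    pub known: HashMap<String, u64>, // tls_fp_known_total{name="X"}
    pub unknown: u64,                // tls_fp_unknown_total
    pub unknown_drops: u64,          // tls_fp_unknown_drops_total
}

/// `matched` is the configured `name` of the allowlist entry, if the JA4 was found.
pub fn decide(mode: Mode, matched: Option<&str>, c: &mut Counters) -> Action {
    match (mode, matched) {
        // Off: the caller should not even peek; nothing is counted.
        (Mode::Off, _) => Action::Accept,
        (_, Some(name)) => {
            *c.known.entry(name.to_string()).or_insert(0) += 1;
            Action::Accept
        }
        // Shadow: observe only, never block.
        (Mode::Shadow, None) => {
            c.unknown += 1;
            Action::Accept
        }
        // Allowlist: default deny.
        (Mode::Allowlist, None) => {
            c.unknown += 1; // assumption: counted in both counters
            c.unknown_drops += 1;
            Action::Drop
        }
    }
}
```

Keeping the decision pure, with counters passed in, makes each acceptance criterion a one-line unit test.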
Out of scope
Bot detection by mismatch against a curated browser table (UA ↔ JA4): see #28 (Phase 3b: TLS-fingerprint ML scoring; signal, not gate). The two features can coexist on the same [tls.fingerprint] section in the future, but Phase 3a is strictly allowlist-only.
Distributing the worker-side TLS agent. Zion validates fingerprints; the agent is the customer's deliverable.
TLS 1.3 ECH (Encrypted Client Hello) handling. JA4 is reasonably stable across ECH variants but the testing matrix is large; track as a separate sub-issue once 3a lands.
Questions to settle in code review
Should the allowlist storage be a HashSet<String> (current proposal) or a more compact HashSet<[u8; 36]> (raw JA4 bytes)? The string is debugger-friendly, the byte form is faster to compare. Microbenchmark before deciding.
Default value of on_unknown when mode = "shadow": hard-coded to log_only (we are in shadow mode by definition), or configurable? Lean toward hard-coded for fewer foot-guns.
Should the boot-time validation refuse to start if mode = "allowlist" and the allowlist is empty? Otherwise the daemon silently drops every connection on first boot. Lean YES with a mode = "allowlist-empty-ok" opt-out for migration scenarios.
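For the first question, the byte-array key is easy to prototype ahead of the microbenchmark. A JA4 string is 10 characters of prefix, an underscore, 12 hex characters, an underscore, and 12 more hex characters, which gives the 36 bytes mentioned above. The sketch below validates that shape and produces a fixed-size key; `ja4_key` is a hypothetical helper and the validation rules are a minimal reading of the format, not the `ja4` crate's API.

```rust
pub const JA4_LEN: usize = 36;

/// Convert a JA4 string into a fixed-size key, rejecting malformed input.
pub fn ja4_key(s: &str) -> Option<[u8; JA4_LEN]> {
    let bytes = s.as_bytes();
    // Layout: prefix [0..10] '_' hash [11..23] '_' hash [24..36].
    if bytes.len() != JA4_LEN || bytes[10] != b'_' || bytes[23] != b'_' {
        return None;
    }
    let is_lower_hex =
        |part: &[u8]| part.iter().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if !is_lower_hex(&bytes[11..23]) || !is_lower_hex(&bytes[24..36]) {
        return None;
    }
    let mut key = [0u8; JA4_LEN];
    key.copy_from_slice(bytes);
    Some(key)
}
```

Validating at config-load time also catches truncated entries such as a pasted `t13d1517h2_...` before they reach the hot path, whichever storage type wins the benchmark.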
References
ja4 crate: https://crates.io/crates/ja4
tls-parser: https://crates.io/crates/tls-parser
src/main.rs::spawn_https_handler: the place where peek would slot in