How the moving parts relate, so a reader (or agent) can reason about why a
layout wins a workload — and so new code slots into the right category.
Companion to docs/workloads.md (per-workload mechanics) and
docs/layouts.md (per-layout index/storage). Keep this in sync
when a workload or layout is added (it's a source-of-truth relationship map).
The access pattern — not the game framing — determines which layout property matters. Each group names the layout fast-path it exercises.
| Access pattern | Workloads | Layout property that decides the winner |
|---|---|---|
| Span-emitting scan (emit row intervals over a bbox/disk, sort+coalesce, scan) | radius, rect, weighted_radius, find_first |
Chunk count per query + sort cost. row_major emits monotone spans (cheap); curves permute chunk order → sort. Small chunks (Gosper-7) explode span count. |
| Dense per-tile stencil (every tile reads its N neighbors) | stencil8/stencil6, diffusion, parallel_chunk_update |
Boundary fraction + interior stride. row_major: pure stride. Chunked: per-call index() at chunk edges → halo (*_halo_*) eliminates the boundary path; stencil*_update_to keeps interior on stride math. |
| Sparse per-tile scatter (random dirty list, N neighbors each) | sparse_update, flow_sampling |
Per-call index() cost (no spatial coherence to exploit) + cache scatter. row_major's cheap index() wins; Gosper-7's hex-cluster locality helps where neighbors stay in-cluster. |
| Frontier traversal (BFS / Dijkstra / A* expanding neighbors) | pathfind, connectivity, astar, flow_integration, distance_field |
Per-popped-tile index() cost; working set vs cache. A*'s heuristic shrinks the frontier so index() matters less than in BFS. |
| Line walk (step a ray, early-exit on blocker) | raycast, fov |
Per-step index() + divergent-direction locality. Square Bresenham is integer-cheap; hex lerp+cube-round is compute-bound (the hex fov ~18× cost). FoV's divergent rays are the one square workload curves beat row_major on. |
| Broad-phase / summary (per-chunk bitmap, no per-tile descent) | chunk_query, chunk_summary, hierarchical_radius (bitmap), predicate_scan |
Largely layout-independent (reads a coarse grid or 1-D array). Wins come from data density (bit-packing) / early-out, not layout. |
| Region / HPA (coarse region-graph search) | region_astar, region_astar_full |
Layout-independent (operates on the precomputed region map). Cost is region-graph size + per-segment tile-A* setup. |
Which layouts implement which fast-paths (✓) vs fall back to per-cell
index() (·). Workloads dispatch on has_uniform_stride_interior().
| Layout family | grid | uniform-stride interior | halo variants | chunk-ordered SFC | within-chunk SFC (full-block) |
|---|---|---|---|---|---|
row_major(_simd) / hex_row_major |
both | ✓ | n/a | n/a | n/a |
chunked_row_major_{32,64} (+hex) |
both | ✓ | ✓ _halo_{32,64} |
n/a | n/a |
morton/hilbert_chunked_32 (+hex) |
both | ✓ | ✓ _halo_{32,64} |
✓ | square-only (*_full_block_32) |
sierpinski_chunked_32 |
square | ✓ | ✓ | ✓ | ✓ sierpinski_full_block_32 |
block_hilbert_32 |
square | ✓ | ✓ | = hilbert (dup) | planned |
hex_gosper_7 |
hex | · (per-cell) | planned | cluster-Hilbert | n/a |
Gaps worth noting (see README Potential layouts): hex full-block; Peano / Gray-code (both grids, both forms); per-region/per-workload heterogeneous layouts; AoS-vs-SoA; sparse/skip-empty; triangle / hybrid grids.
Every parallel sibling shares the sequential workload's checksum. Two shapes:
- Per-call internal (one query split across threads by chunk-aligned row
bands):
stencil8/6_parallel,diffusion_parallel,parallel_chunk_update,chunk_summary_parallel,predicate_scan_parallel,curve_order_scan_parallel. - Per-query throughput (T independent queries; reports avg wall/query under
load):
astar_parallel,raycast_parallel,fov_parallel,find_first_parallel,hierarchical_radius_parallel,region_astar(_full)_parallel,chunk_query_parallel,sparse_update_parallel.
Hex shares one generic per-query-throughput harness
(src/hex/workloads/query_throughput_parallel.h). The genre scorer routes a
sequential workload to its sibling via analysis/genres.py
_PARALLEL_SIBLING_MAP; the sibling's CSV variant must match that map's
transform exactly. Metric note: for BOTH shapes, Hz × avg_wall_per_invocation
= wall-seconds/game-second — consistent, so the score captures real BW-scaling
(hex 6-neighbor vs square 8-neighbor) rather than hiding it.
Test coverage (T-invariance guards). The cross-layout equivalence
keystone (test_equivalence, test_hex_equivalence) covers every
registered layout — the hex passes are drift-guarded against
registered_hex_layout_names(), so the six *_halo_* variants are
included. Every parallel sibling above is also guarded for
T-invariance (identical aggregate/checksum ∀ T∈{1,2,4,8}, equal to the
sequential run):
test_equivalence(square):stencil8_parallel,diffusion_parallel,sparse_update_parallel,astar_parallel,raycast_parallel,fov_parallel,find_first_parallel,hierarchical_radius_{naive,bitmap}_parallel,chunk_query_parallel,chunk_summary_parallel,predicate_scan_{byte,bits}_parallel,curve_order_scan_parallel,parallel_chunk_update.test_region:region_astar_parallel,region_astar_full_parallel.test_hex_equivalence:stencil6_parallel,diffusion_parallel, thequery_throughput_parallelharness (Pass C — backs the hex per-query-throughput siblings), plushex_sparse_update_parallelandhex_parallel_chunk_update(internal-split, Pass C sibling).
How far a sibling keeps scaling as --thread-counts grows is set by how many
independent work units it splits into — NOT by hardware alone. Past that
ceiling the curve flattens as a partitioning artifact, not a memory-bandwidth
wall (mis-reading the latter is defect A4):
- Per-call internal band by chunk-rows to keep halo fast paths valid:
effective_threads = min(T, region_rows / max(B,1)). So at a 1024-row region: row_major (B=0, grain 1) → 1024 bands, scales to the host's full thread count;B=32chunked/halo → 32 bands (flat past T32);B=64→ 16 bands (flat past T16). Sub-chunk-row bands would break halo alignment, so this ceiling is by design. Computemin(T, rect_w/B)from the CSV before reading a chunked-layout plateau as saturation; the honest 192-thread stencil curve is the row_major one. - Per-query throughput ceil at the query-pool size. The A* pool is capped by
--astar-query-cap(default 16 → plateau at T16); the bare-metal scaling probe raises it (e.g. 768) on 1-2 layouts so all workers stay busy. Fixed across the T sweep ⇒ aggregate counts stay T-invariant.
Same algorithm shape, grid-adapted neighbors. stencil8↔stencil6,
radius(Euclidean)↔hex_radius(axial), rect↔parallelogram,
4-neighbor BFS↔6-neighbor, Bresenham↔lerp+cube-round. The genre dict is
authored with square names; score_genre._resolve_for_grid maps the hex
analogues (e.g. hex raycast has no angle bucket; hex has no
stencil_ring2_fast → naive stencil_ring2). _EXPECTED_ABSENT suppresses
the benign "other grid's kernel" zero-matches (stencil6 on square, etc.).
row_major wins compute-bound per-tile work (cheap index()); halo
chunked layouts close the dense-stencil gap to ≤1.4×; curves cluster within
±1% of each other (chunk size/shape dominates, not the encoder);
hex_gosper_7 wins hex-topology-native neighbor work and loses span work; and
no single layout wins everything — which is the entire point of the harness.