Workload × layout relationships

How the moving parts relate, so a reader (or agent) can reason about why a layout wins a workload — and so new code slots into the right category. Companion to docs/workloads.md (per-workload mechanics) and docs/layouts.md (per-layout index/storage). Keep this in sync when a workload or layout is added (it's a source-of-truth relationship map).

1. Workloads grouped by memory-access pattern

The access pattern — not the game framing — determines which layout property matters. Each group names the layout fast-path it exercises.

Access pattern	Workloads	Layout property that decides the winner
Span-emitting scan (emit row intervals over a bbox/disk, sort+coalesce, scan)	`radius`, `rect`, `weighted_radius`, `find_first`	Chunk count per query + sort cost. row_major emits monotone spans (cheap); curves permute chunk order → sort. Small chunks (Gosper-7) explode span count.
Dense per-tile stencil (every tile reads its N neighbors)	`stencil8`/`stencil6`, `diffusion`, `parallel_chunk_update`	Boundary fraction + interior stride. row_major: pure stride. Chunked: per-call `index()` at chunk edges → halo (`_halo_`) eliminates the boundary path; `stencil*_update_to` keeps interior on stride math.
Sparse per-tile scatter (random dirty list, N neighbors each)	`sparse_update`, `flow_sampling`	Per-call `index()` cost (no spatial coherence to exploit) + cache scatter. row_major's cheap `index()` wins; Gosper-7's hex-cluster locality helps where neighbors stay in-cluster.
Frontier traversal (BFS / Dijkstra / A* expanding neighbors)	`pathfind`, `connectivity`, `astar`, `flow_integration`, `distance_field`	Per-popped-tile `index()` cost; working set vs cache. A*'s heuristic shrinks the frontier so `index()` matters less than in BFS.
Line walk (step a ray, early-exit on blocker)	`raycast`, `fov`	Per-step `index()` + divergent-direction locality. Square Bresenham is integer-cheap; hex lerp+cube-round is compute-bound (the hex `fov` ~18× cost). FoV's divergent rays are the one square workload curves beat row_major on.
Broad-phase / summary (per-chunk bitmap, no per-tile descent)	`chunk_query`, `chunk_summary`, `hierarchical_radius` (bitmap), `predicate_scan`	Largely layout-independent (reads a coarse grid or 1-D array). Wins come from data density (bit-packing) / early-out, not layout.
Region / HPA (coarse region-graph search)	`region_astar`, `region_astar_full`	Layout-independent (operates on the precomputed region map). Cost is region-graph size + per-segment tile-A* setup.

2. Layout capability matrix

Which layouts implement which fast-paths (✓) vs fall back to per-cell index() (·). Workloads dispatch on has_uniform_stride_interior().

Layout family	grid	uniform-stride interior	halo variants	chunk-ordered SFC	within-chunk SFC (full-block)
`row_major(_simd)` / `hex_row_major`	both	✓	n/a	n/a	n/a
`chunked_row_major_{32,64}` (+hex)	both	✓	✓ `_halo_{32,64}`	n/a	n/a
`morton/hilbert_chunked_32` (+hex)	both	✓	✓ `_halo_{32,64}`	✓	square-only (`*_full_block_32`)
`sierpinski_chunked_32`	square	✓	✓	✓	✓ `sierpinski_full_block_32`
`block_hilbert_32`	square	✓	✓	= hilbert (dup)	planned
`hex_gosper_7`	hex	· (per-cell)	planned	cluster-Hilbert	n/a

Gaps worth noting (see README Potential layouts): hex full-block; Peano / Gray-code (both grids, both forms); per-region/per-workload heterogeneous layouts; AoS-vs-SoA; sparse/skip-empty; triangle / hybrid grids.

3. Sequential ↔ parallel sibling pairs

Every parallel sibling shares the sequential workload's checksum. Two shapes:

Per-call internal (one query split across threads by chunk-aligned row bands): stencil8/6_parallel, diffusion_parallel, parallel_chunk_update, chunk_summary_parallel, predicate_scan_parallel, curve_order_scan_parallel.
Per-query throughput (T independent queries; reports avg wall/query under load): astar_parallel, raycast_parallel, fov_parallel, find_first_parallel, hierarchical_radius_parallel, region_astar(_full)_parallel, chunk_query_parallel, sparse_update_parallel.

Hex shares one generic per-query-throughput harness (src/hex/workloads/query_throughput_parallel.h). The genre scorer routes a sequential workload to its sibling via analysis/genres.py _PARALLEL_SIBLING_MAP; the sibling's CSV variant must match that map's transform exactly. Metric note: for BOTH shapes, Hz × avg_wall_per_invocation = wall-seconds/game-second — consistent, so the score captures real BW-scaling (hex 6-neighbor vs square 8-neighbor) rather than hiding it.

Test coverage (T-invariance guards). The cross-layout equivalence keystone (test_equivalence, test_hex_equivalence) covers every registered layout — the hex passes are drift-guarded against registered_hex_layout_names(), so the six *_halo_* variants are included. Every parallel sibling above is also guarded for T-invariance (identical aggregate/checksum ∀ T∈{1,2,4,8}, equal to the sequential run):

test_equivalence (square): stencil8_parallel, diffusion_parallel, sparse_update_parallel, astar_parallel, raycast_parallel, fov_parallel, find_first_parallel, hierarchical_radius_{naive,bitmap}_parallel, chunk_query_parallel, chunk_summary_parallel, predicate_scan_{byte,bits}_parallel, curve_order_scan_parallel, parallel_chunk_update.
test_region: region_astar_parallel, region_astar_full_parallel.
test_hex_equivalence: stencil6_parallel, diffusion_parallel, the query_throughput_parallel harness (Pass C — backs the hex per-query-throughput siblings), plus hex_sparse_update_parallel and hex_parallel_chunk_update (internal-split, Pass C sibling).

Thread-scaling ceiling (read before interpreting a T sweep)

How far a sibling keeps scaling as --thread-counts grows is set by how many independent work units it splits into — NOT by hardware alone. Past that ceiling the curve flattens as a partitioning artifact, not a memory-bandwidth wall (mis-reading the latter is defect A4):

Per-call internal band by chunk-rows to keep halo fast paths valid: effective_threads = min(T, region_rows / max(B,1)). So at a 1024-row region: row_major (B=0, grain 1) → 1024 bands, scales to the host's full thread count; B=32 chunked/halo → 32 bands (flat past T32); B=64 → 16 bands (flat past T16). Sub-chunk-row bands would break halo alignment, so this ceiling is by design. Compute min(T, rect_w/B) from the CSV before reading a chunked-layout plateau as saturation; the honest 192-thread stencil curve is the row_major one.
Per-query throughput ceil at the query-pool size. The A* pool is capped by --astar-query-cap (default 16 → plateau at T16); the bare-metal scaling probe raises it (e.g. 768) on 1-2 layouts so all workers stay busy. Fixed across the T sweep ⇒ aggregate counts stay T-invariant.

4. Square ↔ hex analogues

Same algorithm shape, grid-adapted neighbors. stencil8↔stencil6, radius(Euclidean)↔hex_radius(axial), rect↔parallelogram, 4-neighbor BFS↔6-neighbor, Bresenham↔lerp+cube-round. The genre dict is authored with square names; score_genre._resolve_for_grid maps the hex analogues (e.g. hex raycast has no angle bucket; hex has no stencil_ring2_fast → naive stencil_ring2). _EXPECTED_ABSENT suppresses the benign "other grid's kernel" zero-matches (stencil6 on square, etc.).

5. The trade-off, in one line

row_major wins compute-bound per-tile work (cheap index()); halo chunked layouts close the dense-stencil gap to ≤1.4×; curves cluster within ±1% of each other (chunk size/shape dominates, not the encoder); hex_gosper_7 wins hex-topology-native neighbor work and loses span work; and no single layout wins everything — which is the entire point of the harness.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workload × layout relationships

1. Workloads grouped by memory-access pattern

2. Layout capability matrix

3. Sequential ↔ parallel sibling pairs

Thread-scaling ceiling (read before interpreting a T sweep)

4. Square ↔ hex analogues

5. The trade-off, in one line

FilesExpand file tree

relationships.md

Latest commit

History

relationships.md

File metadata and controls

Workload × layout relationships

1. Workloads grouped by memory-access pattern

2. Layout capability matrix

3. Sequential ↔ parallel sibling pairs

Thread-scaling ceiling (read before interpreting a T sweep)

4. Square ↔ hex analogues

5. The trade-off, in one line