Skip to content

Latest commit

 

History

History
117 lines (97 loc) · 8.01 KB

File metadata and controls

117 lines (97 loc) · 8.01 KB

Workload × layout relationships

How the moving parts relate, so a reader (or agent) can reason about why a layout wins a workload — and so new code slots into the right category. Companion to docs/workloads.md (per-workload mechanics) and docs/layouts.md (per-layout index/storage). Keep this in sync when a workload or layout is added (it's a source-of-truth relationship map).

1. Workloads grouped by memory-access pattern

The access pattern — not the game framing — determines which layout property matters. Each group names the layout fast-path it exercises.

Access pattern Workloads Layout property that decides the winner
Span-emitting scan (emit row intervals over a bbox/disk, sort+coalesce, scan) radius, rect, weighted_radius, find_first Chunk count per query + sort cost. row_major emits monotone spans (cheap); curves permute chunk order → sort. Small chunks (Gosper-7) explode span count.
Dense per-tile stencil (every tile reads its N neighbors) stencil8/stencil6, diffusion, parallel_chunk_update Boundary fraction + interior stride. row_major: pure stride. Chunked: per-call index() at chunk edges → halo (*_halo_*) eliminates the boundary path; stencil*_update_to keeps interior on stride math.
Sparse per-tile scatter (random dirty list, N neighbors each) sparse_update, flow_sampling Per-call index() cost (no spatial coherence to exploit) + cache scatter. row_major's cheap index() wins; Gosper-7's hex-cluster locality helps where neighbors stay in-cluster.
Frontier traversal (BFS / Dijkstra / A* expanding neighbors) pathfind, connectivity, astar, flow_integration, distance_field Per-popped-tile index() cost; working set vs cache. A*'s heuristic shrinks the frontier so index() matters less than in BFS.
Line walk (step a ray, early-exit on blocker) raycast, fov Per-step index() + divergent-direction locality. Square Bresenham is integer-cheap; hex lerp+cube-round is compute-bound (the hex fov ~18× cost). FoV's divergent rays are the one square workload curves beat row_major on.
Broad-phase / summary (per-chunk bitmap, no per-tile descent) chunk_query, chunk_summary, hierarchical_radius (bitmap), predicate_scan Largely layout-independent (reads a coarse grid or 1-D array). Wins come from data density (bit-packing) / early-out, not layout.
Region / HPA (coarse region-graph search) region_astar, region_astar_full Layout-independent (operates on the precomputed region map). Cost is region-graph size + per-segment tile-A* setup.

2. Layout capability matrix

Which layouts implement which fast-paths (✓) vs fall back to per-cell index() (·). Workloads dispatch on has_uniform_stride_interior().

Layout family grid uniform-stride interior halo variants chunk-ordered SFC within-chunk SFC (full-block)
row_major(_simd) / hex_row_major both n/a n/a n/a
chunked_row_major_{32,64} (+hex) both _halo_{32,64} n/a n/a
morton/hilbert_chunked_32 (+hex) both _halo_{32,64} square-only (*_full_block_32)
sierpinski_chunked_32 square sierpinski_full_block_32
block_hilbert_32 square = hilbert (dup) planned
hex_gosper_7 hex · (per-cell) planned cluster-Hilbert n/a

Gaps worth noting (see README Potential layouts): hex full-block; Peano / Gray-code (both grids, both forms); per-region/per-workload heterogeneous layouts; AoS-vs-SoA; sparse/skip-empty; triangle / hybrid grids.

3. Sequential ↔ parallel sibling pairs

Every parallel sibling shares the sequential workload's checksum. Two shapes:

  • Per-call internal (one query split across threads by chunk-aligned row bands): stencil8/6_parallel, diffusion_parallel, parallel_chunk_update, chunk_summary_parallel, predicate_scan_parallel, curve_order_scan_parallel.
  • Per-query throughput (T independent queries; reports avg wall/query under load): astar_parallel, raycast_parallel, fov_parallel, find_first_parallel, hierarchical_radius_parallel, region_astar(_full)_parallel, chunk_query_parallel, sparse_update_parallel.

Hex shares one generic per-query-throughput harness (src/hex/workloads/query_throughput_parallel.h). The genre scorer routes a sequential workload to its sibling via analysis/genres.py _PARALLEL_SIBLING_MAP; the sibling's CSV variant must match that map's transform exactly. Metric note: for BOTH shapes, Hz × avg_wall_per_invocation = wall-seconds/game-second — consistent, so the score captures real BW-scaling (hex 6-neighbor vs square 8-neighbor) rather than hiding it.

Test coverage (T-invariance guards). The cross-layout equivalence keystone (test_equivalence, test_hex_equivalence) covers every registered layout — the hex passes are drift-guarded against registered_hex_layout_names(), so the six *_halo_* variants are included. Every parallel sibling above is also guarded for T-invariance (identical aggregate/checksum ∀ T∈{1,2,4,8}, equal to the sequential run):

  • test_equivalence (square): stencil8_parallel, diffusion_parallel, sparse_update_parallel, astar_parallel, raycast_parallel, fov_parallel, find_first_parallel, hierarchical_radius_{naive,bitmap}_parallel, chunk_query_parallel, chunk_summary_parallel, predicate_scan_{byte,bits}_parallel, curve_order_scan_parallel, parallel_chunk_update.
  • test_region: region_astar_parallel, region_astar_full_parallel.
  • test_hex_equivalence: stencil6_parallel, diffusion_parallel, the query_throughput_parallel harness (Pass C — backs the hex per-query-throughput siblings), plus hex_sparse_update_parallel and hex_parallel_chunk_update (internal-split, Pass C sibling).

Thread-scaling ceiling (read before interpreting a T sweep)

How far a sibling keeps scaling as --thread-counts grows is set by how many independent work units it splits into — NOT by hardware alone. Past that ceiling the curve flattens as a partitioning artifact, not a memory-bandwidth wall (mis-reading the latter is defect A4):

  • Per-call internal band by chunk-rows to keep halo fast paths valid: effective_threads = min(T, region_rows / max(B,1)). So at a 1024-row region: row_major (B=0, grain 1) → 1024 bands, scales to the host's full thread count; B=32 chunked/halo → 32 bands (flat past T32); B=64 → 16 bands (flat past T16). Sub-chunk-row bands would break halo alignment, so this ceiling is by design. Compute min(T, rect_w/B) from the CSV before reading a chunked-layout plateau as saturation; the honest 192-thread stencil curve is the row_major one.
  • Per-query throughput ceil at the query-pool size. The A* pool is capped by --astar-query-cap (default 16 → plateau at T16); the bare-metal scaling probe raises it (e.g. 768) on 1-2 layouts so all workers stay busy. Fixed across the T sweep ⇒ aggregate counts stay T-invariant.

4. Square ↔ hex analogues

Same algorithm shape, grid-adapted neighbors. stencil8stencil6, radius(Euclidean)↔hex_radius(axial), rect↔parallelogram, 4-neighbor BFS↔6-neighbor, Bresenham↔lerp+cube-round. The genre dict is authored with square names; score_genre._resolve_for_grid maps the hex analogues (e.g. hex raycast has no angle bucket; hex has no stencil_ring2_fast → naive stencil_ring2). _EXPECTED_ABSENT suppresses the benign "other grid's kernel" zero-matches (stencil6 on square, etc.).

5. The trade-off, in one line

row_major wins compute-bound per-tile work (cheap index()); halo chunked layouts close the dense-stencil gap to ≤1.4×; curves cluster within ±1% of each other (chunk size/shape dominates, not the encoder); hex_gosper_7 wins hex-topology-native neighbor work and loses span work; and no single layout wins everything — which is the entire point of the harness.