Precise reference for every storage layout: the (x, y) → flat index
mapping, the chunk/storage model, halo semantics, world-size
constraints, and the exact CSV layout string each emits.
Adjacent docs:
docs/workloads.md— per-workload algorithms and the CSV schema layouts feed into;docs/perf/README.md— sanitizer / profile / optimization-record builds for verifying layout codegen (e.g. confirminglx & (B-1)constant-folds).
This document covers mechanics of the implemented layouts only. For
the pipeline of shipped / planned / considered layouts (the source of
truth for what exists), see
roadmap.md → Potential layouts. For
measured numbers, see RESULTS.md.
Source: src/layouts/* (square), src/hex/layouts/* (hex). Registries
(the authoritative name → type mapping) are src/layouts/registry.cpp
and src/hex/registry.cpp.
All layouts model the same world as a structure-of-arrays (one flat
array per field); a layout is purely a permutation of logical tiles
into flat slots. Each satisfies the LayoutT / HexLayoutT concept
(src/layouts/layout.h, src/hex/layout.h), and is dispatched through a
std::variant + std::visit so every workload call monomorphizes on the
concrete type (no virtual dispatch). Key members:
| Member | Meaning |
|---|---|
index(x, y) |
The flat index for logical tile (x, y) (hex: (q, r)). The defining operation. |
chunk_side() |
Chunk side B for chunked layouts; 0 for non-chunked. Span workloads use it to compute chunks_touched; 0 means "not applicable". |
storage_size() |
Flat-array extent. Equals W·H for canonical layouts; larger for persistent-halo layouts. All SoA arrays and any buffer indexed by index() are sized to this. |
populate_halos(buf) |
Fills halo cells from canonical centers (persistent-halo layouts only; no-op otherwise). |
radius_spans_exact, rect_spans |
Emit physical-address-sorted spans for the disk / rectangle. |
stencil8_update / stencil6_update (+ _to) |
The dense-stencil kernel; the _to form takes explicit in/out buffers (used by diffusion). |
row_indices(...) |
Fill a row of indices via stride math (used by flow_direction). |
has_uniform_stride_interior() |
true when chunk interiors admit ±1/±B pointer math; false only for hex_gosper_7. Workloads with chunk-aware fast paths branch on this. |
chunk_id_for_sort(x, y) |
A cache-locality bucket key (diagnostic; not currently used by any shipping workload — see README → Potential optimizations → Tested, rejected). |
- Non-chunked (
row_major*,hex_row_major,hex_gosper_7):chunk_side() == 0. - Canonical chunked / curve: the grid is tiled into
B × Bchunks.index = chunk_id · B² + ly · B + lx, where(lx, ly)is the cell's position inside its chunk andchunk_idorders the chunks. Storage isW·H. The chunk order is what distinguishes the families:- chunked_row_major —
chunk_id = cy · chunks_x + cx(row-major). - morton / hilbert —
chunk_id = Encoder(cx, cy)via a precomputedchunk_id_lut_(soindex()pays one LUT load, not a per-call encode).
- chunked_row_major —
- Persistent halo (
*_halo_*): each chunk is stored as(B+2)²cells — a 1-cell border of halo copies around theB × Bcenters. Center(lx, ly)sits at storage offsetchunk_id · (B+2)² + (ly+1)·(B+2) + (lx+1). Halos hold copies of the adjacent chunks' edge cells, populated once at world init viapopulate_halos, so the stencil kernel reads all 9 neighbors from one chunk's storage with no cross-chunkindex()and no boundary branch. Cost: storage overhead(B+2)²/B²(1.13× at B=32, 1.06× at B=64); halo cells are not addressable throughindex()— only the stencil kernel touches them.
- Morton (Z-order): bit-interleave
xandy. Works for any range. - Hilbert: the standard power-of-two
xy → dencode (Wikipedia / Press et al. reference). Better locality than Morton (consecutive chunk_ids are spatial neighbors), at a higher encode cost — amortized to a LUT load here.
Registered in src/layouts/registry.cpp. All carry grid_kind = "square".
layout string |
chunk_side() |
Index / storage | World-size constraint | Notes |
|---|---|---|---|---|
row_major_dense |
0 | y·W + x |
none | Baseline. Emits exact Euclidean row intervals for radius (same precision the chunked variants get). |
row_major_dense_simd |
0 | y·W + x |
none | Same mapping as row_major_dense; only the stencil8 interior loop differs (Google Highway, runtime ISA dispatch). |
chunked_row_major_32 |
32 | chunk_id·B² + ly·B + lx, chunk_id = cy·chunks_x + cx |
W%32==0, H%32==0 |
Per-call halo scratch in stencil8 (B=32). |
chunked_row_major_64 |
64 | as above, B=64 |
W%64==0, H%64==0 |
|
chunked_row_major_halo_32 |
32 | (B+2)²-per-chunk storage |
W%32==0, H%32==0 |
Persistent halo (1.13× storage). |
chunked_row_major_halo_64 |
64 | (B+2)²-per-chunk storage |
W%64==0, H%64==0 |
Persistent halo (1.06× storage). |
morton_chunked_32 |
32 | curve chunk_id via Morton LUT |
W%32==0, square chunk grid, power-of-two side |
Per-call halo (B=32). |
hilbert_chunked_32 |
32 | curve chunk_id via Hilbert LUT |
same as Morton | Per-call halo (B=32). |
block_hilbert_32 |
32 | identical to hilbert_chunked_32 |
same | Type-distinct stub for the future kernel-filtered variant; algorithmically the clipped Hilbert (a milestone-1 duplicate of hilbert_chunked_32). |
morton_chunked_halo_32 / _64 |
32 / 64 | Morton order + persistent halo | curve constraint + W%B==0 |
|
hilbert_chunked_halo_32 / _64 |
32 / 64 | Hilbert order + persistent halo | same | |
block_hilbert_halo_32 / _64 |
32 / 64 | as hilbert_chunked_halo_* |
same |
Built only with -DLAYOUT_BENCH_CURVE_B_SWEEP=ON; diagnostic only. They
add morton_chunked_{16,64,128}, hilbert_chunked_{16,64,128},
block_hilbert_{16,64,128} (same families, varied chunk side). Boundary
fraction in stencil8 ≈ 4/B; curve LUT footprint = (W/B)² × 8 B.
Registered in src/hex/registry.cpp. The world is a parallelogram
0 ≤ q < W, 0 ≤ r < H in axial coordinates (pointy-top hexes, six
constant neighbor offsets — no parity branch). All carry
grid_kind = "hex".
layout string |
chunk_side() |
Index / storage | World-size constraint | Notes |
|---|---|---|---|---|
hex_row_major |
0 | r·W + q |
none | Byte-for-byte the square row-major formula; only neighbor/distance semantics differ. |
hex_chunked_row_major_32 / _64 |
32 / 64 | chunk_id·B² + lr·B + lq, chunk_id = cr·chunks_q + cq |
W%B==0, H%B==0 |
B×B parallelogram chunks. |
hex_chunked_row_major_halo_32 / _64 |
32 / 64 | (B+2)²-per-chunk storage |
W%B==0, H%B==0 |
Persistent halo. |
hex_morton_chunked_32 |
32 | Morton-ordered chunks (LUT) | curve constraint | Encoder reused from the square side (operates on integer pairs). |
hex_hilbert_chunked_32 |
32 | Hilbert-ordered chunks (LUT) | curve constraint | |
hex_morton_chunked_halo_32 / _64 |
32 / 64 | Morton order + persistent halo | curve + W%B==0 |
|
hex_hilbert_chunked_halo_32 / _64 |
32 / 64 | Hilbert order + persistent halo | same | |
hex_gosper_7 |
0 | per-cell index_table_ lookup (size W·H) |
W%7==0, H%7==0 |
The only hex-native chunk shape: 7-cell Gosper clusters (center + 6 neighbors). Cluster ordering is Hilbert over (cluster_a, cluster_b) as a stand-in for the true Gosper L-system (the L-system needs 7^k cluster counts). has_uniform_stride_interior() == false → workloads fall back to per-cell loops. |
gcd(7, 32) = 1, so no single W satisfies both Gosper-7 (W % 7 == 0)
and the power-of-two-chunk-side requirement of the curve layouts at
B = 32. hex_layout_bench runs each requested (W, H) and silently
skips layouts that throw at construction. Two suggested worlds:
W = 256 for the chunked / curve family, W = 224 (= 32 × 7) for
hex_row_major, hex_chunked_row_major_32, and hex_gosper_7.