Skip to content

Latest commit

 

History

History
131 lines (107 loc) · 8.52 KB

File metadata and controls

131 lines (107 loc) · 8.52 KB

Layouts

Precise reference for every storage layout: the (x, y) → flat index mapping, the chunk/storage model, halo semantics, world-size constraints, and the exact CSV layout string each emits.

Adjacent docs: docs/workloads.md — per-workload algorithms and the CSV schema layouts feed into; docs/perf/README.md — sanitizer / profile / optimization-record builds for verifying layout codegen (e.g. confirming lx & (B-1) constant-folds).

This document covers mechanics of the implemented layouts only. For the pipeline of shipped / planned / considered layouts (the source of truth for what exists), see roadmap.md → Potential layouts. For measured numbers, see RESULTS.md.

Source: src/layouts/* (square), src/hex/layouts/* (hex). Registries (the authoritative name → type mapping) are src/layouts/registry.cpp and src/hex/registry.cpp.


What every layout provides

All layouts model the same world as a structure-of-arrays (one flat array per field); a layout is purely a permutation of logical tiles into flat slots. Each satisfies the LayoutT / HexLayoutT concept (src/layouts/layout.h, src/hex/layout.h), and is dispatched through a std::variant + std::visit so every workload call monomorphizes on the concrete type (no virtual dispatch). Key members:

Member Meaning
index(x, y) The flat index for logical tile (x, y) (hex: (q, r)). The defining operation.
chunk_side() Chunk side B for chunked layouts; 0 for non-chunked. Span workloads use it to compute chunks_touched; 0 means "not applicable".
storage_size() Flat-array extent. Equals W·H for canonical layouts; larger for persistent-halo layouts. All SoA arrays and any buffer indexed by index() are sized to this.
populate_halos(buf) Fills halo cells from canonical centers (persistent-halo layouts only; no-op otherwise).
radius_spans_exact, rect_spans Emit physical-address-sorted spans for the disk / rectangle.
stencil8_update / stencil6_update (+ _to) The dense-stencil kernel; the _to form takes explicit in/out buffers (used by diffusion).
row_indices(...) Fill a row of indices via stride math (used by flow_direction).
has_uniform_stride_interior() true when chunk interiors admit ±1/±B pointer math; false only for hex_gosper_7. Workloads with chunk-aware fast paths branch on this.
chunk_id_for_sort(x, y) A cache-locality bucket key (diagnostic; not currently used by any shipping workload — see README → Potential optimizations → Tested, rejected).

Chunk and storage models

  • Non-chunked (row_major*, hex_row_major, hex_gosper_7): chunk_side() == 0.
  • Canonical chunked / curve: the grid is tiled into B × B chunks. index = chunk_id · B² + ly · B + lx, where (lx, ly) is the cell's position inside its chunk and chunk_id orders the chunks. Storage is W·H. The chunk order is what distinguishes the families:
    • chunked_row_majorchunk_id = cy · chunks_x + cx (row-major).
    • morton / hilbertchunk_id = Encoder(cx, cy) via a precomputed chunk_id_lut_ (so index() pays one LUT load, not a per-call encode).
  • Persistent halo (*_halo_*): each chunk is stored as (B+2)² cells — a 1-cell border of halo copies around the B × B centers. Center (lx, ly) sits at storage offset chunk_id · (B+2)² + (ly+1)·(B+2) + (lx+1). Halos hold copies of the adjacent chunks' edge cells, populated once at world init via populate_halos, so the stencil kernel reads all 9 neighbors from one chunk's storage with no cross-chunk index() and no boundary branch. Cost: storage overhead (B+2)²/B² (1.13× at B=32, 1.06× at B=64); halo cells are not addressable through index() — only the stencil kernel touches them.

Curve encoders (src/layouts/curve_encoders.h)

  • Morton (Z-order): bit-interleave x and y. Works for any range.
  • Hilbert: the standard power-of-two xy → d encode (Wikipedia / Press et al. reference). Better locality than Morton (consecutive chunk_ids are spatial neighbors), at a higher encode cost — amortized to a LUT load here.

Square layouts (layout_bench, grid_kind = "square")

Registered in src/layouts/registry.cpp. All carry grid_kind = "square".

layout string chunk_side() Index / storage World-size constraint Notes
row_major_dense 0 y·W + x none Baseline. Emits exact Euclidean row intervals for radius (same precision the chunked variants get).
row_major_dense_simd 0 y·W + x none Same mapping as row_major_dense; only the stencil8 interior loop differs (Google Highway, runtime ISA dispatch).
chunked_row_major_32 32 chunk_id·B² + ly·B + lx, chunk_id = cy·chunks_x + cx W%32==0, H%32==0 Per-call halo scratch in stencil8 (B=32).
chunked_row_major_64 64 as above, B=64 W%64==0, H%64==0
chunked_row_major_halo_32 32 (B+2)²-per-chunk storage W%32==0, H%32==0 Persistent halo (1.13× storage).
chunked_row_major_halo_64 64 (B+2)²-per-chunk storage W%64==0, H%64==0 Persistent halo (1.06× storage).
morton_chunked_32 32 curve chunk_id via Morton LUT W%32==0, square chunk grid, power-of-two side Per-call halo (B=32).
hilbert_chunked_32 32 curve chunk_id via Hilbert LUT same as Morton Per-call halo (B=32).
block_hilbert_32 32 identical to hilbert_chunked_32 same Type-distinct stub for the future kernel-filtered variant; algorithmically the clipped Hilbert (a milestone-1 duplicate of hilbert_chunked_32).
morton_chunked_halo_32 / _64 32 / 64 Morton order + persistent halo curve constraint + W%B==0
hilbert_chunked_halo_32 / _64 32 / 64 Hilbert order + persistent halo same
block_hilbert_halo_32 / _64 32 / 64 as hilbert_chunked_halo_* same

B-sweep variants (opt-in)

Built only with -DLAYOUT_BENCH_CURVE_B_SWEEP=ON; diagnostic only. They add morton_chunked_{16,64,128}, hilbert_chunked_{16,64,128}, block_hilbert_{16,64,128} (same families, varied chunk side). Boundary fraction in stencil84/B; curve LUT footprint = (W/B)² × 8 B.


Hex layouts (hex_layout_bench, grid_kind = "hex")

Registered in src/hex/registry.cpp. The world is a parallelogram 0 ≤ q < W, 0 ≤ r < H in axial coordinates (pointy-top hexes, six constant neighbor offsets — no parity branch). All carry grid_kind = "hex".

layout string chunk_side() Index / storage World-size constraint Notes
hex_row_major 0 r·W + q none Byte-for-byte the square row-major formula; only neighbor/distance semantics differ.
hex_chunked_row_major_32 / _64 32 / 64 chunk_id·B² + lr·B + lq, chunk_id = cr·chunks_q + cq W%B==0, H%B==0 B×B parallelogram chunks.
hex_chunked_row_major_halo_32 / _64 32 / 64 (B+2)²-per-chunk storage W%B==0, H%B==0 Persistent halo.
hex_morton_chunked_32 32 Morton-ordered chunks (LUT) curve constraint Encoder reused from the square side (operates on integer pairs).
hex_hilbert_chunked_32 32 Hilbert-ordered chunks (LUT) curve constraint
hex_morton_chunked_halo_32 / _64 32 / 64 Morton order + persistent halo curve + W%B==0
hex_hilbert_chunked_halo_32 / _64 32 / 64 Hilbert order + persistent halo same
hex_gosper_7 0 per-cell index_table_ lookup (size W·H) W%7==0, H%7==0 The only hex-native chunk shape: 7-cell Gosper clusters (center + 6 neighbors). Cluster ordering is Hilbert over (cluster_a, cluster_b) as a stand-in for the true Gosper L-system (the L-system needs 7^k cluster counts). has_uniform_stride_interior() == false → workloads fall back to per-cell loops.

Hex world-size note

gcd(7, 32) = 1, so no single W satisfies both Gosper-7 (W % 7 == 0) and the power-of-two-chunk-side requirement of the curve layouts at B = 32. hex_layout_bench runs each requested (W, H) and silently skips layouts that throw at construction. Two suggested worlds: W = 256 for the chunked / curve family, W = 224 (= 32 × 7) for hex_row_major, hex_chunked_row_major_32, and hex_gosper_7.