Airframe is where this engine was born. The founding constraint from the original design spec was deliberate and explicit:
"v0 targets one pinned model artifact and fails fast otherwise."
"Preferred target model file:
TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf"
That choice drove the entire design. The engine was built FP32-first (correctness before
optimization), with a single pinned model so there was no ambiguity in what "correct" meant.
Once the CPU reference was provably correct against fixtures, the GPU bindless pipeline was added
and proven to match the CPU reference, resulting in the standalone airframe crate that powers
Shimmy v2.0.
The 0.06 mean absolute logit error between the GPU engine and the CPU reference is the documented tolerance. It is lower precision than llama.cpp's Q8_0-input path — and that is intentional. Airframe is higher precision; the difference comes from llama.cpp voluntarily adding quantization noise to its activations for AVX2 efficiency.
All of these are implemented in airframe/src/core/dequant/ with both a CPU reference and a validated WGSL GPU dequant shader. GPU vs CPU agreement is confirmed per-type via quant_verify.
| Type | GGML ID | Block size | Bytes/block | quant_verify | Notes |
|---|---|---|---|---|---|
F32 |
0 | 1 | 4 | ✅ | Raw floats — maximum precision |
F16 |
1 | 1 | 2 | ✅ | Half-precision floats |
Q4_0 |
2 | 32 | 18 | ✅ | Original 4-bit format. Used by TinyLlama |
Q8_0 |
8 | 32 | 34 | ✅ | 8-bit activations / high-quality quant |
Q4_K |
12 | 256 (superblock) | 144 | ✅ | K-quant 4-bit. Main weights in Q4_K_M GGUFs |
Q5_K |
13 | 256 (superblock) | 176 | ✅ | K-quant 5-bit. Mixed-precision layers in Q4_K_M/Q5_K_M |
Q6_K |
14 | 256 (superblock) | 210 | ✅ | K-quant 6-bit. Output head and embeddings |
What this means in practice: Any standard Q4_K_M or Q5_K_M GGUF will load and run — those formats use Q4_K, Q5_K, and Q6_K across their layers, all of which are fully supported.
Defined in airframe/src/core/spec.rs. Context window is read from each model's GGUF {arch}.context_length key — no hardcoded limits.
| Architecture | GGUF general.architecture |
Representative model | quant_verify | Status |
|---|---|---|---|---|
| Llama | "llama" |
TinyLlama 1.1B, Llama-3.2-1B/3B | ✅ | Fully validated — deterministic inference confirmed |
| Mistral | "mistral" |
Mistral-7B | ✅ (same path as Llama) | Declared; same tensor layout as Llama |
| Phi-2 | "phi2" |
phi-2 | ✅ | GPU math verified |
| Phi-3 | "phi3" |
phi3-mini-4k | ✅ | GPU math verified; fused QKV (attn_qkv.weight) requires code change before full inference |
| Gemma-2 | "gemma2" |
gemma-2-2b-it | ✅ | GPU math verified |
| StarCoder2 | "starcoder2" |
starcoder2-3b | ✅ | GPU math verified |
| GPT-2 | "gpt2" |
gpt2 | ✅ | GPU math verified |
| Qwen2 | "qwen2" |
qwen2-7b-instruct | pending remote GPU | >2 GB GGUF; needs ≥8 GB VRAM |
| Other | any | — | — | Loads if weight names follow blk.N.* convention |
GPU math verified = quant_verify confirmed GPU dequant matches CPU reference for all tensor types in the model.
Fused QKV note (Phi-3/Phi-3.5): These models store blk.N.attn_qkv.weight as a single fused tensor instead of separate attn_q, attn_k, attn_v. Airframe's current metadata loader expects the split form. Full Phi-3 inference support requires a one-pass QKV split during load — tracked on the roadmap.
All supported architectures share the GGUF weight tensor naming convention (blk.N.*). The architecture enum controls how hyperparameters are read from metadata — not the forward pass shader logic.
If you have a GGUF that uses a supported architecture and any of the seven quantization types above, it will load without any code changes.
# Any Llama-family Q4_0 GGUF
SHIMMY_BASE_GGUF=/path/to/Llama-3.2-1B-Instruct-Q4_0.gguf ./shimmy serve
# Any Q4_K_M GGUF (Q4_K_M means most layers are Q4_K, output head is Q6_K)
SHIMMY_BASE_GGUF=/path/to/Mistral-7B-Instruct-v0.2-Q4_K_M.gguf ./shimmy serve
# Extended context (if the model supports it)
SHIMMY_BASE_GGUF=/path/to/model.gguf SHIMMY_MAX_CTX=4096 ./shimmy serveAuto-discovery also finds GGUFs in ~/.cache/huggingface/, ~/.ollama/models/, and any path
listed in SHIMMY_MODEL_PATHS. Only TinyLlama 1.1B has been exhaustively validated against the
Airframe engine; others load on the documented quant/arch support.
Selecting a model for chat: Use the model name reported by shimmy list as the model field
in your OpenAI API request.
The process is mechanical once the GGML spec for the quant type is known.
In airframe/src/core/ggml_types.rs:
pub enum GgmlType {
F32 = 0,
Q4_0 = 2,
Q4_K = 12,
Q6_K = 14,
Q8_0 = 8, // ← add here
}Add the from_u32 match arm, the display name, and the bytes_for_elements calculation.
For Q8_0 specifically:
- Block size: 32 elements
- Layout: 32 ×
i8values + 2-bytef16scale = 34 bytes per block - Dequant:
w[i] = i8_val[i] * scale(simpler than Q4_0 — no nibble unpacking)
Add a new file airframe/src/core/dequant/q8_0.rs following the pattern of q4_0.rs.
The shader reads the packed block, extracts the scale, and writes dequantized F32 values
to the output buffer.
In airframe/src/core/dequant/mod.rs, add a match arm for the new type so the loader
routes it to the correct shader.
Run the conformance harness against a fixture produced by the CPU reference implementation. The acceptance criterion from the airframe expansion strategy:
- Q8_0: expected max logit error ~0.1 vs FP32 reference (near-FP16 quality)
- Q5_0: expected ~5-10% perplexity improvement over Q4_0
Estimated effort per quant type: 3–6 hours based on prior engineering estimates.
- Q8_0 — simplest dequant (no nibble packing), best quality jump, wide model availability
- Q5_0 — 5-bit, midpoint between Q4_0 and Q8_0 quality
- Q5_K — K-quant 5-bit superblock, needed for Q5_K_M GGUFs
- Q2_K — aggressive compression, useful for very large models on limited VRAM
The engine's forward pass is architecture-agnostic at the shader level — it executes matrix multiplies against weights it finds by name. Adding a new architecture means teaching the engine how to find those weights and what hyperparameters to read.
In airframe/src/core/spec.rs, add to ModelArch and its From<&str> match:
pub enum ModelArch {
Llama,
Mistral,
Phi,
Gemma,
Qwen2, // ← new
Other(String),
}Different architectures use different GGUF metadata key prefixes. In spec.rs's from_gguf()
method, ensure the prefix is correctly extracted. Most Llama-family derivatives use identical
keys with their architecture name as the prefix (qwen2.embedding_length, etc.).
If the new architecture uses non-standard GGUF tensor names (not blk.N.*), update the
layer_prefix() method. Llama, Mistral, Phi, and Gemma all use blk so this is only needed
for unusual architectures.
Some architectures differ in their attention variant (sliding window in Mistral, QKNorm in
Gemma 2, etc.). These require additions to airframe/src/family/. For architectures that
are strict Llama-derivatives (Llama 3.x, Mistral, most Qwen2 models), the existing forward
pass handles them without modification.
The engine allocates VRAM at load time based on model weights + KV cache. Planning numbers:
| Model | Quant | Approx GGUF size | KV cache at 2048 ctx | Min VRAM |
|---|---|---|---|---|
| TinyLlama 1.1B | Q4_0 | ~638 MB | ~88 MB | ~800 MB |
| TinyLlama 1.1B | Q8_0 | ~1.1 GB | ~88 MB | ~1.3 GB |
| Llama 3.2 1B | Q4_0 | ~636 MB | ~128 MB | ~900 MB |
| Llama 3.2 3B | Q4_0 | ~1.9 GB | ~448 MB | ~2.5 GB |
| Mistral 7B | Q4_K_M | ~4.1 GB | ~512 MB | ~5 GB |
| Llama 3 8B | Q4_K_M | ~4.7 GB | ~512 MB | ~5.5 GB |
KV cache scales linearly with SHIMMY_MAX_CTX. At 4096 tokens it doubles; at 8192 it quadruples.
At 7B+ with 4096 context, 8 GB VRAM is the practical floor.
Based on current engine state:
Phase 1 (same arch, new quants) — ~1-2 weeks
└─ Q8_0 dequant shader
└─ Q5_0 dequant shader
└─ Validate on TinyLlama Q8_0 GGUF
Phase 2 (same quants, larger models) — days, mostly testing
└─ Llama 3.2 3B Q4_0 validation
└─ Mistral 7B Q4_K_M validation
└─ Update README validated model matrix
Phase 3 (quantization depth) — ~2-3 weeks
└─ Q5_K / Q5_K_M support
└─ Q2_K support (enables very large models at high compression)
Phase 4 (architecture breadth) — depends on arch
└─ Qwen2 explicit support
└─ DeepSeek-specific ops (MLA attention)
└─ MoE dispatch (see docs/AIRFRAME_MOE_ROADMAP.md)
Each phase is independent. Phase 1 (Q8_0) is the highest leverage step — it unlocks better quality on the same models without any architecture work, and the dequant code is simpler to write than Q4_0 was.