7 rounds · 19 experiments · 2026
This repository documents a seven-round investigation into whether alternative computational geometries can improve upon the standard dot-product attention mechanism used in transformer language models.
The starting question was simple: is the way attention works the best way, or just the most obvious way? Every modern LLM uses the same basic trick — compute how similar each word is to every other word using a dot product, and route information accordingly. This study asked what happens when you swap that primitive for something geometrically richer.
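For reference, the primitive being questioned can be sketched in a few lines. This is a minimal NumPy illustration of single-head scaled dot-product attention, not code from the repository; the function names and the toy shapes are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v). Returns (output, weights)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # one scalar similarity per query/key pair
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # route values according to the weights

# Toy shapes, chosen only for illustration.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 16))
K = rng.standard_normal((6, 16))
V = rng.standard_normal((6, 8))
out, weights = dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

The line computing `scores` is the part the later rounds swap out; everything after it stays the same.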
The investigation covered five architectural ideas: spectral/frequency representations, Hopfield associative memory theory, p-adic ultrametric geometry, Fourier Neural Operators (FNO), and Grassmannian manifold similarity. Not all of them worked. The ones that didn't are documented here too, because knowing why an idea fails is as useful as knowing why one succeeds.
Standard dot-product attention has a sharp, previously uncharacterised failure boundary: when the number of items to retrieve approaches the embedding dimension (the d/N = 1 threshold), accuracy drops from ~100% to ~20% in a single step. Grassmannian attention, which replaces scalar dot-product similarity with subspace overlap similarity, moves this boundary outward by at least a factor of two at identical parameter cost. A separate thread revealed the likely mechanism behind LLM hallucination: when two entities have similar key representations, attention returns a blend of both entities' values. This is a spurious retrieval effect, not a capacity failure. Fourier Neural Operators were precisely characterised: excellent at fixed-domain physics simulation, structurally unable to predict at positions outside the input domain, and of no benefit for language modelling.
experiments/
round-1/ Spectral diagnostics on GPT-2 weights and attention patterns
round-2/ Compression attempts — can we exploit spectral structure?
round-3/ Hopfield theory — capacity and spurious retrieval
round-4/ FNO, p-adic geometry — first alternative architectures
round-5/ Grassmannian attention — first tests
round-6/ Phase transition characterisation — the core finding
round-7/ FNO on physics tasks — completing the picture
Each experiment folder contains:
- exp-name.py: the runnable experiment script
- exp-name.md: what was asked, what was found, and what it means in plain English
- *-results.json: the raw numerical results
| Experiment | One-line verdict |
|---|---|
| exp-attn-spectrum | Attention patterns ARE spectrally concentrated — but for structural reasons, not semantic ones |
| exp-activation-spectrum | Activations show a U-shaped depth profile; final layer is anomalously compressed |
| exp-svd-rank | Weight matrices are spectrally flat — no compression structure exists |
| Experiment | One-line verdict |
|---|---|
| exp-attn-mask-baseline | The spectral pattern comes from learned head geometry, not the causal mask |
| exp-attn-fft-perplexity | FFT compression fails catastrophically — the signal lives in the high-frequency components |
| Experiment | One-line verdict |
|---|---|
| exp-hopfield-capacity | Capacity is not the problem — a GPT-2 head has room for 10¹³ memories |
| exp-attn-spurious-retrieval | Spurious retrieval is real — similar keys produce value-blended outputs; this is the hallucination mechanism |
| Experiment | One-line verdict |
|---|---|
| exp-fno-hierarchy | FNO is 6× slower than attention on hierarchical retrieval — wrong tool |
| exp-fno-recall | FNO outperforms attention on 2-hop recall — because attention has a structural weakness there |
| exp-padic-geometry | Euclidean space can represent tree distances perfectly when d ≥ N |
| exp-padic-attention | Oracle p-adic geometry adds nothing — the failure is in learning, not geometry |
| Experiment | One-line verdict |
|---|---|
| exp-grassmann-hierarchy | Grassmannian converges 2.5× faster on hierarchical retrieval |
| exp-grassmann-recall | Grassmannian is slightly worse on 2-hop recall — identifies a limitation |
| exp-grassmann-k | Increasing subspace dimension k doesn't help 2-hop tasks — the problem is elsewhere |
| Experiment | One-line verdict |
|---|---|
| exp-depth-sweep | Sharp failure at d/N = 1; Grassmannian spans the transition at 97.3% vs standard's 19.1% |
| exp-depth-params | The advantage is geometric, not parametric — confirmed with matched parameter counts |
| Experiment | One-line verdict |
|---|---|
| exp-fno-pde | FNO wins on 1D heat equation (rel-L2 = 0.0078 vs 0.0108) — learnable frequency weights are key |
| exp-fno-periodic | FNO completely fails on extrapolation — structural inability to predict outside input domain |
| exp-fno-modes | FNO is parameter-efficient above ~500K params; Transformer wins below that |
When the number of items to be discriminated (N) reaches the embedding dimension (d), standard dot-product attention fails — not gradually, but sharply. Accuracy drops from 99.9% to 19.1% as tree depth increases by a single level (32 → 64 leaves in a 64-dimensional space). This is a structural failure, not a training failure.
The geometric reason: dot-product similarity is a 1D projection. When you need to discriminate 64 items using 64-dimensional embeddings, a single scalar projection cannot simultaneously satisfy all pairwise distance constraints. The problem is overdetermined.
Replacing the dot-product similarity (1D projection) with subspace overlap similarity (2D subspace comparison) solves the packing problem at d/N = 1, achieving 97.3% where standard attention gets 19.1%. Critically, a parameter-matched variant — with identical Q/K parameter counts — achieves 97.9%. The advantage is geometric, not a result of having more parameters.
The tradeoff: Grassmannian logits are non-negative (‖Q^T K‖_F² ≥ 0 always), which prevents the sharp suppression needed for multi-hop chaining tasks. It is the right tool for selection tasks; standard attention remains better for chain-following.
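The subspace overlap score and its non-negativity can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: it assumes each query and key is a d × k matrix (k = 2 here) whose columns are orthonormalised with a QR decomposition before scoring, and the helper names `orthonormal_basis` and `subspace_overlap_scores` are hypothetical.

```python
import numpy as np

def orthonormal_basis(M):
    """Orthonormalise the columns of each (d, k) matrix in a batch of shape (n, d, k)."""
    # Assumption: QR is used to obtain a basis; the study may do this differently.
    return np.stack([np.linalg.qr(m)[0] for m in M])

def subspace_overlap_scores(Qs, Ks):
    """Qs: (n_q, d, k), Ks: (n_k, d, k). Returns (n_q, n_k) scores ||Q_i^T K_j||_F^2."""
    # G[i, j] is the (k, k) matrix Q_i^T K_j.
    G = np.einsum('idk,jdl->ijkl', Qs, Ks)
    # Squared Frobenius norm: a sum of squares, so never negative.
    return (G ** 2).sum(axis=(-2, -1))

rng = np.random.default_rng(0)
d, k, n = 64, 2, 5
bases = orthonormal_basis(rng.standard_normal((n, d, k)))
scores = subspace_overlap_scores(bases, bases)

print("all scores non-negative:", bool((scores >= 0).all()))
print("self-overlap equals k:", bool(np.allclose(np.diag(scores), k)))
```

With orthonormal bases the score lies between 0 (orthogonal subspaces) and k (identical subspaces). That bounded, non-negative range is the property the tradeoff above refers to: a key can be scored at zero, but never below it.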
Using the Hopfield/attention equivalence (Ramsauer et al., 2020), this study confirmed that the mechanism behind factual confabulation is spurious retrieval: when two entities share similar key representations, attention produces an output that blends both entities' values. At key similarity 0.99, the retrieved value is almost equally consistent with the correct entity's fact and the foil's fact, while rank-1 key accuracy remains 1.0. The model retrieved the right key but blended the wrong value into the output.
Standard 1/√d scaling attenuates this bias by 11× but doesn't eliminate it. This points to key-orthogonalisation (contrastive loss on keys during training) as the architecturally motivated mitigation, not larger context windows or more capacity.
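The value-blend effect can be reproduced in a toy setting. The sketch below is not the study's experiment. It assumes two unit-norm keys with cosine similarity exactly 0.99, one-hot values standing in for the two "facts", a query equal to the correct key, and an inverse temperature `beta` of 1. All of these choices are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def make_keys(d, cos_sim, rng):
    """Return two unit-norm keys whose cosine similarity is exactly cos_sim."""
    a = rng.standard_normal(d)
    a /= np.linalg.norm(a)
    b = rng.standard_normal(d)
    b -= (b @ a) * a                  # remove the component along a
    b /= np.linalg.norm(b)
    foil = cos_sim * a + np.sqrt(1.0 - cos_sim ** 2) * b
    return np.stack([a, foil])

rng = np.random.default_rng(0)
d = 64
keys = make_keys(d, 0.99, rng)
values = np.eye(2)                    # row 0: correct fact, row 1: foil fact
query = keys[0]                       # query matches the correct key exactly
beta = 1.0                            # assumed inverse temperature

logits = beta * (keys @ query)
weights = softmax(logits)
retrieved = weights @ values

print("top-1 key index:", int(np.argmax(logits)))
print("attention weights:", weights.round(4))
print("retrieved value:", retrieved.round(4))
```

The top-ranked key is the correct one, yet the softmax weights are close to an even split, so the retrieved value is nearly as much foil as fact. How strongly this shows up in a real model depends on key norms and temperature, which this toy example does not try to match.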
Fourier Neural Operators win on operator learning tasks where the input and output are on the same spatial domain (e.g., the 1D heat equation). They fail completely on any task requiring prediction at new positions — a structural limitation of the rfft → learnable weights → irfft pipeline. For language modelling, FNO provides no advantage and significant overhead.
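The rfft → learnable weights → irfft pipeline can be sketched as a single spectral convolution. This is a simplified, single-channel NumPy illustration, not the repository's FNO: the function name `spectral_conv_1d`, the random weights standing in for learned ones, and the grid size are assumptions. A full FNO layer also has multiple channels, a pointwise linear path, and a nonlinearity.

```python
import numpy as np

def spectral_conv_1d(x, weights, n_modes):
    """x: (n_points,) real signal. weights: (n_modes,) complex multipliers.

    Keeps the lowest n_modes frequencies, scales each by its weight,
    and transforms back onto the same grid.
    """
    n = x.shape[-1]
    x_hat = np.fft.rfft(x)                        # to frequency space
    out_hat = np.zeros_like(x_hat)
    out_hat[:n_modes] = x_hat[:n_modes] * weights # per-mode multiplier
    return np.fft.irfft(out_hat, n=n)             # back to the input grid

rng = np.random.default_rng(0)
n_points, n_modes = 128, 16
grid = np.linspace(0.0, 1.0, n_points, endpoint=False)
x = np.sin(2 * np.pi * grid) + 0.5 * np.sin(2 * np.pi * 3 * grid)

# Random complex weights stand in for parameters that training would learn.
weights = rng.standard_normal(n_modes) + 1j * rng.standard_normal(n_modes)
y = spectral_conv_1d(x, weights, n_modes)

# With all-ones weights the layer returns the band-limited input unchanged.
identity = spectral_conv_1d(x, np.ones(n_modes, dtype=complex), n_modes)
print(y.shape, bool(np.allclose(identity, x)))
```

The output is defined only on the grid the input was sampled on, which is one way to see why this pipeline has nothing to say about positions outside that domain.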
Each experiment script lists its own imports. Common dependencies:
pip install torch transformers numpy scipy

Some experiments use datasets (Hugging Face) for downloading GPT-2 attention weights. All experiments run on CPU or a single GPU.
If you use these findings, build on the experiments, or reference this study, please cite it as:
@misc{voidwolf2026novelllm,
author = {Voidwolf},
title = {Novel LLM Architectures: An Experimental Study},
year = {2026},
howpublished = {\url{https://github.com/thevoidwolf/novel-llm-architectures}},
note = {Independent research: 7 rounds, 19 experiments across Grassmannian attention,
FNO, p-adic geometry, and spectral structure}
}

Plain text:
Voidwolf. Novel LLM Architectures: An Experimental Study. 2026. https://github.com/thevoidwolf/novel-llm-architectures
- Hopfield/attention equivalence: Ramsauer et al., Hopfield Networks is All You Need (2020)
- Grassmannian manifolds in ML: Hamm & Lee (2008), various graph neural network literature
- FNO: Li et al., Fourier Neural Operator for Parametric Partial Differential Equations (2020)
- p-adic / ultrametric geometry: various literature on hierarchical embeddings
- Related work — Grassmannian attention: Zhang Chong, Attention Is Not What You Need (December 2025) proposes a Causal Grassmann Layer that encodes local token pairs as 2D subspaces on the Grassmann manifold via Plücker coordinates, replacing the attention matrix entirely with Grassmann flows. This is parallel work with a distinct approach: that paper proposes an attention-free architecture, whereas this study uses subspace overlap (‖Q^T K‖_F²) as a drop-in replacement for the dot-product similarity score within standard attention. The phase transition finding, parameter-matched geometric vs parametric comparison, and the effect on the transition boundary are independent of Zhang Chong's work. arXiv:2512.19428
The specific claims in this study — the d/N phase transition boundary, the geometric confirmation of Grassmannian's advantage via parameter matching, and the value-blend characterisation of spurious retrieval at practical similarity levels — are not, to the author's knowledge, documented in prior literature in this form. Prior art checks are recommended before asserting novelty in any publication context.
This research was conducted as an independent exploration of LLM architecture. All experiments were run on consumer hardware (NVIDIA RTX 3060).