thevoidwolf/novel-llm-architectures

Novel LLM Architectures — An Experimental Study

7 rounds · 19 experiments · 2026


What is this?

This repository documents a seven-round investigation into whether alternative computational geometries can improve upon the standard dot-product attention mechanism used in transformer language models.

The starting question was simple: is the way attention works the best way, or just the most obvious way? Nearly every modern LLM uses the same basic trick: compute how similar each token is to every other token using a dot product, and route information accordingly. This study asked what happens when you swap that primitive for something geometrically richer.

The investigation covered five architectural ideas: spectral/frequency representations, Hopfield associative memory theory, p-adic ultrametric geometry, Fourier Neural Operators (FNO), and Grassmannian manifold similarity. Not all of them worked. The ones that didn't are documented here too, because knowing why an idea fails is as useful as knowing why one succeeds.


The results in one paragraph

Standard dot-product attention has a sharp failure boundary that, to the author's knowledge, was previously uncharacterised: when the number of items to retrieve (N) reaches the embedding dimension (d), the d/N = 1 threshold, accuracy drops from ~100% to ~20% in a single step. Grassmannian attention, which replaces scalar dot-product similarity with subspace overlap similarity, moves this boundary outward by at least a factor of two, with identical parameter cost. A separate thread revealed the likely mechanism behind LLM hallucination: when two entities have similar key representations, attention returns a blend of both entities' values. This is a spurious retrieval effect, not a capacity failure. Fourier Neural Operators were precisely characterised: excellent at fixed-domain physics simulation, structurally unable to handle language tasks.


Repository structure

experiments/
  round-1/   Spectral diagnostics on GPT-2 weights and attention patterns
  round-2/   Compression attempts — can we exploit spectral structure?
  round-3/   Hopfield theory — capacity and spurious retrieval
  round-4/   FNO, p-adic geometry — first alternative architectures
  round-5/   Grassmannian attention — first tests
  round-6/   Phase transition characterisation — the core finding
  round-7/   FNO on physics tasks — completing the picture

Each experiment folder contains:

  • exp-name.py — the runnable experiment script
  • exp-name.md — what was asked, what was found, and what it means in plain English
  • *-results.json — the raw numerical results

The experiments at a glance

Round 1 — Spectral diagnostics

Experiment                   One-line verdict
exp-attn-spectrum            Attention patterns ARE spectrally concentrated, but for structural reasons, not semantic ones
exp-activation-spectrum      Activations show a U-shaped depth profile; final layer is anomalously compressed
exp-svd-rank                 Weight matrices are spectrally flat; no compression structure exists

Round 2 — Can we compress using spectral structure?

Experiment                   One-line verdict
exp-attn-mask-baseline       The spectral pattern comes from learned head geometry, not the causal mask
exp-attn-fft-perplexity      FFT compression fails catastrophically; the signal lives in the high-frequency components

Round 3 — What does Hopfield theory predict about hallucination?

Experiment                   One-line verdict
exp-hopfield-capacity        Capacity is not the problem; a GPT-2 head has room for 10¹³ memories
exp-attn-spurious-retrieval  Spurious retrieval is real: similar keys produce value-blended outputs; this is the hallucination mechanism

Round 4 — Alternative architectures

Experiment                   One-line verdict
exp-fno-hierarchy            FNO is 6× slower than attention on hierarchical retrieval; wrong tool
exp-fno-recall               FNO outperforms attention on 2-hop recall, because attention has a structural weakness there
exp-padic-geometry           Euclidean space can represent tree distances perfectly when d ≥ N
exp-padic-attention          Oracle p-adic geometry adds nothing; the failure is in learning, not geometry

Round 5 — Grassmannian attention: first tests

Experiment                   One-line verdict
exp-grassmann-hierarchy      Grassmannian converges 2.5× faster on hierarchical retrieval
exp-grassmann-recall         Grassmannian is slightly worse on 2-hop recall; identifies a limitation
exp-grassmann-k              Increasing subspace dimension k doesn't help 2-hop tasks; the problem is elsewhere

Round 6 — The phase transition

Experiment                   One-line verdict
exp-depth-sweep              Sharp failure at d/N = 1; Grassmannian spans the transition at 97.3% vs standard's 19.1%
exp-depth-params             The advantage is geometric, not parametric; confirmed with matched parameter counts

Round 7 — FNO's actual domain

Experiment                   One-line verdict
exp-fno-pde                  FNO wins on 1D heat equation (rel-L2 = 0.0078 vs 0.0108); learnable frequency weights are key
exp-fno-periodic             FNO completely fails on extrapolation; structural inability to predict outside input domain
exp-fno-modes                FNO is parameter-efficient above ~500K params; Transformer wins below that

Key findings

1. Standard attention has a phase transition at d/N = 1

When the number of items to be discriminated (N) reaches the embedding dimension (d), standard dot-product attention fails — not gradually, but sharply. Accuracy drops from 99.9% to 19.1% as tree depth increases by a single level (32 → 64 leaves in a 64-dimensional space). This is a structural failure, not a training failure.

The geometric reason: dot-product similarity is a 1D projection. When you need to discriminate 64 items using 64-dimensional embeddings, a single scalar projection cannot simultaneously satisfy all pairwise distance constraints. The problem is overdetermined.
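
The arithmetic behind the 32 → 64 step can be spelled out. The sketch below assumes a balanced binary tree, so the leaf count doubles with each level; that matches the 32 → 64 figures above but is not stated explicitly in the write-up. The helper name d_over_n is hypothetical.

```python
# Sketch only: assumes binary branching (N = 2 ** depth), which is an
# assumption consistent with the 32 -> 64 leaf counts quoted above.
def d_over_n(d: int, depth: int) -> float:
    """Ratio of embedding dimension d to the number of leaves N at a given depth."""
    n_leaves = 2 ** depth
    return d / n_leaves


d = 64  # embedding dimension quoted in the finding
ratios = {depth: d_over_n(d, depth) for depth in range(3, 8)}

for depth, ratio in ratios.items():
    print(f"depth={depth}  N={2 ** depth:4d}  d/N={ratio:g}")
```

Under that assumption, depth 5 gives N = 32 and d/N = 2, while depth 6 gives N = 64 and d/N = 1, the point at which the reported accuracy collapses.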

2. Grassmannian attention moves the boundary

Replacing the dot-product similarity (1D projection) with subspace overlap similarity (2D subspace comparison) solves the packing problem at d/N = 1, achieving 97.3% where standard attention gets 19.1%. Critically, a parameter-matched variant — with identical Q/K parameter counts — achieves 97.9%. The advantage is geometric, not a result of having more parameters.

The tradeoff: Grassmannian logits are non-negative (‖Q^T K‖_F² ≥ 0 always), which prevents the sharp suppression needed for multi-hop chaining tasks. It is the right tool for selection tasks; standard attention remains better for chain-following.
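
The subspace overlap score and its non-negativity can be illustrated in a few lines of NumPy. This is a minimal sketch, not the repository's implementation: it assumes each token is represented by an orthonormal basis of shape (d, k) with k = 2, obtained here by QR on random matrices, and the function names are hypothetical. How the experiments actually map tokens to subspaces is not described above.

```python
import numpy as np


def orthonormal_basis(m: np.ndarray) -> np.ndarray:
    """Orthonormalise the columns of a (d, k) matrix with a reduced QR."""
    q, _ = np.linalg.qr(m)
    return q


def subspace_overlap(q_basis: np.ndarray, k_basis: np.ndarray) -> float:
    """Squared Frobenius norm of Q^T K; lies in [0, k] for orthonormal bases."""
    return float(np.sum((q_basis.T @ k_basis) ** 2))


rng = np.random.default_rng(0)
d, k = 64, 2  # assumed sizes: 64-dim embeddings, 2D subspaces
Q = orthonormal_basis(rng.standard_normal((d, k)))
K = orthonormal_basis(rng.standard_normal((d, k)))

same = subspace_overlap(Q, Q)   # identical subspaces give the maximum, k
cross = subspace_overlap(Q, K)  # unrelated subspaces give a small value >= 0

# Rotating the basis inside K's subspace leaves the score unchanged:
# the score depends on the subspace, not on the basis chosen for it.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
rotated = subspace_overlap(Q, K @ R)

# A plain dot product, by contrast, flips sign when the key is negated.
q_vec, k_vec = rng.standard_normal(d), rng.standard_normal(d)
dot_scores = (float(q_vec @ k_vec), float(q_vec @ -k_vec))

print(f"same={same:.3f} cross={cross:.3f} rotated={rotated:.3f}")
print(f"dot products: {dot_scores[0]:.3f}, {dot_scores[1]:.3f}")
```

Because the score is a sum of squares, it can never push a logit below zero, which is the limitation the tradeoff paragraph describes.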

3. Hallucination is a value-blend problem, not a capacity problem

Using the Hopfield/attention equivalence (Ramsauer et al. 2020), this study identified spurious retrieval as the likely mechanism behind factual confabulation: when two entities share similar key representations, attention returns a blend of both entities' values. At key similarity 0.99, the retrieved value is almost equally consistent with the correct entity's fact and the foil's fact, while rank-1 key accuracy remains 1.0. The model ranked the right key first but still blended the wrong value into the output.
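
A toy version of this effect is easy to reproduce. The sketch below is an illustration under stated assumptions, not the experiment script: it uses two unit-norm keys with cosine similarity 0.99, one-hot values standing in for the two "facts", a query equal to the correct key, and unscaled softmax attention. All variable names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


rng = np.random.default_rng(0)
d = 64
cos_sim = 0.99  # key similarity quoted above

# Correct key: a random unit vector.
k_correct = rng.standard_normal(d)
k_correct /= np.linalg.norm(k_correct)

# Foil key: unit vector at cosine similarity 0.99 to the correct key.
noise = rng.standard_normal(d)
noise -= (noise @ k_correct) * k_correct  # remove the component along k_correct
noise /= np.linalg.norm(noise)
k_foil = cos_sim * k_correct + np.sqrt(1 - cos_sim ** 2) * noise

keys = np.stack([k_correct, k_foil])
values = np.eye(2)   # value 0 = correct fact, value 1 = foil's fact
query = k_correct    # the query matches the correct key exactly

logits = keys @ query            # [1.0, 0.99]
weights = softmax(logits)
output = weights @ values        # blended value vector
top1 = int(np.argmax(logits))    # index of the best-matching key

print(f"top-1 key index: {top1}")
print(f"attention weights: {weights.round(4)}")
```

The correct key still ranks first, yet the attention weights come out at roughly 0.5025 and 0.4975, so the output is nearly an even mix of the two values.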

Standard 1/√d scaling attenuates this bias by 11× but doesn't eliminate it. This points to key-orthogonalisation (contrastive loss on keys during training) as the architecturally motivated mitigation, not larger context windows or more capacity.
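
One simple way to express a key-orthogonalisation objective is a penalty on the off-diagonal cosine similarities between keys. The sketch below is an assumption about what such a term could look like, not the loss used or proposed in the experiments; the function name is hypothetical, and a real training setup would write the same expression in an autodiff framework such as PyTorch.

```python
import numpy as np


def key_orthogonality_penalty(keys: np.ndarray) -> float:
    """Mean squared off-diagonal cosine similarity between the rows of `keys`.

    Returns 0 when all keys are mutually orthogonal and approaches 1 as
    keys become (anti-)parallel.
    """
    unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    gram = unit @ unit.T                       # pairwise cosine similarities
    n = gram.shape[0]
    off_diag = gram[~np.eye(n, dtype=bool)]    # drop the self-similarities
    return float(np.mean(off_diag ** 2))


# Four mutually orthogonal keys in 8 dimensions: no penalty.
orthogonal_keys = np.eye(4, 8)

# Two keys at cosine similarity 0.99: heavy penalty.
near_duplicates = np.array([[1.0, 0.0],
                            [0.99, np.sqrt(1 - 0.99 ** 2)]])

p_orth = key_orthogonality_penalty(orthogonal_keys)
p_dup = key_orthogonality_penalty(near_duplicates)
print(f"orthogonal keys: {p_orth:.4f}   near-duplicate keys: {p_dup:.4f}")
```

Added to the training loss with a small weight, a term like this pushes keys of distinct entities apart, which targets the blend directly.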

4. FNO belongs in physics, not language

Fourier Neural Operators win on operator learning tasks where the input and output are on the same spatial domain (e.g., the 1D heat equation). They fail completely on any task requiring prediction at new positions — a structural limitation of the rfft → learnable weights → irfft pipeline. For language modelling, FNO provides no advantage and significant overhead.
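
The rfft → learnable weights → irfft pipeline can be sketched for a single channel as follows. This is a simplified illustration, not the model from the experiments: a full FNO layer also has multiple channels, a pointwise linear path, and a nonlinearity, and the weights here are fixed placeholders rather than learned values. The function name is hypothetical.

```python
import numpy as np


def spectral_conv_1d(u: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Single-channel spectral convolution on a uniform periodic grid.

    u:       (n,) real samples of the input function
    weights: (modes,) complex multipliers for the lowest `modes` frequencies
             (placeholders for what would be learned parameters)
    """
    n = u.shape[0]
    modes = weights.shape[0]
    u_hat = np.fft.rfft(u)                      # (n // 2 + 1,) complex spectrum
    out_hat = np.zeros_like(u_hat)
    out_hat[:modes] = u_hat[:modes] * weights   # keep and reweight low modes only
    return np.fft.irfft(out_hat, n=n)           # back onto the same n grid points


n, modes = 64, 8
x = np.linspace(0.0, 1.0, n, endpoint=False)
u = np.sin(2 * np.pi * x) + 0.5 * np.sin(2 * np.pi * 20 * x)

weights = np.ones(modes, dtype=complex)  # identity on retained modes
v = spectral_conv_1d(u, weights)

print(v.shape)  # same shape as the input grid
```

With identity weights the layer acts as a low-pass filter: the mode-1 sine survives and the mode-20 component is discarded. The output is defined only on the grid points the input was sampled on, which is consistent with the extrapolation failure described above.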


Dependencies

Each experiment script lists its own imports. Common dependencies:

pip install torch transformers numpy scipy

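Some experiments also need the Hugging Face datasets package, which the command above does not install (pip install datasets). GPT-2 weights themselves are downloaded through transformers on first use. All experiments run on CPU or a single GPU.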


Citing this work

If you use these findings, build on the experiments, or reference this study, please cite it as:

@misc{voidwolf2026novelllm,
  author       = {Voidwolf},
  title        = {Novel LLM Architectures: An Experimental Study},
  year         = {2026},
  howpublished = {\url{https://github.com/thevoidwolf/novel-llm-architectures}},
  note         = {Independent research: 7 rounds, 19 experiments across Grassmannian attention,
                  FNO, p-adic geometry, and spectral structure}
}

Plain text:

Voidwolf. Novel LLM Architectures: An Experimental Study. 2026. https://github.com/thevoidwolf/novel-llm-architectures


Prior work and attribution

  • Hopfield/attention equivalence: Ramsauer et al., Hopfield Networks is All You Need (2020)
  • Grassmannian manifolds in ML: Hamm & Lee (2008), various graph neural network literature
  • FNO: Li et al., Fourier Neural Operator for Parametric Partial Differential Equations (2020)
  • p-adic / ultrametric geometry: various literature on hierarchical embeddings
  • Related work, Grassmannian attention: Zhang Chong, Attention Is Not What You Need (December 2025, arXiv:2512.19428) proposes a Causal Grassmann Layer that encodes local token pairs as 2D subspaces on the Grassmann manifold via Plücker coordinates, replacing the attention matrix entirely with Grassmann flows. This is parallel work with a distinct approach: that paper proposes an attention-free architecture, whereas this study uses subspace overlap (‖Q^T K‖_F²) as a drop-in replacement for the dot-product similarity score within standard attention. The phase transition finding, the parameter-matched geometric vs parametric comparison, and the effect on the transition boundary are independent of Zhang Chong's work.

The specific claims in this study — the d/N phase transition boundary, the geometric confirmation of Grassmannian's advantage via parameter matching, and the value-blend characterisation of spurious retrieval at practical similarity levels — are not, to the author's knowledge, documented in prior literature in this form. Prior art checks are recommended before asserting novelty in any publication context.


This research was conducted as an independent exploration of LLM architecture. All experiments were run on consumer hardware (NVIDIA RTX 3060).
