A small, runnable seed of the "GPU-native, founded-by-physicists" architecture from our conversation. It is dense matmul + FFT + softmax throughout, i.e. exactly what a GPU wants, and it trains end to end with one weight-tied block solved to a fixed point.
Honest scope: this is a DEQ-transformer with holographic positional binding and parallel denoising, not a frontier system. The point is pedagogical and architectural: the ordinary transformer falls out of this as the special case where you (a) freeze the block feedforward instead of solving to equilibrium and (b) decode autoregressively instead of in parallel. EAM is the more general object; the transformer is one corner of it.
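
To make the "special case" claim concrete, here is a minimal, dependency-free sketch. It is not the code in model.py: `toy_block`, `solve`, the matrix `W` and the input `x` are all hypothetical stand-ins, and the block is a small tanh map chosen only because it is a contraction. One application of the block plays the role of the feedforward corner; iterating the same weights until the update vanishes gives the equilibrium.

```python
import math

def toy_block(z, x, w):
    """One weight-tied update z <- tanh(W z + x). Hypothetical stand-in for a block."""
    return [math.tanh(sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + x_i)
            for row, x_i in zip(w, x)]

def solve(x, w, max_steps=200, tol=1e-10):
    """Apply the same block repeatedly until the relative update drops below tol."""
    z = [0.0] * len(x)
    for _ in range(max_steps):
        z_next = toy_block(z, x, w)
        num = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_next, z)))
        den = math.sqrt(sum(a * a for a in z_next)) + 1e-12
        z = z_next
        if num / den < tol:
            break
    return z

# Row sums of |W| are below 1 and tanh is 1-Lipschitz, so the map is a contraction.
W = [[0.2, -0.1, 0.0],
     [0.1, 0.3, -0.2],
     [0.0, 0.1, 0.2]]
x = [0.5, -0.3, 0.8]

z_feedforward = toy_block([0.0] * 3, x, W)   # a single pass through the block
z_star = solve(x, W)                         # the equilibrium of the same block

# At equilibrium, one more application of the block changes nothing (up to tol).
fixed_point_gap = max(abs(a - b) for a, b in zip(toy_block(z_star, x, W), z_star))
pass_vs_equilibrium = max(abs(a - b) for a, b in zip(z_feedforward, z_star))
```

The two outputs differ, which is the point: depth here is a solver setting rather than a count of distinct layers.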
Want the scale-ready version? See SCALING.md and scaled_model.py / scaled_train.py: a modern transformer (RMSNorm, SwiGLU, RoPE, QK-norm, GQA, sliding-window, optional MoE) with weight-tied recurrent depth as a test-time-compute knob in place of a literal fixed point, plus an honest per-pillar verdict on which of these four moves actually survive at scale.

Does that recurrent-depth knob actually work? See FINDINGS.md for a seven-experiment investigation (exp*.py, figures in figures/). It covers how the dial matches a 16-layer transformer in-distribution at 1/16 the parameters and beats it out-of-distribution, where it breaks (latent drift on chaotic dynamics), and the re-grounding fix that gives perfect extrapolation to 32× the trained horizon.

| Thesis move | Concrete mechanism | Code |
|---|---|---|
| Equilibrium / implicit depth | one weight-tied block iterated to a fixed point (Deep Equilibrium model); last grad_tail steps carry gradient (Jacobian-Free Backprop), so training memory is independent of solver depth | model.py: EquilibriumAssocMemory.solve |
| Associative memory == attention | softmax(QK^T/sqrt d)V is the modern Hopfield retrieval rule; the block exposes the Hopfield energy as a diagnostic | model.py: AssocBlock._attn, hopfield_energy |
| Vector-symbolic representation | position encoded by HRR circular-convolution binding of a role vector into each token, not additive positional encoding | model.py: circular_bind, inject |
| Non-autoregressive generation | trained as a masked denoiser; decoded by parallel iterative unmasking on a cosine schedule (MaskGIT-style) | model.py: generate; tasks.py |
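
The last row of the table can be sketched without any model at all. The snippet below is an illustration under stated assumptions, not the `generate` function in model.py: `cosine_unmask_schedule`, `parallel_decode`, `toy_predict` and the `MASK` sentinel are hypothetical names, and the predictor is a fixed stub that returns a (token, confidence) pair per position. It shows the two ingredients of MaskGIT-style decoding: a cosine schedule for how many positions stay masked after each step, and committing the most confident predictions in parallel.

```python
import math

MASK = -1  # hypothetical sentinel for a still-masked position

def cosine_unmask_schedule(length, steps):
    """Number of positions still masked after each decoding step."""
    remaining = [math.floor(math.cos(0.5 * math.pi * t / steps) * length)
                 for t in range(1, steps + 1)]
    remaining[-1] = 0  # guard against float round-off on the final step
    return remaining

def parallel_decode(predict, length, steps):
    """Start fully masked; each step commits the most confident masked positions."""
    tokens = [MASK] * length
    for still_masked in cosine_unmask_schedule(length, steps):
        preds = predict(tokens)  # one (token, confidence) pair per position
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = len(masked) - still_masked
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens

def toy_predict(tokens):
    """Stub predictor: fixed guesses, with confidence decaying along the sequence."""
    return [(i % 4, 1.0 / (1 + i)) for i in range(len(tokens))]

schedule = cosine_unmask_schedule(16, 8)
decoded = parallel_decode(toy_predict, 16, 8)
```

The cosine shape commits few tokens early, when the context is mostly masks, and many tokens late, when most of the sequence is already fixed. A real decoder would re-run the model on the partially filled sequence at every step, which the stub ignores.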

    pip install -r requirements.txt   # just torch
    python run.py                     # sort task, ~1.6M params, autodetects CUDA

Useful switches:

    python run.py --task reverse        # copy | reverse | sort
    python run.py --grad-tail 1         # pure Jacobian-Free Backprop: O(1) solver memory
    python run.py --solver-steps 24     # deeper fixed-point solve -> smaller residual
    python run.py --no-binding          # ablate HRR binding -> learned positional encoding
    python run.py --dim 512 --heads 8 --length 24 --steps 6000   # scale up on a real GPU

CPU smoke test (no GPU needed, just checks it runs and learns):

    python run.py --steps 250 --batch 64 --dim 96 --length 8 --solver-steps 12 --no-amp

Generation accuracy climbs from chance toward ~1.0 on the synthetic tasks, the
fixed-point residual ||dz||/||z|| shrinks toward 0 across solver steps (the
state settling into its attractor), and the Hopfield energy plateaus as it converges.
On a GPU with the default config the sort task reaches high accuracy in a few
thousand steps; bf16 autocast is on by default for CUDA.
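
For intuition about the two diagnostics named above, here is a small sketch of pure modern-Hopfield retrieval with both quantities logged. It is an assumption-laden toy, not the repository's `hopfield_energy` or solver: `hopfield_step`, `toy_energy`, `retrieve`, the stored patterns and `beta` are all hypothetical, and the energy is the textbook form with constants dropped. The real block also contains an MLP, so its energy is only a diagnostic; in this stripped-down case the update is known to never increase the energy.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hopfield_step(state, patterns, beta):
    """Retrieval update: state <- sum_i softmax(beta * <x_i, state>)_i * x_i."""
    scores = [beta * dot(p, state) for p in patterns]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * p[d] for w, p in zip(weights, patterns))
            for d in range(len(state))]

def toy_energy(state, patterns, beta):
    """E = -(1/beta) * logsumexp(beta * <x_i, state>) + 0.5 * ||state||^2."""
    scores = [beta * dot(p, state) for p in patterns]
    top = max(scores)
    lse = top + math.log(sum(math.exp(s - top) for s in scores))
    return -lse / beta + 0.5 * dot(state, state)

def retrieve(query, patterns, beta, steps):
    """Iterate the update, logging the relative residual and the energy."""
    state = list(query)
    residuals, energies = [], [toy_energy(state, patterns, beta)]
    for _ in range(steps):
        nxt = hopfield_step(state, patterns, beta)
        dz = math.sqrt(sum((a - b) ** 2 for a, b in zip(nxt, state)))
        residuals.append(dz / (math.sqrt(dot(nxt, nxt)) + 1e-12))
        state = nxt
        energies.append(toy_energy(state, patterns, beta))
    return state, residuals, energies

patterns = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [-1.0, 0.5, 0.0]]
query = [0.8, 0.1, 0.3]  # a noisy version of the first stored pattern
state, residuals, energies = retrieve(query, patterns, beta=4.0, steps=12)
```

Printing `residuals` and `energies` side by side shows the pattern described above: the residual falls toward zero while the energy flattens out as the state settles near the first stored pattern.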

- Swap the synthetic task for real data. Replace tasks.make_batch with a character-level masked-denoising loader; the model and decoder already do non-autoregressive generation, so it works unchanged.
- Make the energy real. Right now hopfield_energy is a diagnostic. Replace the block's MLP + attention with an explicit energy whose gradient is the update, so the solve becomes literal energy descent (a proper energy-based model).
- Push the VSA further. Use circular_bind for compositional binding of role/filler pairs (not just position), which is the genuine slot-structured direction rather than positional binding alone.
- Proper implicit gradient. Swap Jacobian-Free Backprop for the full implicit-function-theorem gradient with an Anderson-accelerated solver if you want exact equilibrium gradients.
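
The role/filler direction from the list above can be sketched in a few lines of plain Python. This is an illustration only and not the repository's `circular_bind`: `bind`, `approx_inverse`, `unbind`, the role and filler names, the dimension and the seed are all hypothetical, and the convolution is written as a direct O(n²) sum for readability where an FFT would compute the same result faster. Several role/filler pairs are superposed into one vector, then one filler is recovered by unbinding with its role and picking the closest known filler.

```python
import math
import random

def bind(a, b):
    """Circular convolution: c[k] = sum_j a[j] * b[(k - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]

def approx_inverse(a):
    """HRR involution a*[k] = a[-k mod n], an approximate inverse under bind."""
    n = len(a)
    return [a[(-k) % n] for k in range(n)]

def unbind(memory, role):
    """Circular correlation with the role returns a noisy copy of its filler."""
    return bind(approx_inverse(role), memory)

def random_vector(n, rng):
    """Gaussian entries with variance 1/n, the usual HRR choice."""
    scale = 1.0 / math.sqrt(n)
    return [rng.gauss(0.0, scale) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 512
roles = {name: random_vector(n, rng) for name in ("color", "shape", "size")}
fillers = {name: random_vector(n, rng) for name in ("red", "square", "large")}
pairs = [("color", "red"), ("shape", "square"), ("size", "large")]

# Superpose all bound pairs into a single fixed-width memory vector.
memory = [0.0] * n
for role, filler in pairs:
    bound = bind(roles[role], fillers[filler])
    memory = [m + b for m, b in zip(memory, bound)]

# Query the "shape" slot and clean up against the known fillers.
noisy = unbind(memory, roles["shape"])
recovered = max(fillers,
                key=lambda name: sum(x * y for x, y in zip(noisy, fillers[name])))
```

The unbound vector is only approximately the stored filler, so the final nearest-neighbour clean-up step is part of the method, not an optional extra. The noise grows with the number of superposed pairs and shrinks with the dimension.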