tomdif/equilibrium-assoc-memory

Equilibrium Associative Memory (EAM)

A small, runnable seed of the "GPU-native, founded-by-physicists" architecture from our conversation. It is dense matmul + FFT + softmax throughout, i.e. exactly what a GPU wants, and it trains end to end with one weight-tied block solved to a fixed point.

Honest scope: this is a DEQ-transformer with holographic positional binding and parallel denoising, not a frontier system. The point is pedagogical and architectural: the ordinary transformer falls out of this as the special case where you (a) run the block as a fixed feedforward pass instead of solving it to equilibrium and (b) decode autoregressively instead of in parallel. EAM is the more general object; the transformer is one corner of it.

Want the scale-ready version? See SCALING.md and scaled_model.py / scaled_train.py: a modern transformer (RMSNorm, SwiGLU, RoPE, QK-norm, GQA, sliding-window, optional MoE) with weight-tied recurrent depth as a test-time-compute knob in place of a literal fixed point, plus an honest per-pillar verdict on which of these four moves actually survive at scale.

Does that recurrent-depth knob actually work? See FINDINGS.md for a seven-experiment investigation (exp*.py, figures in figures/). It reports that the dial matches a 16-layer transformer in-distribution at 1/16 the parameters and beats it out-of-distribution, shows where it breaks (latent drift on chaotic dynamics), and describes the re-grounding fix that gives perfect extrapolation to 32× the trained horizon.

The four moves, and where each lives in the code

| Thesis move | Concrete mechanism | Code |
| --- | --- | --- |
| Equilibrium / implicit depth | one weight-tied block iterated to a fixed point (Deep Equilibrium model); last grad_tail steps carry gradient (Jacobian-Free Backprop), so training memory is independent of solver depth | model.py: EquilibriumAssocMemory.solve |
| Associative memory == attention | softmax(QK^T/sqrt d)V is the modern Hopfield retrieval rule; the block exposes the Hopfield energy as a diagnostic | model.py: AssocBlock._attn, hopfield_energy |
| Vector-symbolic representation | position encoded by HRR circular-convolution binding of a role vector into each token, not additive positional encoding | model.py: circular_bind, inject |
| Non-autoregressive generation | trained as a masked denoiser; decoded by parallel iterative unmasking on a cosine schedule (MaskGIT-style) | model.py: generate; tasks.py |
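
To make the last row concrete, here is a minimal, dependency-free sketch of parallel iterative unmasking on a cosine schedule. It is not the repo's `generate`: the names `masked_count`, `parallel_unmask`, and `toy_predict` are hypothetical, and the "model" is a stand-in that returns a fixed guess and confidence per position. The real decoder may differ in details such as sampling or tie-breaking.

```python
import math

MASK = None  # placeholder for a masked position (assumption for this sketch)


def masked_count(step, total_steps, length):
    """Positions left masked after `step` (1-indexed) of `total_steps`."""
    ratio = math.cos(0.5 * math.pi * step / total_steps)
    return int(math.floor(ratio * length))


def parallel_unmask(predict, length, total_steps):
    """Start fully masked; each step commits the most confident guesses."""
    tokens = [MASK] * length
    for step in range(1, total_steps + 1):
        guesses = predict(tokens)  # (token, confidence) for every position
        keep_masked = masked_count(step, total_steps, length)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Most confident masked positions first.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        n_commit = max(0, len(masked) - keep_masked)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


# Toy stand-in for the denoiser: always guesses the target, with a
# confidence that decays with position (purely illustrative).
TARGET = [1, 1, 2, 3, 4, 5, 6, 9]


def toy_predict(tokens):
    return [(TARGET[i], 1.0 / (1 + i)) for i in range(len(tokens))]


if __name__ == "__main__":
    print([masked_count(s, 4, 8) for s in range(1, 5)])
    print(parallel_unmask(toy_predict, length=8, total_steps=4))
```

The schedule unmasks few tokens early, when the model has little context, and many late, when most of the sequence is already filled in.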

Run it

pip install -r requirements.txt          # just torch
python run.py                            # sort task, ~1.6M params, autodetects CUDA

Useful switches:

python run.py --task reverse             # copy | reverse | sort
python run.py --grad-tail 1              # pure Jacobian-Free Backprop: O(1) solver memory
python run.py --solver-steps 24          # deeper fixed-point solve -> smaller residual
python run.py --no-binding               # ablate HRR binding -> learned positional encoding
python run.py --dim 512 --heads 8 --length 24 --steps 6000   # scale up on a real GPU
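
What `--grad-tail` trades off can be seen in a scalar toy, sketched below under loud assumptions: the block is replaced by a hypothetical `f(z, w, x) = tanh(w*z + x)`, and the derivative is tracked by hand instead of by autograd. Running the solver without gradient and then differentiating only through the last k steps yields a truncated Neumann series; k = 1 is pure Jacobian-Free Backprop, and larger k approaches the exact implicit-function-theorem gradient. None of these function names come from the repo.

```python
import math


def f(z, w, x):
    """Toy weight-tied block (hypothetical stand-in for the real one)."""
    return math.tanh(w * z + x)


def solve(w, x, steps=200):
    """Iterate to the fixed point z* = f(z*, w, x); no gradient tracked."""
    z = 0.0
    for _ in range(steps):
        z = f(z, w, x)
    return z


def tail_gradient(w, x, grad_tail, steps=200):
    """dz/dw obtained by differentiating only the last `grad_tail` steps."""
    z = solve(w, x, steps)  # gradient-free part of the solve
    dz_dw = 0.0             # the state entering the tail is treated as constant
    for _ in range(grad_tail):
        z_next = f(z, w, x)
        slope = 1.0 - z_next * z_next      # derivative of tanh
        dz_dw = slope * (z + w * dz_dw)    # chain rule through one step
        z = z_next
    return dz_dw


def implicit_gradient(w, x, steps=200):
    """Exact dz*/dw from the implicit function theorem (scalar case)."""
    z = solve(w, x, steps)
    slope = 1.0 - z * z
    return slope * z / (1.0 - slope * w)


if __name__ == "__main__":
    w, x = 0.5, 1.0
    exact = implicit_gradient(w, x)
    for k in (1, 2, 4, 8):
        print(k, abs(tail_gradient(w, x, k) - exact))
```

The tail loop stores only `grad_tail` steps of history regardless of how long `solve` ran, which is the sense in which memory is independent of solver depth.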

CPU smoke test (no GPU needed, just checks it runs and learns):

python run.py --steps 250 --batch 64 --dim 96 --length 8 --solver-steps 12 --no-amp

What you should see

Generation accuracy climbs from chance toward ~1.0 on the synthetic tasks, the fixed-point residual ||dz||/||z|| shrinks toward 0 across solver steps (the state settling into its attractor), and the Hopfield energy plateaus as it converges. On a GPU with the default config the sort task reaches high accuracy in a few thousand steps; bf16 autocast is on by default for CUDA.
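
The residual and energy behaviour described above can be reproduced in miniature. The sketch below is a toy, not the repo's block: it iterates the plain modern-Hopfield retrieval update z <- X softmax(beta * X^T z) on three stored patterns and records the relative residual ||dz||/||z|| and the Hopfield energy -lse(beta, X^T z) + 0.5 * z.z at each step. All names (`hopfield_step`, `energy`, `solve`) and the choice of patterns and beta are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_step(patterns, z, beta):
    """One retrieval update: softmax-weighted average of stored patterns."""
    weights = softmax([beta * dot(p, z) for p in patterns])
    return [sum(w * p[d] for w, p in zip(weights, patterns))
            for d in range(len(z))]


def energy(patterns, z, beta):
    """Modern Hopfield energy, up to an additive constant."""
    scores = [beta * dot(p, z) for p in patterns]
    m = max(scores)
    lse = (m + math.log(sum(math.exp(s - m) for s in scores))) / beta
    return -lse + 0.5 * dot(z, z)


def solve(patterns, z, beta, steps):
    """Iterate the update, logging (relative residual, energy) per step."""
    history = []
    for _ in range(steps):
        z_next = hopfield_step(patterns, z, beta)
        dz = [a - b for a, b in zip(z_next, z)]
        history.append((norm(dz) / norm(z_next), energy(patterns, z_next, beta)))
        z = z_next
    return z, history


if __name__ == "__main__":
    patterns = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    z, history = solve(patterns, [0.9, 0.2, 0.1, 0.0], beta=4.0, steps=24)
    for step, (residual, e) in enumerate(history, start=1):
        print(f"step {step:2d}  residual {residual:.3e}  energy {e:.6f}")
```

In this toy the energy never increases from one step to the next and the residual decays geometrically. The real block adds an MLP and learned projections, so its curves need not be this clean.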

Things worth trying next

  • Swap the synthetic task for real data. Replace tasks.make_batch with a character-level masked-denoising loader; the model and decoder already do non-autoregressive generation, so it works unchanged.
  • Make the energy real. Right now hopfield_energy is a diagnostic. Replace the block's MLP + attention with an explicit energy whose gradient is the update, so the solve becomes literal energy descent (a proper energy-based model).
  • Push the VSA further. Use circular_bind for compositional binding of role/filler pairs (not just position), which is the genuine slot-structured direction rather than positional binding alone.
  • Proper implicit gradient. Swap Jacobian-Free Backprop for the full implicit-function-theorem gradient with an Anderson-accelerated solver if you want exact equilibrium gradients.
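
For the role/filler item above, a minimal sketch of what compositional HRR binding looks like, in plain Python so it runs without torch. It uses a direct O(n^2) circular convolution, whereas the repo's `circular_bind` presumably goes through the FFT; its actual signature is not shown here, so the names below (`bind`, `unbind`, `cleanup`, `demo`) are hypothetical. Two role/filler pairs are superposed into one trace, and each filler is recovered by unbinding with its role and snapping to the nearest known filler.

```python
import math
import random


def bind(a, b):
    """Circular convolution: (a * b)[i] = sum_j a[j] * b[(i - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def involution(a):
    """Approximate inverse under circular convolution: a~[i] = a[-i mod n]."""
    n = len(a)
    return [a[(-i) % n] for i in range(n)]


def unbind(trace, role):
    """Circular correlation of the trace with the role vector."""
    return bind(trace, involution(role))


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def cleanup(noisy, codebook):
    """Name of the codebook vector closest to the noisy estimate."""
    return max(codebook, key=lambda name: cosine(noisy, codebook[name]))


def demo(n=256, seed=0):
    rng = random.Random(seed)

    def rand_vec():
        # Entries ~ N(0, 1/n), the usual HRR choice.
        return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]

    roles = {name: rand_vec() for name in ("color", "shape")}
    fillers = {name: rand_vec() for name in ("red", "blue", "circle", "square")}
    # Superpose two bound pairs into a single vector.
    trace = [a + b for a, b in zip(bind(roles["color"], fillers["red"]),
                                   bind(roles["shape"], fillers["circle"]))]
    return {role: cleanup(unbind(trace, vec), fillers)
            for role, vec in roles.items()}


if __name__ == "__main__":
    print(demo())
```

Unbinding with random roles is only approximate, which is why the cleanup step against a codebook is part of the sketch. With a one-hot role, binding reduces to a cyclic shift and unbinding is exact.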

About

GPU-native equilibrium associative memory: a DEQ-transformer with modern-Hopfield attention, HRR positional binding, and non-autoregressive MaskGIT decoding. Dense matmul + FFT + softmax throughout.
