Equilibrium Associative Memory (EAM)

A small, runnable seed of the "GPU-native, founded-by-physicists" architecture from our conversation. It is dense matmul + FFT + softmax throughout, i.e. exactly what a GPU wants, and it trains end to end with one weight-tied block solved to a fixed point.

Honest scope: this is a DEQ-transformer with holographic positional binding and parallel denoising, not a frontier system. The point is pedagogical and architectural: the ordinary transformer falls out of this as the special case where you (a) run the block as a fixed-depth feedforward pass instead of solving it to equilibrium and (b) decode autoregressively instead of in parallel. EAM is the more general object; the transformer is one corner of it.

Want the scale-ready version? See SCALING.md and scaled_model.py / scaled_train.py: a modern transformer (RMSNorm, SwiGLU, RoPE, QK-norm, GQA, sliding-window, optional MoE) with weight-tied recurrent depth as a test-time-compute knob in place of a literal fixed point, plus an honest per-pillar verdict on which of these four moves actually survive at scale.

Does that recurrent-depth knob actually work? See FINDINGS.md for a seven-experiment investigation (exp*.py, figures in figures/). It covers three things: how the dial matches a 16-layer transformer in-distribution at 1/16 the parameters and beats it out-of-distribution, where it breaks (latent drift on chaotic dynamics), and the re-grounding fix that gives perfect extrapolation to 32× the trained horizon.

The four moves, and where each lives in the code

Thesis move	Concrete mechanism	Code
Equilibrium / implicit depth	one weight-tied block iterated to a fixed point (Deep Equilibrium model); last `grad_tail` steps carry gradient (Jacobian-Free Backprop), so training memory is independent of solver depth	`model.py: EquilibriumAssocMemory.solve`
Associative memory == attention	`softmax(QK^T/sqrt d)V` is the modern Hopfield retrieval rule; the block exposes the Hopfield energy as a diagnostic	`model.py: AssocBlock._attn`, `hopfield_energy`
Vector-symbolic representation	position encoded by HRR circular-convolution binding of a role vector into each token, not additive positional encoding	`model.py: circular_bind`, `inject`
Non-autoregressive generation	trained as a masked denoiser; decoded by parallel iterative unmasking on a cosine schedule (MaskGIT-style)	`model.py: generate`; `tasks.py`
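
The cosine schedule in the last row is easy to misread, so here is a minimal standalone sketch of MaskGIT-style parallel unmasking. It is not the code in `model.py: generate`. The `MASK` id, the `predict` callback, and the oracle `toy_predict` are hypothetical stand-ins, and this version never re-masks a committed token, which the real decoder may or may not do.

```python
import math
import random

MASK = -1  # hypothetical mask token id, for illustration only


def masked_after_step(length, steps):
    """Number of positions still masked after each of `steps` decoding passes.

    Cosine schedule: the masked fraction follows cos(pi/2 * t/steps), so few
    tokens are committed early and many are committed late.
    """
    counts = []
    for t in range(1, steps + 1):
        frac = math.cos(0.5 * math.pi * t / steps)
        counts.append(int(math.floor(length * frac)))
    counts[-1] = 0  # guard against rounding: the last pass commits everything
    return counts


def parallel_unmask(predict, length, steps):
    """Iteratively fill a fully masked sequence.

    `predict(seq)` stands in for the denoiser: it returns one
    (token, confidence) pair for every position in `seq`.
    """
    seq = [MASK] * length
    for n_masked in masked_after_step(length, steps):
        preds = predict(seq)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        # Rank the still-masked positions by confidence, most confident first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = len(masked) - n_masked
        for i in masked[:n_commit]:
            seq[i] = preds[i][0]
    return seq


# Toy usage: oracle tokens with random confidences. A real denoiser would
# return its argmax tokens and their softmax probabilities instead.
target = sorted(random.Random(0).sample(range(100), 12))
rng = random.Random(1)


def toy_predict(seq):
    return [(target[i], rng.random()) for i in range(len(seq))]


print(masked_after_step(12, 4))                     # [11, 8, 4, 0]
print(parallel_unmask(toy_predict, 12, 4) == target)  # True
```

With 12 positions and 4 passes the decoder commits 1, 3, 4, and 4 tokens: the early passes fix only the positions the model is surest about, and later passes fill the rest with more context available.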

Run it

pip install -r requirements.txt          # just torch
python run.py                            # sort task, ~1.6M params, autodetects CUDA

Useful switches:

python run.py --task reverse             # copy | reverse | sort
python run.py --grad-tail 1              # pure Jacobian-Free Backprop: O(1) solver memory
python run.py --solver-steps 24          # deeper fixed-point solve -> smaller residual
python run.py --no-binding               # ablate HRR binding -> learned positional encoding
python run.py --dim 512 --heads 8 --length 24 --steps 6000   # scale up on a real GPU
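
To see what `--grad-tail` trades away, here is a hedged NumPy sketch on a toy contraction `z <- tanh(W z + x)`, which stands in for the real block. Everything in it (`W`, `x`, `g`, `solve`, `tail_gradient`) is an assumption made for illustration and none of it comes from `model.py`. Backpropagating through only the last k solver steps gives a k-term truncated Neumann series of the exact implicit-function-theorem gradient; k = 1 is pure Jacobian-Free Backprop.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32
W = rng.standard_normal((dim, dim))
W *= 0.5 / np.linalg.norm(W, 2)   # spectral norm 0.5, so the map is a contraction
x = rng.standard_normal(dim)
g = rng.standard_normal(dim)      # dL/dz* for a toy loss L = g . z*


def solve(x, steps=200):
    """Fixed-point iteration z <- tanh(W z + x)."""
    z = np.zeros(dim)
    for _ in range(steps):
        z = np.tanh(W @ z + x)
    return z


z_star = solve(x)
D = 1.0 - z_star**2               # diagonal of tanh' at the fixed point

# Exact implicit gradient: dL/dx = D (I - W^T D)^{-1} g.
exact = D * np.linalg.solve(np.eye(dim) - W.T * D, g)


def tail_gradient(k):
    """Gradient obtained by backpropagating through only the last k steps."""
    acc, v = np.zeros(dim), g.copy()
    for _ in range(k):
        acc += v                  # accumulates sum_{i<k} (W^T D)^i g
        v = W.T @ (D * v)
    return D * acc


for k in (1, 2, 4, 8, 16):
    err = np.linalg.norm(tail_gradient(k) - exact) / np.linalg.norm(exact)
    print(f"tail={k:2d}  relative error={err:.2e}")
```

The error shrinks geometrically with the tail length at a rate set by the contraction factor, while the memory cost grows only with the tail, not with the number of solver steps.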

CPU smoke test (no GPU needed, just checks it runs and learns):

python run.py --steps 250 --batch 64 --dim 96 --length 8 --solver-steps 12 --no-amp

What you should see

Generation accuracy climbs from chance toward ~1.0 on the synthetic tasks, the fixed-point residual ||dz||/||z|| shrinks toward 0 across solver steps (the state settling into its attractor), and the Hopfield energy plateaus as it converges. On a GPU with the default config the sort task reaches high accuracy in a few thousand steps; bf16 autocast is on by default for CUDA.
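
The residual and energy behaviour can be reproduced in isolation. The sketch below is a standalone NumPy toy, not the repo's `solve` or `hopfield_energy`: it assumes a single query, no learned projections, and stored patterns `X` that serve as both keys and values. The names `energy`, `retrieve`, and `beta` are illustrative choices.

```python
import numpy as np


def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()


def energy(z, X, beta):
    """E(z) = -(1/beta) * logsumexp(beta * X z) + 0.5 * ||z||^2, up to a constant."""
    a = beta * (X @ z)
    lse = a.max() + np.log(np.exp(a - a.max()).sum())
    return -lse / beta + 0.5 * (z @ z)


def retrieve(z, X, beta, steps=8):
    """Iterate z <- X^T softmax(beta * X z), logging residual and energy."""
    residuals, energies = [], [energy(z, X, beta)]
    for _ in range(steps):
        z_new = X.T @ softmax(beta * (X @ z))
        residuals.append(np.linalg.norm(z_new - z) / np.linalg.norm(z_new))
        energies.append(energy(z_new, X, beta))
        z = z_new
    return z, residuals, energies


rng = np.random.default_rng(0)
dim, n_patterns, beta = 64, 16, 16.0
X = rng.standard_normal((n_patterns, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm stored patterns
noise = rng.standard_normal(dim)
query = X[3] + 0.3 * noise / np.linalg.norm(noise)   # corrupted copy of pattern 3

z, residuals, energies = retrieve(query, X, beta)
for r, e in zip(residuals, energies[1:]):
    print(f"residual={r:.2e}  energy={e:.4f}")
print("retrieved pattern:", int(np.argmax(X @ z)))
```

In this toy the update is exactly attention with the state as the query, the energy never increases from one step to the next, and the residual collapses within a few iterations because the patterns are well separated. The full model's residual curve will be slower and noisier.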

Things worth trying next

Swap the synthetic task for real data. Replace tasks.make_batch with a character-level masked-denoising loader; the model and decoder already do non-autoregressive generation, so it works unchanged.
Make the energy real. Right now hopfield_energy is only a diagnostic. Replace the block's MLP + attention with an explicit energy whose negative gradient is the update, so the solve becomes literal energy descent (a proper energy-based model).
Push the VSA further. Use circular_bind for compositional binding of role/filler pairs (not just position), which is the genuine slot-structured direction rather than positional binding alone.
Proper implicit gradient. Swap Jacobian-Free Backprop for the full implicit-function-theorem gradient with an Anderson-accelerated solver if you want exact equilibrium gradients.
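
For the role/filler direction, a minimal sketch of what compositional binding could look like, assuming NumPy and unitary role vectors (all Fourier coefficients of magnitude 1, which makes unbinding a single pair exact). `hrr_bind`, `hrr_unbind`, `unitary_role`, and the codebook are hypothetical helpers written for this example; the repo's `circular_bind` may differ in signature and normalization.

```python
import numpy as np


def hrr_bind(a, b):
    """Circular convolution along the last axis, computed with the FFT."""
    n = a.shape[-1]
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=n)


def hrr_unbind(s, role):
    """Circular correlation with `role`, the inverse of binding for unitary roles."""
    n = s.shape[-1]
    return np.fft.irfft(np.fft.rfft(s) * np.conj(np.fft.rfft(role)), n=n)


def unitary_role(dim, rng):
    """Random real vector whose Fourier coefficients all have magnitude 1."""
    phases = rng.uniform(-np.pi, np.pi, dim // 2 + 1)
    phases[0] = 0.0           # the DC coefficient must be real
    if dim % 2 == 0:
        phases[-1] = 0.0      # so must the Nyquist coefficient for even dim
    return np.fft.irfft(np.exp(1j * phases), n=dim)


rng = np.random.default_rng(0)
dim, n_fillers = 512, 20
codebook = rng.standard_normal((n_fillers, dim)) / np.sqrt(dim)   # filler vectors
roles = np.stack([unitary_role(dim, rng) for _ in range(3)])      # three slots
filler_ids = [4, 11, 17]

# One structured vector: the superposition of three bound role/filler pairs.
structure = hrr_bind(roles, codebook[filler_ids]).sum(axis=0)

# Query each slot, then clean up by nearest dot product against the codebook.
decoded = [int(np.argmax(codebook @ hrr_unbind(structure, r))) for r in roles]
print(decoded)
```

Unbinding one slot from the superposition returns the right filler plus crosstalk from the other two pairs, which is why the cleanup step against the codebook is needed. Attention over a codebook is one natural cleanup memory, which ties this back to the associative-memory move.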
