Toy-GPT: train-401-context-3-llm-glossary

Demonstrates, at very small scale, how a language model is trained.

This repository is part of a series of toy training repositories plus a companion client repository:

Training repositories produce pretrained artifacts (vocabulary, weights, metadata).
A web app loads the artifacts and provides an interactive prompt.

This repository makes the sparsity problem visible and measurable. The full context-3 weight file for the llm_glossary corpus is about 428 MB. It can be generated and inspected, but it is not committed because storing about 1.7 million mostly-zero rows in a git repository serves no practical purpose. Running the training script locally generates it in seconds and makes the point quite well.
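To make "mostly zero" concrete, the sketch below counts how many distinct 3-token contexts actually occur in a corpus and compares that with the number of rows a full lookup table would allocate. The corpus string, the whitespace tokenizer, and the function name `context_coverage` are illustrative assumptions, not the repository's code.

```python
def context_coverage(tokens: list[str], context_size: int = 3) -> tuple[int, int]:
    """Return (observed_contexts, possible_contexts) for a token sequence.

    Only contexts that are followed by a next token are counted,
    because those are the only rows training could ever update.
    """
    vocab = set(tokens)
    observed = {
        tuple(tokens[i : i + context_size])
        for i in range(len(tokens) - context_size)
    }
    possible = len(vocab) ** context_size
    return len(observed), possible


# Hypothetical toy corpus, split on whitespace (an assumption for illustration).
corpus = "the cat sat on the mat the dog sat on the rug"
tokens = corpus.split()

observed, possible = context_coverage(tokens)
print(f"observed contexts: {observed}")
print(f"possible contexts: {possible}")
print(f"fraction of rows that can be non-zero: {observed / possible:.4f}")
```

Even on a corpus this small, only a small fraction of the possible rows can ever receive a non-zero value; the gap widens quickly as the vocabulary grows.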

Combinatorial Explosion

With vocabulary size V and a context window of W tokens, the weight matrix has V^W rows, one for each possible context. This combinatorial explosion is exactly why real language models use embeddings and attention instead of explicit lookup tables.

The smaller corpora (cat/dog, animals) are used for committed artifacts precisely because their vocabularies are tiny.
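The V^W row count can be checked with a few lines of arithmetic. This sketch uses the approximate vocabulary sizes reported for this series (about 20 tokens for cat/dog, about 119 for llm_glossary); the function name `table_rows` is an assumption for illustration.

```python
def table_rows(vocab_size: int, context_size: int) -> int:
    """Number of rows in an explicit lookup table: one per possible context."""
    return vocab_size ** context_size


# (corpus name, approximate vocabulary size V, context window W)
configs = [
    ("cat/dog", 20, 3),
    ("llm_glossary", 119, 2),
    ("llm_glossary", 119, 3),
]

for corpus, v, w in configs:
    print(f"{corpus:<14} V={v:<4} W={w}  rows={table_rows(v, w):,}")
```

Going from context-2 to context-3 on the same vocabulary multiplies the row count by V, which is why one extra token of context turns 14,161 rows into 1,685,159.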

Scope

This is:

an intentionally inspectable training pipeline
a next-token predictor trained on an explicit corpus
a demonstration of why naive context-window scaling fails at non-trivial vocabulary sizes

This is not:

a production system
a full Transformer implementation
a chat interface
a claim of semantic understanding

Outputs

Training runs successfully and produces all artifacts locally. Only artifacts/00_meta.json and artifacts/01_vocabulary.csv are committed, as they are small and sufficient to inspect vocabulary and model metadata.

Training logs and evidence are written under outputs/ (for example, outputs/train_log.csv).
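The two committed artifacts can be inspected without running any training. The sketch below assumes only that 00_meta.json holds a JSON object and that 01_vocabulary.csv has a header row; neither schema is documented here, so it reports top-level keys and column names rather than specific fields. The function name `summarize_artifacts` is hypothetical.

```python
import csv
import json
from pathlib import Path


def summarize_artifacts(artifacts_dir: str = "artifacts") -> dict:
    """Summarize the committed metadata and vocabulary files.

    Assumes a JSON object in 00_meta.json and a header row in
    01_vocabulary.csv; no specific field names are relied on.
    """
    base = Path(artifacts_dir)

    with (base / "00_meta.json").open(encoding="utf-8") as f:
        meta = json.load(f)

    with (base / "01_vocabulary.csv").open(encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))

    return {
        "meta_keys": sorted(meta) if isinstance(meta, dict) else [],
        "vocab_header": rows[0] if rows else [],
        "vocab_rows": max(len(rows) - 1, 0),  # exclude the header row
    }
```

Calling `summarize_artifacts()` from the repository root prints nothing by itself; wrap it in `print(...)` to see the metadata keys and the vocabulary row count.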

Quick Start

See SETUP.md for full setup and workflow instructions.

Run the full training script:

uv run python src/toy_gpt_train/d_train.py

Run individually:

a/b/c are demos (can be run alone if desired)
d_train produces artifacts
e_infer consumes artifacts

uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py

Command Reference

The commands below are used in the workflow guide above. They are provided here for convenience.

Follow the guide for the full instructions.

In a machine terminal (open in your `Repos` folder)

After you get a copy of this repo in your own GitHub account, open a machine terminal in your Repos folder:

# Replace username with YOUR GitHub username.
git clone https://github.com/username/train-401-context-3-llm-glossary

cd train-401-context-3-llm-glossary
code .

In a VS Code terminal

uv self update
uv python pin 3.14
uv sync --extra dev --extra docs --upgrade

uvx pre-commit install
git add -A
uvx pre-commit run --all-files

# run Python

uv run ruff format .
uv run ruff check . --fix
uv run zensical build

git add -A
git commit -m "update"
git push -u origin main

Provenance and Purpose

The primary corpus used for training is declared in SE_MANIFEST.toml.

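Repositories in this series commit pretrained artifacts so the client can run without retraining. This repository commits only the vocabulary and metadata, because its weight file is too large to store in git.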

Annotations

ANNOTATIONS.md - REQ/WHY/OBS annotations used

Citation

CITATION.cff

License

MIT

SE Manifest

SE_MANIFEST.toml - project intent, scope, and role

| Corpus | Vocab size | Model | Weight matrix rows | Approx. size |
| --- | --- | --- | --- | --- |
| cat/dog | ~20 tokens | context-3 | 20³ = 8,000 | ~3 MB |
| llm_glossary | ~119 tokens | context-2 | 119² = 14,161 | ~10 MB |
| llm_glossary | ~119 tokens | context-3 | 119³ = 1,685,159 | ~428 MB |
