Reproducible setup to run deepreinforce-ai/Ornith-1.0-35B
— a reasoning-focused agentic coding model — fully on a single RTX 4090, served by CUDA
llama.cpp and driven by the Pi coding agent.
The whole Q4_K_M model fits on the 24 GB card and runs at ~210 tok/s generation / ~5200 tok/s prefill.
This repo contains only what's needed to reproduce the setup: scripts, a Dockerfile, configs, and docs. The ~21 GB model, the compiled
llama.cpp tree, and Node/Pi are regenerated by the scripts and are git-ignored.
Ornith-1.0-35B is a Qwen3.5-MoE (qwen35moe 35B.A3B): 34.66B params, 256 experts but only
8 active/token (~3B active), and hybrid linear attention (full attention only every 4th
layer). That last fact makes the KV cache tiny (~20 KB/token), so all 19.7 GiB of Q4_K_M
weights live on the GPU with room left for a large context. The one tuning knob — how many expert
layers to offload to CPU (--n-cpu-moe) — is best left at 0 (everything on GPU).
| --n-cpu-moe | prefill t/s | gen t/s |
|---|---|---|
| 0 | 5219 | 210 |
| 4 | 1994 | 147 |
| 8 | 1267 | 110 |
| 16 | 795 | 73 |

| context | VRAM @ ncmoe=0 |
|---|---|
| 32K | 21.4 GB |
| 64K | 22.0 GB (default) |
| 128K | 23.4 GB (fits, tight) |
| 262K | OOM → needs ncmoe≈6 |
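
As a rough cross-check of the VRAM column, the ~20 KB/token figure can be turned into a KV-cache estimate per context size. This is a back-of-envelope sketch only: the constant is the approximate value quoted in this README, `kv_cache_mb` is an illustrative helper name, and real VRAM use also includes the weights and compute buffers.

```bash
#!/usr/bin/env bash
# Rough KV-cache estimate from the ~20 KB/token figure quoted in this README.
# KB_PER_TOKEN is an approximation, not a measured constant.
KB_PER_TOKEN=20

# Print the estimated KV-cache size in MB (decimal) for a context length in tokens.
kv_cache_mb() {
  local ctx_tokens=$1
  echo $(( ctx_tokens * KB_PER_TOKEN / 1000 ))
}

for ctx in 32768 65536 131072 262144; do
  printf '%7d tokens -> ~%d MB KV cache\n' "$ctx" "$(kv_cache_mb "$ctx")"
done
```

Under this estimate, doubling the context from 64K to 128K adds about 1.3 GB, which is in line with the 22.0 GB to 23.4 GB step in the table.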
Requires an NVIDIA driver ≥ 550 and Docker. One-time host setup installs the NVIDIA Container Toolkit:
sudo ./scripts/00-host-prereqs.sh # installs/registers nvidia-container-toolkit
./scripts/10-download-model.sh # ~21 GB Q4_K_M -> ./models
docker compose up -d --build # compiles llama.cpp from source + installs Pi, then serves
docker compose logs -f # wait for "server is listening" (~18s)
# use it:
curl http://localhost:8090/v1/chat/completions -d '{"messages":[{"role":"user","content":"hi"}]}'
docker exec -it ornith pi-ornith # Pi coding agent (add --128k for 128K context)

Day-to-day commands (start/stop/restart/logs, mounting your code): see docs/docker-quickstart.md.
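
Instead of watching the logs for "server is listening", a script can poll the server until it is ready. The sketch below assumes llama-server's `/health` endpoint, which answers with an error status while the model is still loading; the helper name `wait_for_server` and the 120-second default timeout are illustrative and not part of this repo.

```bash
#!/usr/bin/env bash
# Poll llama-server's /health endpoint until it succeeds or the timeout expires.
# Port 8090 comes from this README; the helper name and timeout are illustrative.
wait_for_server() {
  local url=${1:-http://localhost:8090}
  local timeout_s=${2:-120}
  local waited=0
  # curl -f turns HTTP error statuses (e.g. 503 while loading) into a non-zero exit.
  until curl -sf --max-time 2 "$url/health" >/dev/null; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "server at $url not ready after ${timeout_s}s" >&2
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  echo "server at $url is ready"
}
```

A typical use would be `docker compose up -d --build && wait_for_server` before sending the first request.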
Requires NVIDIA driver + CUDA toolkit (nvcc).
./scripts/10-download-model.sh # model -> ./models
./scripts/20-build-llama-cuda.sh # build llama.cpp (CUDA) -> ./build/llama.cpp
./scripts/30-install-node-pi.sh # Node + Pi -> ./build/node
./scripts/40-configure-pi.sh # write ~/.pi/agent/models.json
./scripts/serve-ornith.sh & # serve on :8090 (arg = context, e.g. 131072)
./scripts/pi-ornith # Pi agent (--128k for 128K)

The Ornith server already listens on 0.0.0.0:8090, so Pi can run on a different box with
no model, GPU, or llama.cpp build — just Node + Pi:
# on the client machine (one-time): install Node + Pi only
./scripts/30-install-node-pi.sh
# talk to the remote server (host / host:port / full url):
./scripts/pi-remote gpu-box # interactive, http://gpu-box:8090
./scripts/pi-remote gpu-box --128k --continue
./scripts/pi-remote http://10.0.0.5:8090 -p "fix the failing test"

pi-remote never starts a local server; it just points Pi at the remote one (writing
~/.pi/agent/models.json for you) and health-checks it first. The endpoint is unauthenticated,
so keep it on a trusted network or tunnel over SSH:
ssh -N -L 8090:localhost:8090 user@gpu-box # then: ./scripts/pi-remote localhost
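
The three target forms that pi-remote accepts (host, host:port, full URL) could be normalized to a base URL along these lines. This is a hypothetical sketch, not the script's actual code: `normalize_server_url` is an invented name, the default port 8090 is taken from this README, and IPv6 literals are not handled.

```bash
#!/usr/bin/env bash
# Hypothetical normalizer for the host / host:port / full-URL forms.
# Not taken from scripts/pi-remote; IPv6 literals are out of scope.
normalize_server_url() {
  local target=$1
  local default_port=8090
  case "$target" in
    http://*|https://*) printf '%s\n' "${target%/}" ;;            # full URL: drop one trailing slash
    *:*)                printf 'http://%s\n' "$target" ;;          # host:port
    *)                  printf 'http://%s:%s\n' "$target" "$default_port" ;;  # bare host
  esac
}

normalize_server_url gpu-box
normalize_server_url 10.0.0.5:9000
normalize_server_url http://10.0.0.5:8090/
```

With the SSH tunnel in place, `localhost` falls into the bare-host case and resolves to the forwarded port.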
.
├── README.md                     # this file
├── docker-compose.yml            # build-from-source image + GPU + model volume
├── scripts/
│   ├── 00-host-prereqs.sh        # nvidia-container-toolkit (Docker path)
│   ├── 10-download-model.sh      # fetch a GGUF quant from HF
│   ├── 20-build-llama-cuda.sh    # build llama.cpp (CUDA, pinned commit)
│   ├── 30-install-node-pi.sh     # Node tarball + pi-coding-agent
│   ├── 40-configure-pi.sh        # install Pi model config
│   ├── serve-ornith.sh           # run llama-server (host)
│   ├── pi-ornith                 # run Pi against the local server (host)
│   └── pi-remote                 # run Pi against a REMOTE server (client-only)
├── docker/
│   ├── Dockerfile.source         # multi-stage: compile llama.cpp + install Pi
│   └── container/
│       ├── serve.sh              # in-container server launcher
│       └── pi-ornith             # in-container Pi launcher
├── config/
│   └── pi-models.json            # Pi model config (ornith 64K + ornith-128k)
└── docs/
    ├── docker-quickstart.md      # start/stop/daily Docker ops
    ├── docker-setup.md           # detailed Docker writeup (build internals)
    ├── baremetal-setup.md        # detailed as-built build + benchmark writeup
    └── PI_SYSTEM_PROMPT.md       # Pi's base system prompt: where it lives, how to override
models/ and build/ are created by the scripts and git-ignored.
Env vars (set in docker-compose.yml, or -e/export for the host scripts):
| Var | Default | Meaning |
|---|---|---|
| ORNITH_CTX | 65536 | context window. 65536≈22 GB · 131072≈23.4 GB · 262144 needs ORNITH_NCMOE>0 |
| ORNITH_NCMOE | 0 | expert layers kept on CPU. 0 = whole model on GPU (fastest) |
| ORNITH_PARALLEL | 1 | concurrent request slots. >1 lets multiple Pi sessions run at once; ORNITH_CTX splits across slots (per-client = CTX/PARALLEL) |
| ORNITH_MODEL_DIR | ./models | host dir holding the GGUF (mounted at /models) |
| ORNITH_SERVER_URL | http://localhost:8090 | server the Pi client points at: host, host:port, or full url (used by 40-configure-pi.sh / pi-remote) |
| LLAMA_COMMIT / CUDA_ARCH | pinned / 89 | build-time pins (CUDA_ARCH 86=Ampere, 89=Ada, 90=Hopper) |
| NODE_VERSION / PI_VERSION | v24.18.0 / 0.80.2 | Node + Pi versions |
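
To make the runtime variables concrete, here is one way they might map onto llama-server flags, together with the per-slot context arithmetic. Treat it as a sketch under assumptions: the real serve-ornith.sh and serve.sh may be written differently, `per_slot_ctx` and `llama_server_cmd` are illustrative names, and `--n-gpu-layers 99` is just a common "offload everything" idiom rather than a value from this repo.

```bash
#!/usr/bin/env bash
# Hypothetical mapping of the ORNITH_* variables onto llama-server flags.
# Defaults mirror the table in this README; the actual launcher scripts may differ.
ORNITH_CTX=${ORNITH_CTX:-65536}
ORNITH_NCMOE=${ORNITH_NCMOE:-0}
ORNITH_PARALLEL=${ORNITH_PARALLEL:-1}
ORNITH_MODEL_DIR=${ORNITH_MODEL_DIR:-./models}

# Context available to each client: total context divided by the slot count.
per_slot_ctx() {
  echo $(( $1 / $2 ))
}

# Print (not run) the llama-server command line implied by the variables.
llama_server_cmd() {
  printf 'llama-server -m %s/ornith-1.0-35b-Q4_K_M.gguf ' "$ORNITH_MODEL_DIR"
  printf -- '--host 0.0.0.0 --port 8090 --n-gpu-layers 99 '
  printf -- '--ctx-size %s --n-cpu-moe %s --parallel %s\n' \
    "$ORNITH_CTX" "$ORNITH_NCMOE" "$ORNITH_PARALLEL"
}

llama_server_cmd
echo "per-client context: $(per_slot_ctx "$ORNITH_CTX" "$ORNITH_PARALLEL") tokens"
```

For example, ORNITH_CTX=131072 with ORNITH_PARALLEL=4 leaves each Pi session a 32768-token window.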
Notes
- Ornith is a reasoning model: chain-of-thought goes to the API `reasoning_content` field, the answer to `content`. Give generous `max_tokens` or `content` comes back empty. (`config/pi-models.json` already sets `reasoning: true`.)
- Only one container/process can hold the full model at full offload on a single 24 GB GPU.
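
The first note can be illustrated with a request that sets `max_tokens` explicitly. The helper below is a minimal sketch: `build_chat_payload` is an invented name, the 4096 default is an arbitrary illustrative budget, and the escaping only covers backslashes and double quotes.

```bash
#!/usr/bin/env bash
# Build a chat request body with an explicit max_tokens budget.
# Escaping is minimal (backslashes and double quotes only); prompts containing
# newlines or control characters need a real JSON encoder.
build_chat_payload() {
  local prompt=$1
  local max_tokens=${2:-4096}   # illustrative default, not a value from this repo
  prompt=${prompt//\\/\\\\}     # escape backslashes first
  prompt=${prompt//\"/\\\"}     # then escape double quotes
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%d}\n' \
    "$prompt" "$max_tokens"
}

# Example (requires the server to be running):
# curl -s http://localhost:8090/v1/chat/completions \
#   -H 'Content-Type: application/json' \
#   -d "$(build_chat_payload 'hi' 4096)"
```

In the response, the answer is expected under `choices[0].message.content` and the chain-of-thought under `choices[0].message.reasoning_content`; if the budget runs out while the model is still reasoning, only the latter is populated.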
| Thing | Version |
|---|---|
| llama.cpp | commit 050ee92d04c2e1f639025786dea701c70e7d4204 |
| Base images | nvidia/cuda:12.3.2-{devel,runtime}-ubuntu22.04 |
| CUDA toolkit (bare metal) | 12.3 (nvcc V12.3.103) |
| NVIDIA driver | 550.144.03 · RTX 4090 (sm_89) |
| NVIDIA Container Toolkit | 1.19.1 |
| Node / Pi | v24.18.0 (LTS) / @earendil-works/pi-coding-agent@0.80.2 |
| Model | ornith-1.0-35b-Q4_K_M.gguf (21,166,757,760 bytes) |
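
The byte count in the Model row gives a cheap way to catch a truncated download before starting the server. The check below is a sketch, not something the repo's scripts are known to do; `check_model_size` is an illustrative helper, and a size match does not replace a checksum.

```bash
#!/usr/bin/env bash
# Compare a file's size with an expected byte count.
EXPECTED_BYTES=21166757760   # from the Model row of the version table

check_model_size() {
  local path=$1
  local expected=${2:-$EXPECTED_BYTES}
  if [ ! -f "$path" ]; then
    echo "missing file: $path" >&2
    return 2
  fi
  local actual
  actual=$(wc -c < "$path" | tr -d '[:space:]')
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $path is $actual bytes"
  else
    echo "size mismatch: $path is $actual bytes, expected $expected" >&2
    return 1
  fi
}

# Example: check_model_size ./models/ornith-1.0-35b-Q4_K_M.gguf
```

The same figure, 21,166,757,760 bytes, is about 19.7 GiB, which matches the weight size quoted earlier in this README.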
See docs/ for the full step-by-step writeups, including every gotcha hit during the original
build (Linux has no prebuilt CUDA llama.cpp; snap Node fails outside /home; the CUDA driver-stub
link fix for the from-source image; etc.).