Orin — Ornith-1.0-35B local coding-agent stack (RTX 4090)

Reproducible setup to run deepreinforce-ai/Ornith-1.0-35B — a reasoning-focused agentic coding model — fully on a single RTX 4090, served by CUDA llama.cpp and driven by the Pi coding agent.

The whole Q4_K_M model fits on the 24 GB card and runs at ~210 tok/s generation / ~5200 tok/s prefill.

This repo contains only what's needed to reproduce the setup: scripts, a Dockerfile, configs, and docs. The ~21 GB (19.7 GiB) model file, the compiled llama.cpp tree, and Node/Pi are regenerated by the scripts and are git-ignored.

Why it fits on 24 GB

Ornith-1.0-35B is a Qwen3.5-MoE (qwen35moe 35B.A3B): 34.66B params, 256 experts but only 8 active/token (~3B active), and hybrid linear attention (full attention only every 4th layer). That last fact makes the KV cache tiny (~20 KB/token), so all 19.7 GiB of Q4_K_M weights live on the GPU with room left for a large context. The one tuning knob, how many expert layers to offload to CPU (--n-cpu-moe), is best left at 0 (everything on GPU) whenever the chosen context fits; in the benchmarks below only the 262K context needs some expert layers on CPU.
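
As a rough sanity check on the cache size, the ~20 KB/token figure can be multiplied out for the context sizes used in this README. This is a minimal sketch, not part of the repo's scripts: `kv_cache_mib` is a hypothetical helper, and it assumes 1 KB = 1024 bytes and a flat per-token cost.

```shell
# Hypothetical helper: estimate KV-cache size in MiB for a given context length.
# Assumes a flat ~20 KiB per token (the approximate figure quoted above).
kv_cache_mib() {
  tokens="$1"
  kib_per_token=20
  echo $(( tokens * kib_per_token / 1024 ))
}

# Print the estimate for each context size mentioned in this README.
for ctx in 32768 65536 131072 262144; do
  echo "$ctx tokens -> ~$(kv_cache_mib "$ctx") MiB KV cache"
done
```

By this estimate, going from 32K to 128K adds about 1.9 GiB of cache, which is in line with the roughly 2 GB VRAM difference reported in the benchmarks.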

Benchmarks (Q4_K_M, RTX 4090, `-ngl 99 -fa on`)

Throughput vs. expert offload:

`--n-cpu-moe`	prefill t/s	gen t/s
0	5219	210
4	1994	147
8	1267	110
16	795	73

VRAM vs. context at `--n-cpu-moe 0`:

context	VRAM @ ncmoe=0
32K	21.4 GB
64K	22.0 GB (default)
128K	23.4 GB (fits, tight)
262K	OOM → needs `ncmoe≈6`

Quick start (Docker — recommended)

Requires an NVIDIA driver ≥ 550 and Docker. One-time host setup installs the NVIDIA Container Toolkit:

sudo ./scripts/00-host-prereqs.sh        # installs/registers nvidia-container-toolkit
./scripts/10-download-model.sh           # ~21 GB Q4_K_M -> ./models
docker compose up -d --build             # compiles llama.cpp from source + installs Pi, then serves
docker compose logs -f                    # wait for "server is listening" (~18s)

# use it:
curl http://localhost:8090/v1/chat/completions -H 'Content-Type: application/json' -d '{"messages":[{"role":"user","content":"hi"}]}'
docker exec -it ornith pi-ornith          # Pi coding agent (add --128k for 128K context)

Day-to-day commands (start/stop/restart/logs, mounting your code) — see docs/docker-quickstart.md.
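
Instead of watching the logs for "server is listening", a script can poll the server until it answers. The sketch below assumes llama-server's `/health` endpoint, which returns an error status while the model is still loading; `wait_for_ornith` is a hypothetical helper, not one of the repo's scripts.

```shell
# Hypothetical helper: poll the server's /health endpoint once per second
# until it responds successfully or the attempt budget runs out.
wait_for_ornith() {
  url="${1:-http://localhost:8090}"
  tries="${2:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors (e.g. while the model loads)
    if curl -sf "$url/health" >/dev/null 2>&1; then
      echo "ready: $url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# Usage:
# docker compose up -d --build && wait_for_ornith http://localhost:8090 120
```

The 60-attempt default is arbitrary; the first `--build` run compiles llama.cpp and will need a far larger budget than a warm restart.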

Quick start (bare metal)

Requires NVIDIA driver + CUDA toolkit (nvcc).

./scripts/10-download-model.sh           # model -> ./models
./scripts/20-build-llama-cuda.sh         # build llama.cpp (CUDA) -> ./build/llama.cpp
./scripts/30-install-node-pi.sh          # Node + Pi -> ./build/node
./scripts/40-configure-pi.sh             # write ~/.pi/agent/models.json
./scripts/serve-ornith.sh &              # serve on :8090  (arg = context, e.g. 131072)
./scripts/pi-ornith                       # Pi agent (--128k for 128K)

Connect from another machine (remote server)

The Ornith server already listens on 0.0.0.0:8090, so Pi can run on a different box with no model, GPU, or llama.cpp build — just Node + Pi:

# on the client machine (one-time): install Node + Pi only
./scripts/30-install-node-pi.sh

# talk to the remote server (host / host:port / full url):
./scripts/pi-remote gpu-box                  # interactive, http://gpu-box:8090
./scripts/pi-remote gpu-box --128k --continue
./scripts/pi-remote http://10.0.0.5:8090 -p "fix the failing test"

pi-remote never starts a local server — it just points Pi at the remote one (writing ~/.pi/agent/models.json for you) and health-checks it first. The endpoint is unauthenticated, so keep it on a trusted network or tunnel over SSH:

ssh -N -L 8090:localhost:8090 user@gpu-box   # then: ./scripts/pi-remote localhost
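
The three accepted target forms (host, host:port, full url) all resolve to a base URL. The sketch below shows one way such normalization could work; it is an illustration under assumptions, not the actual logic in `pi-remote`. `normalize_server_url` is a hypothetical name, and bare IPv6 addresses are not handled.

```shell
# Hypothetical helper: turn "host", "host:port", or a full URL into a base URL.
# Port 8090 is the server default used throughout this README.
normalize_server_url() {
  target="$1"
  default_port=8090
  case "$target" in
    http://*|https://*) printf '%s\n' "${target%/}" ;;            # full URL: drop a trailing slash
    *:*)                printf 'http://%s\n' "$target" ;;         # host:port
    *)                  printf 'http://%s:%s\n' "$target" "$default_port" ;;  # bare host
  esac
}

normalize_server_url gpu-box                # http://gpu-box:8090
normalize_server_url gpu-box:9000           # http://gpu-box:9000
normalize_server_url http://10.0.0.5:8090   # http://10.0.0.5:8090
```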

Repo layout

.
├── README.md                     # this file
├── docker-compose.yml            # build-from-source image + GPU + model volume
├── scripts/
│   ├── 00-host-prereqs.sh        # nvidia-container-toolkit (Docker path)
│   ├── 10-download-model.sh      # fetch a GGUF quant from HF
│   ├── 20-build-llama-cuda.sh    # build llama.cpp (CUDA, pinned commit)
│   ├── 30-install-node-pi.sh     # Node tarball + pi-coding-agent
│   ├── 40-configure-pi.sh        # install Pi model config
│   ├── serve-ornith.sh           # run llama-server (host)
│   ├── pi-ornith                 # run Pi against the local server (host)
│   └── pi-remote                 # run Pi against a REMOTE server (client-only)
├── docker/
│   ├── Dockerfile.source         # multi-stage: compile llama.cpp + install Pi
│   └── container/
│       ├── serve.sh              # in-container server launcher
│       └── pi-ornith             # in-container Pi launcher
├── config/
│   └── pi-models.json            # Pi model config (ornith 64K + ornith-128k)
└── docs/
    ├── docker-quickstart.md      # start/stop/daily Docker ops
    ├── docker-setup.md           # detailed Docker writeup (build internals)
    ├── baremetal-setup.md        # detailed as-built build + benchmark writeup
    └── PI_SYSTEM_PROMPT.md       # Pi's base system prompt: where it lives, how to override

models/ and build/ are created by the scripts and git-ignored.

Configuration

Env vars (set in docker-compose.yml, or -e/export for the host scripts):

Var	Default	Meaning
`ORNITH_CTX`	`65536`	context window. 65536≈22 GB · 131072≈23.4 GB · 262144 needs `ORNITH_NCMOE>0`
`ORNITH_NCMOE`	`0`	expert layers kept on CPU. `0` = whole model on GPU (fastest)
`ORNITH_PARALLEL`	`1`	concurrent request slots. >1 lets multiple Pi sessions run at once; `ORNITH_CTX` splits across slots (per-client = CTX/PARALLEL)
`ORNITH_MODEL_DIR`	`./models`	host dir holding the GGUF (mounted at `/models`)
`ORNITH_SERVER_URL`	`http://localhost:8090`	server the Pi client points at — `host`, `host:port`, or full url (used by `40-configure-pi.sh` / `pi-remote`)
`LLAMA_COMMIT` / `CUDA_ARCH`	pinned / `89`	build-time pins (`CUDA_ARCH` 86=Ampere, 89=Ada, 90=Hopper)
`NODE_VERSION` / `PI_VERSION`	`v24.18.0` / `0.80.2`	Node + Pi versions
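
The `ORNITH_PARALLEL` row implies simple division: each slot gets `ORNITH_CTX / ORNITH_PARALLEL` tokens. A small sketch of that arithmetic follows, with `per_slot_ctx` as a hypothetical helper that is not part of the repo.

```shell
# Hypothetical helper: context available to each client when slots share the window.
# Falls back to the environment variables, then to the documented defaults.
per_slot_ctx() {
  ctx="${1:-${ORNITH_CTX:-65536}}"
  slots="${2:-${ORNITH_PARALLEL:-1}}"
  echo $(( ctx / slots ))
}

per_slot_ctx 65536 1     # one session gets the whole 64K window
per_slot_ctx 131072 2    # two sessions get 64K each
per_slot_ctx 131072 4    # four sessions get 32K each
```

So keeping 64K per session with two concurrent Pi sessions means raising `ORNITH_CTX` to 131072, which the table above puts at about 23.4 GB.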

Notes

Ornith is a reasoning model: chain-of-thought goes to the API reasoning_content field, the answer to content. Reasoning tokens count against max_tokens, so give a generous budget or content comes back empty. (config/pi-models.json already sets reasoning: true.)
Only one container/process can hold the full model at full offload on a single 24 GB GPU.
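
To make the max_tokens advice concrete, here is a hedged sketch of a request with an explicit token budget. `chat_body` and `ask_ornith` are hypothetical helpers, the 4096 default is an arbitrary illustration rather than a recommended value, and `ORNITH_SERVER_URL` is assumed to hold a full URL here.

```shell
# Hypothetical helper: build a chat request body with an explicit token budget.
# Escapes only backslashes and double quotes; multi-line prompts would need
# a real JSON encoder.
chat_body() {
  prompt=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%s}\n' \
    "$prompt" "${2:-4096}"
}

# Hypothetical helper: send the request to the OpenAI-compatible endpoint.
ask_ornith() {
  curl -s "${ORNITH_SERVER_URL:-http://localhost:8090}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "$(chat_body "$1" "${2:-4096}")"
}

# Usage (assumes jq is installed):
# ask_ornith "hi" 8192 | jq -r '.choices[0].message.content'
# ask_ornith "hi" 8192 | jq -r '.choices[0].message.reasoning_content'
```

If `content` is empty while `reasoning_content` is long, the budget was most likely spent on reasoning and should be raised.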

Pinned versions (reference build)

Thing	Version
llama.cpp	commit `050ee92d04c2e1f639025786dea701c70e7d4204`
Base images	`nvidia/cuda:12.3.2-{devel,runtime}-ubuntu22.04`
CUDA toolkit (bare metal)	12.3 (nvcc V12.3.103)
NVIDIA driver	550.144.03 · RTX 4090 (sm_89)
NVIDIA Container Toolkit	1.19.1
Node / Pi	v24.18.0 (LTS) / `@earendil-works/pi-coding-agent@0.80.2`
Model	`ornith-1.0-35b-Q4_K_M.gguf` (21,166,757,760 bytes)
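
The pinned byte count gives a quick way to spot a truncated download. The sketch below is an assumption-laden illustration: `check_size` is a hypothetical helper, the path assumes the default `./models` directory, and a size match is not a checksum, so it only catches incomplete files.

```shell
# Hypothetical helper: compare a file's size in bytes against an expected value.
check_size() {
  file="$1"
  expected="$2"
  actual=$(( $(wc -c < "$file") ))   # arithmetic expansion strips any padding from wc
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $file is $actual bytes"
  else
    echo "size mismatch: $file is $actual bytes, expected $expected" >&2
    return 1
  fi
}

# Usage:
# check_size ./models/ornith-1.0-35b-Q4_K_M.gguf 21166757760
```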

See docs/ for the full step-by-step writeups, including every gotcha hit during the original build (Linux has no prebuilt CUDA llama.cpp; snap Node fails outside /home; the CUDA driver-stub link fix for the from-source image; etc.).
