hintjen/ornith-docker-pi

Orin — Ornith-1.0-35B local coding-agent stack (RTX 4090)

Reproducible setup to run deepreinforce-ai/Ornith-1.0-35B — a reasoning-focused agentic coding model — fully on a single RTX 4090, served by CUDA llama.cpp and driven by the Pi coding agent.

The whole Q4_K_M model fits on the 24 GB card and runs at ~210 tok/s generation / ~5200 tok/s prefill.

This repo contains only what's needed to reproduce the setup: scripts, a Dockerfile, configs, and docs. The ~21 GB (19.7 GiB) model, the compiled llama.cpp tree, and Node/Pi are regenerated by the scripts and are git-ignored.


Why it fits on 24 GB

Ornith-1.0-35B is a Qwen3.5-MoE (qwen35moe 35B.A3B): 34.66B params, 256 experts but only 8 active/token (~3B active), and hybrid linear attention (full attention only every 4th layer). That last fact makes the KV cache tiny (~20 KB/token), so all 19.7 GiB of Q4_K_M weights live on the GPU with room left for a large context. The one tuning knob — how many expert layers to offload to CPU (--n-cpu-moe) — is best left at 0 (everything on GPU).
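
As a rough sanity check on the figures above, the KV cache can be estimated as context length times ~20 KB per token. The sketch below is illustrative arithmetic only: `kv_cache_mib` is a hypothetical helper (not part of this repo), it treats 20 KB as 20 KiB, and it ignores compute buffers and other runtime overhead.

```shell
# Hypothetical helper: estimate KV-cache size in MiB for a given context.
# Assumes ~20 KiB per token (the approximate figure quoted above); real
# VRAM use also includes compute buffers, so this is only a rough floor.
kv_cache_mib() {
  ctx_tokens=$1
  kib_per_token=${2:-20}
  echo $(( ctx_tokens * kib_per_token / 1024 ))
}

kv_cache_mib 32768     # 640
kv_cache_mib 65536     # 1280
kv_cache_mib 131072    # 2560
kv_cache_mib 262144    # 5120
```

Added to 19.7 GiB of weights, the 262144 case alone is about 5 GiB of cache, which is consistent with that context not fitting on the card unless some expert layers move to CPU.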

Benchmarks (Q4_K_M, RTX 4090, -ngl 99 -fa on)

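Speed vs. expert offload:

| --n-cpu-moe | prefill t/s | gen t/s |
|---|---|---|
| 0 | 5219 | 210 |
| 4 | 1994 | 147 |
| 8 | 1267 | 110 |
| 16 | 795 | 73 |

VRAM vs. context (at --n-cpu-moe 0):

| context | VRAM @ ncmoe=0 |
|---|---|
| 32K | 21.4 GB |
| 64K | 22.0 GB (default) |
| 128K | 23.4 GB (fits, tight) |
| 262K | OOM → needs ncmoe≈6 |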

Quick start (Docker — recommended)

Requires an NVIDIA driver ≥ 550 and Docker. One-time host setup installs the NVIDIA Container Toolkit:

sudo ./scripts/00-host-prereqs.sh        # installs/registers nvidia-container-toolkit
./scripts/10-download-model.sh           # ~21 GB Q4_K_M -> ./models
docker compose up -d --build             # compiles llama.cpp from source + installs Pi, then serves
docker compose logs -f                    # wait for "server is listening" (~18s)

# use it:
curl http://localhost:8090/v1/chat/completions -H 'Content-Type: application/json' -d '{"messages":[{"role":"user","content":"hi"}]}'
docker exec -it ornith pi-ornith          # Pi coding agent (add --128k for 128K context)

Day-to-day commands (start/stop/restart/logs, mounting your code) — see docs/docker-quickstart.md.
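
Instead of watching the logs, a script can poll the server until it answers. This is a minimal sketch, assuming the container exposes llama-server's `/health` endpoint on port 8090; `wait_for_server` is a hypothetical helper, not one of the repo's scripts.

```shell
# Hypothetical helper: poll <url>/health once per second until it succeeds.
# Arguments: base URL (default http://localhost:8090), max attempts (default 60).
wait_for_server() {
  url=${1:-http://localhost:8090}
  tries=${2:-60}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors (e.g. while the model loads)
    if curl -sf "$url/health" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timeout" >&2
  return 1
}
```

For example, `wait_for_server http://localhost:8090 120 && docker exec -it ornith pi-ornith` would start Pi only once the model has finished loading.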

Quick start (bare metal)

Requires NVIDIA driver + CUDA toolkit (nvcc).

./scripts/10-download-model.sh           # model -> ./models
./scripts/20-build-llama-cuda.sh         # build llama.cpp (CUDA) -> ./build/llama.cpp
./scripts/30-install-node-pi.sh          # Node + Pi -> ./build/node
./scripts/40-configure-pi.sh             # write ~/.pi/agent/models.json
./scripts/serve-ornith.sh &              # serve on :8090  (arg = context, e.g. 131072)
./scripts/pi-ornith                       # Pi agent (--128k for 128K)

Connect from another machine (remote server)

The Ornith server already listens on 0.0.0.0:8090, so Pi can run on a different box with no model, GPU, or llama.cpp build — just Node + Pi:

# on the client machine (one-time): install Node + Pi only
./scripts/30-install-node-pi.sh

# talk to the remote server (host / host:port / full url):
./scripts/pi-remote gpu-box                  # interactive, http://gpu-box:8090
./scripts/pi-remote gpu-box --128k --continue
./scripts/pi-remote http://10.0.0.5:8090 -p "fix the failing test"

pi-remote never starts a local server — it just points Pi at the remote one (writing ~/.pi/agent/models.json for you) and health-checks it first. The endpoint is unauthenticated, so keep it on a trusted network or tunnel over SSH:

ssh -N -L 8090:localhost:8090 user@gpu-box   # then: ./scripts/pi-remote localhost
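
The three accepted argument forms (host, host:port, full URL) can be normalised with a small case statement. The function below is a hypothetical sketch of that idea, not the actual code in scripts/pi-remote; it assumes http and port 8090 as the defaults, matching the examples above.

```shell
# Hypothetical sketch: turn host / host:port / full URL into a base URL.
# Assumes plain http and port 8090 when they are not given explicitly.
normalize_server_url() {
  case "$1" in
    http://*|https://*) echo "$1" ;;             # already a full URL
    *:*)                echo "http://$1" ;;      # host:port
    *)                  echo "http://$1:8090" ;; # bare host
  esac
}

normalize_server_url gpu-box                 # http://gpu-box:8090
normalize_server_url 10.0.0.5:9000           # http://10.0.0.5:9000
normalize_server_url http://10.0.0.5:8090    # http://10.0.0.5:8090
```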

Repo layout

.
├── README.md                     # this file
├── docker-compose.yml            # build-from-source image + GPU + model volume
├── scripts/
│   ├── 00-host-prereqs.sh        # nvidia-container-toolkit (Docker path)
│   ├── 10-download-model.sh      # fetch a GGUF quant from HF
│   ├── 20-build-llama-cuda.sh    # build llama.cpp (CUDA, pinned commit)
│   ├── 30-install-node-pi.sh     # Node tarball + pi-coding-agent
│   ├── 40-configure-pi.sh        # install Pi model config
│   ├── serve-ornith.sh           # run llama-server (host)
│   ├── pi-ornith                 # run Pi against the local server (host)
│   └── pi-remote                 # run Pi against a REMOTE server (client-only)
├── docker/
│   ├── Dockerfile.source         # multi-stage: compile llama.cpp + install Pi
│   └── container/
│       ├── serve.sh              # in-container server launcher
│       └── pi-ornith             # in-container Pi launcher
├── config/
│   └── pi-models.json            # Pi model config (ornith 64K + ornith-128k)
└── docs/
    ├── docker-quickstart.md      # start/stop/daily Docker ops
    ├── docker-setup.md           # detailed Docker writeup (build internals)
    ├── baremetal-setup.md        # detailed as-built build + benchmark writeup
    └── PI_SYSTEM_PROMPT.md       # Pi's base system prompt: where it lives, how to override

models/ and build/ are created by the scripts and git-ignored.


Configuration

Env vars (set in docker-compose.yml, or -e/export for the host scripts):

| Var | Default | Meaning |
|---|---|---|
| ORNITH_CTX | 65536 | context window. 65536≈22 GB · 131072≈23.4 GB · 262144 needs ORNITH_NCMOE>0 |
| ORNITH_NCMOE | 0 | expert layers kept on CPU. 0 = whole model on GPU (fastest) |
| ORNITH_PARALLEL | 1 | concurrent request slots. >1 lets multiple Pi sessions run at once; ORNITH_CTX splits across slots (per-client = CTX/PARALLEL) |
| ORNITH_MODEL_DIR | ./models | host dir holding the GGUF (mounted at /models) |
| ORNITH_SERVER_URL | http://localhost:8090 | server the Pi client points at: host, host:port, or full url (used by 40-configure-pi.sh / pi-remote) |
| LLAMA_COMMIT / CUDA_ARCH | pinned / 89 | build-time pins (CUDA_ARCH 86=Ampere, 89=Ada, 90=Hopper) |
| NODE_VERSION / PI_VERSION | v24.18.0 / 0.80.2 | Node + Pi versions |
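
Because ORNITH_CTX is shared across slots, it helps to work out the per-client window before raising ORNITH_PARALLEL. The helper below is a hypothetical illustration of the CTX/PARALLEL rule stated above, not a script from this repo.

```shell
# Hypothetical helper: context available to each client when the total
# context (ORNITH_CTX) is split evenly across ORNITH_PARALLEL slots.
per_slot_ctx() {
  echo $(( $1 / $2 ))
}

per_slot_ctx 65536 1     # 65536
per_slot_ctx 131072 2    # 65536
per_slot_ctx 131072 4    # 32768
```

So keeping a 64K window for each of two Pi sessions would mean something like `ORNITH_CTX=131072 ORNITH_PARALLEL=2` (assuming the host scripts read these from the environment as described above), which lands at the tight 128K end of the VRAM budget.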

Notes

  • Ornith is a reasoning model: chain-of-thought goes to the API reasoning_content field, the answer to content. Reasoning tokens count against max_tokens, so set it generously; with a small budget the reasoning can use it all up and content comes back empty. (config/pi-models.json already sets reasoning: true.)
  • Only one container/process can hold the full model at full offload on a single 24 GB GPU.
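
To make the first note concrete, a request can set max_tokens explicitly and then read both fields from the reply. This is a minimal sketch: `chat_body` is a hypothetical helper, the 4096 default is an arbitrary illustrative budget, and the prompt is inserted without JSON escaping, so it must not contain quotes or backslashes.

```shell
# Hypothetical helper: build a chat-completions request body with an
# explicit max_tokens budget. No JSON escaping is done on the prompt.
chat_body() {
  prompt=$1
  max_tokens=${2:-4096}
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%d}\n' \
    "$prompt" "$max_tokens"
}

chat_body "hi" 4096

# Example call (needs a running server and jq, both assumptions here):
#   curl -s http://localhost:8090/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d "$(chat_body 'hi' 4096)" \
#     | jq '.choices[0].message | {reasoning_content, content}'
```

If content is empty while reasoning_content is long, the budget was probably spent on reasoning and max_tokens should be raised.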

Pinned versions (reference build)

| Thing | Version |
|---|---|
| llama.cpp commit | 050ee92d04c2e1f639025786dea701c70e7d4204 |
| Base images | nvidia/cuda:12.3.2-{devel,runtime}-ubuntu22.04 |
| CUDA toolkit (bare metal) | 12.3 (nvcc V12.3.103) |
| NVIDIA driver | 550.144.03 · RTX 4090 (sm_89) |
| NVIDIA Container Toolkit | 1.19.1 |
| Node / Pi | v24.18.0 (LTS) / @earendil-works/pi-coding-agent@0.80.2 |
| Model | ornith-1.0-35b-Q4_K_M.gguf (21,166,757,760 bytes) |

See docs/ for the full step-by-step writeups, including every gotcha hit during the original build (Linux has no prebuilt CUDA llama.cpp; snap Node fails outside /home; the CUDA driver-stub link fix for the from-source image; etc.).

About

Local deployment of Ornith 1.0 LLM via Docker with GPU (cuda) acceleration and Pi coding agent
