This runbook covers operating trinity_coordinator in service. It assumes
you have a working CUDA host and the canonical adapted artifact bundle.
The companion documents are:
- docs/agent_slot_provider_mapping.md: why slot 0's training label is not a provider binding.
- docs/bumblebee_unpin_playbook.md: the 15-minute job to run when a Bumblebee Hex release lands.
- guides/operations_qc.md: quality gates for code changes (not for service operation).
```bash
XLA_TARGET=cuda12 mix trinity.env.check
```

A pre-flight task that validates `XLA_TARGET` against the bundled `xla`
dependency's accepted values and (optionally) that `--artifact-dir`
exists with a `manifest.json`. Fails fast with a single readable line
before EXLA loads.
Common output:
```text
trinity.env.check: ok / xla_target=cuda12
```
Known gotcha: XLA_TARGET is validated before EXLA compiles. With the
current xla 0.10.x stack, cuda13 is accepted; cuda12 remains the
recommended default for CUDA hosts. Unsupported values such as cuda14 fail
fast with a readable Mix.Error.
Path: priv/sakana_trinity/adapted_qwen3_0_6b_layer26/.
Contents:
- `manifest.json`
- `router_head.safetensors`
- `checkpoints/*.safetensors`
The router head sha256 invariant (Phase 3 promoted) is
7ff2db0e…6c09be. Validate with:
```bash
XLA_TARGET=cuda12 mix trinity.env.check --artifact-dir \
  priv/sakana_trinity/adapted_qwen3_0_6b_layer26
```

The directory is gitignored by design: it is a generated artifact, not source. Copy it from a blessed bundle on first install.
The router emits an agent slot id (0..6). The slot label in the Sakana
checkpoint (gpt-5, gemini-2.5-pro, ...) is training metadata, not a
provider binding.
Read docs/agent_slot_provider_mapping.md
for the full contract. Three pools ship:
| Pool | Used when | Slot 0..6 all map to |
|---|---|---|
| `:default` | safe default for service runs | `:openai` provider, `gpt-4o-mini` model |
| `:mock` | `--mock-provider` runs and CI | `:mock` provider |
| `:gemini_cli_asm` | Gemini via the ASM CLI lane | `:asm` provider, `gemini-3.1-flash-lite-preview` |
A custom pool example lives in
docs/agent_slot_provider_mapping.md
under "Configuring a custom pool". That is the only way to make slot 0
actually call a gpt-5 API; there is no implicit binding.
All budget options default to nil (unbounded). On exceed the orchestrator
returns {:error, {:budget_exceeded, kind, details}} and emits a
:run_failed trace event with the same kind and details.
- Option: `:max_wall_time_ms` (budget kind `:wall_time`).
- Protects against: runaway wall time.
- Reference for a 5-turn loop: `30_000` to `60_000`.
- Checkpoint: `:turn_start`.
- Details: `%{limit_ms, elapsed_ms, checkpoint, turn}`.
- Option: `:max_provider_calls` (budget kind `:provider_calls`).
- Protects against: runaway dispatches.
- Reference for a 5-turn loop: `5` (one dispatch per turn) up to `10` if you expect verifier-revision retries.
- Checkpoint: `:before_dispatch`.
- Semantics: `:max_provider_calls = N` allows exactly N dispatches; the (N+1)th attempt aborts.
- Details: `%{limit, observed, checkpoint}` (`observed` is `N+1` at fail time).
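The "exactly N dispatches, abort on attempt N+1" rule is easy to get off by one, so here is a minimal sketch of how such a check could behave. It is an illustration of the documented semantics, not the orchestrator's actual code; the module name `BudgetSketch` and the function `check_provider_calls/2` are hypothetical.

```elixir
defmodule BudgetSketch do
  @moduledoc "Illustrative only; not the orchestrator's implementation."

  # `dispatched` is the number of dispatches already made in this run.
  # The check is assumed to run at the :before_dispatch checkpoint, so the
  # attempt being checked is number `dispatched + 1`.

  # A nil limit means unbounded.
  def check_provider_calls(nil, _dispatched), do: :ok

  # Attempt N+1 against a limit of N aborts, and `observed` reports N+1.
  def check_provider_calls(limit, dispatched) when dispatched + 1 > limit do
    details = %{limit: limit, observed: dispatched + 1, checkpoint: :before_dispatch}
    {:error, {:budget_exceeded, :provider_calls, details}}
  end

  # Attempts 1..N pass.
  def check_provider_calls(_limit, _dispatched), do: :ok
end
```

With a limit of `5`, the fifth attempt (four dispatches already made) passes and the sixth returns the `:budget_exceeded` tuple with `observed: 6`.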
- Budget kind: `:provider_latency_ms` (per-dispatch latency limit).
- Protects against: one slow dispatch dragging down the SLA.
- Reference: `30_000` for hosted LLMs; `60_000` for ASM CLI lane.
- Checkpoint: `:after_dispatch`.
- Details: `%{limit_ms, observed_ms, checkpoint, turn}`.
- Option: `:max_verifier_revisions` (budget kind `:verifier_revisions`).
- Protects against: a verifier that keeps rejecting forever.
- Reference: `3` for a 5-turn loop.
- Checkpoint: `:after_verifier_revision`.
- Counted: only Verifier dispatches whose status is not `:accepted`.
- Details: `%{limit, observed, checkpoint, turn}`.
- Option: `:max_estimated_cost_usd` (budget kind `:estimated_cost_usd`).
- Protects against: spend overrun.
- Reference: depends entirely on your pool's per-call cost.
- Requires `:cost_estimator_fn` to actually fire.
- Checkpoint: `:after_dispatch`.
- Details: `%{limit_usd, observed_usd, checkpoint, turn}`.
:cost_estimator_fn signature is (dispatch_map) :: float() where
dispatch_map includes :provider, :provider_model, :response_text,
:mode, :provider_latency_ms. The orchestrator deliberately does not
ship a pricing table; you supply the function with your current vendor
pricing.
Without a :cost_estimator_fn, setting :max_estimated_cost_usd
non-nil triggers a one-shot Logger.warning/1 per run_loop/4 call
that the budget will not fire.
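A minimal sketch of a cost estimator that fits the `(dispatch_map) :: float()` signature. Everything here is an assumption for illustration: the module name `MyApp.CostEstimator`, the per-character pricing model, and the prices themselves are placeholders, not vendor figures. Substitute your own pricing.

```elixir
defmodule MyApp.CostEstimator do
  # HYPOTHETICAL prices in USD per 1,000 response characters.
  # Placeholder numbers only; replace with your vendor's current pricing.
  @usd_per_1k_chars %{
    {:openai, "gpt-4o-mini"} => 0.0006
  }

  # HYPOTHETICAL rate applied to any provider/model pair not listed above.
  @fallback_usd_per_1k_chars 0.001

  # Takes the dispatch map and returns an estimated cost in USD as a float.
  def estimate(%{provider: provider, provider_model: model} = dispatch) do
    # Treat a missing or nil response as empty text.
    text = Map.get(dispatch, :response_text) || ""
    rate = Map.get(@usd_per_1k_chars, {provider, model}, @fallback_usd_per_1k_chars)
    # `/` always returns a float in Elixir, so the result is a float.
    String.length(text) / 1000 * rate
  end
end

# Budget options as a keyword fragment. Merging these into the options of
# run_loop/4 is assumed; check the function's docs for the exact call shape.
budget_opts = [
  max_estimated_cost_usd: 0.05,
  cost_estimator_fn: &MyApp.CostEstimator.estimate/1
]
```

Keying the table on `{provider, model}` keeps the estimator aligned with whichever pool is active, and the fallback rate means an unlisted model still counts against the budget instead of silently costing zero.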
Orchestrator.run_loop/4 accepts :trace with a keyword payload:
```elixir
trace: [
  enabled: true,
  sink: {:jsonl, "/var/log/trinity/run_2026-05-20.jsonl"},
  run_id: "deploy_42",
  content: :hash
]
```

`content: :hash` (default) redacts free-text content; `content: :full`
keeps it (for debug only; do not enable in production unless your
storage is locked down). The redactor scrubs any `api_key`,
`authorization`, `password`, `secret`, or `token` map key recursively.
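To make "scrubs map keys recursively" concrete, here is a minimal sketch of a recursive scrubber. It is not the library's redactor. The module name `RedactSketch`, the `"[REDACTED]"` marker, exact-name matching, and case-insensitive comparison are all assumptions made for illustration.

```elixir
defmodule RedactSketch do
  # Key names taken from the runbook; exact-name matching is an assumption.
  @sensitive ~w(api_key authorization password secret token)
  # HYPOTHETICAL marker; the real redactor's marker may differ.
  @marker "[REDACTED]"

  # Plain maps: replace sensitive values, recurse into everything else.
  def scrub(%{} = map) when not is_struct(map) do
    Map.new(map, fn {key, value} ->
      if sensitive?(key), do: {key, @marker}, else: {key, scrub(value)}
    end)
  end

  # Lists: recurse into each element (maps nested in lists are scrubbed).
  def scrub(list) when is_list(list), do: Enum.map(list, &scrub/1)

  # Anything else (scalars, tuples, structs) is returned unchanged.
  def scrub(other), do: other

  # Atom and string keys are both compared by name, ignoring case.
  defp sensitive?(key) when is_atom(key), do: sensitive?(Atom.to_string(key))
  defp sensitive?(key) when is_binary(key), do: String.downcase(key) in @sensitive
  defp sensitive?(_key), do: false
end
```

Note that this sketch only inspects map keys, so a secret embedded in a free-text value is not caught. That case is what `content: :hash` covers.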
Suggested rotation: rotate JSONL files daily by run_id or by date.
Each line is a single Event record.
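One way to get date-based rotation without an external tool is to derive the sink path from the current date when building the `:trace` payload. A minimal sketch follows; the module name `TraceOptsSketch`, the `daily/3` helper, and the `/var/log/trinity` default directory are illustrative assumptions, while the option keys are the ones shown above.

```elixir
defmodule TraceOptsSketch do
  # Builds a :trace payload whose JSONL sink path embeds the given date,
  # so each day's runs land in a separate file.
  def daily(run_id, date \\ Date.utc_today(), dir \\ "/var/log/trinity") do
    path = Path.join(dir, "run_#{Date.to_iso8601(date)}.jsonl")
    [enabled: true, sink: {:jsonl, path}, run_id: run_id, content: :hash]
  end
end

# For 2026-05-20 this yields the same sink path as the example payload.
trace_opts = TraceOptsSketch.daily("deploy_42", ~D[2026-05-20])
```

Pass the result as the `:trace` option. Because the path is computed per call, a long-lived service picks up the new file name on the first run after midnight UTC.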
Trace events emitted per turn (in order):
| Event | Meaning |
|---|---|
| `:run_started` | Once per `run_loop/4` call |
| `:turn_started` | Once per turn |
| `:slm_extracted` | Hidden-state vector ready |
| `:route_selected` | Coordinator head produced agent + role |
| `:provider_called` (status=started) | Dispatch attempt |
| `:provider_called` (status=ok or error) | Dispatch outcome |
| `:turn_completed` | Includes `:route_decision` map (Phase 11) |
| `:run_completed` | Once on verifier acceptance |
| `:run_failed` | Once on any error path (incl. budget) |
The :route_decision field in :turn_completed carries a JSON-safe map
with :agent_id, :role_id, :role_name, :margins,
:selection_modes, :transcript_hash, and :artifact_identity.
| Error tuple | Meaning | Likely cause | Action |
|---|---|---|---|
| `{:error, :coordinator_load_error, msg}` | Coordinator could not load | Wrong `XLA_TARGET`, missing artifact dir, or EXLA not built for CUDA | `mix trinity.env.check --artifact-dir <path>` |
| `{:error, :missing_slm_context}` | No SLM context supplied | Caller passed neither `:slm_context` nor `:extractor_fn` | Supply one (test or production) |
| `{:error, :verifier_before_worker_response}` | Verifier role selected before any Worker dispatched | Prompt biases verifier on turn 0 | Re-prompt with a non-verifier opener or call `RoleInjector.role_atom("Worker")` first |
| `{:error, {:provider_dispatch_failed, reason, latency_ms}}` | Provider returned an error | Auth, network, vendor outage | Inspect `reason`; retry if transient |
| `{:error, {:budget_exceeded, :wall_time, _}}` | `:max_wall_time_ms` hit | Loop took too long | Raise the limit or shrink the prompt |
| `{:error, {:budget_exceeded, :provider_calls, _}}` | `:max_provider_calls` hit | Loop tried more dispatches than budget | Raise or accept partial answer |
| `{:error, {:budget_exceeded, :provider_latency_ms, _}}` | One dispatch was too slow | Slow vendor, slow network, or the vendor is rate-limiting | Raise limit, switch pool, or backoff |
| `{:error, {:budget_exceeded, :verifier_revisions, _}}` | Verifier rejected `:max_verifier_revisions` + 1 times | Bad prompt, bad worker, or genuinely-hard task | Raise limit, change prompt, or escalate |
| `{:error, {:budget_exceeded, :estimated_cost_usd, _}}` | Cost exceeded | Cumulative dispatches above limit | Raise limit or shrink loop |
| `{:error, :max_turns_reached}` | Loop hit `:max_turns` with no verifier accept and no latest worker response | Misconfigured roles or prompt | Inspect trace |
:max_turns_reached is distinct from :max_turns_latest_worker_response
(:ok result): the latter happens when the loop hits the turn cap with
a partial answer; the former when no Worker ever produced an answer.
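The error tuples listed above can be triaged with plain pattern matching. The sketch below maps each shape to a coarse action tag. The module name `ResultTriage`, the tag atoms, and the `{:ok, _}` success shape are assumptions for illustration; only the error tuples come from this runbook.

```elixir
defmodule ResultTriage do
  # ASSUMPTION: success results are `{:ok, _}` tuples; adjust to the real shape.
  def triage({:ok, _result}), do: :ok

  # Environment problem: re-run `mix trinity.env.check --artifact-dir <path>`.
  def triage({:error, :coordinator_load_error, _msg}), do: {:check_env, :coordinator_load_error}

  # Caller bugs: fix the call site or the prompt rather than retrying.
  def triage({:error, :missing_slm_context}), do: {:fix_caller, :missing_slm_context}
  def triage({:error, :verifier_before_worker_response}), do: {:fix_prompt, :verifier_before_worker_response}

  # Provider failures may be transient; the caller decides based on `reason`.
  def triage({:error, {:provider_dispatch_failed, reason, _latency_ms}}), do: {:maybe_retry, reason}

  # All budget kinds share one shape; keep kind and details for alerting.
  def triage({:error, {:budget_exceeded, kind, details}}), do: {:budget, kind, details}

  def triage({:error, :max_turns_reached}), do: {:inspect_trace, :max_turns_reached}

  # Anything unrecognized is surfaced rather than swallowed.
  def triage(other), do: {:unknown, other}
end
```

Matching `:budget_exceeded` once, instead of per kind, means a newly added budget kind still routes to the budget handler without a code change.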
The governed-authority path in
README.md § "Running The Router"
is the production way to ship credentials. The orchestrator code does
not read System.get_env/1 directly; runtime env reads live in
config/runtime.exs or a Config.Provider.
Trace output records provider/model labels, opaque refs, hashes, and fixed redaction markers, never materialized secret values.
```bash
# Stop the runtime (your supervisor / systemd unit).
sudo systemctl stop trinity-coordinator

# Copy new artifact.
rsync -av --delete /path/to/new/adapted_qwen3_0_6b_layer26/ \
  priv/sakana_trinity/adapted_qwen3_0_6b_layer26/

# Validate.
XLA_TARGET=cuda12 mix trinity.env.check --artifact-dir \
  priv/sakana_trinity/adapted_qwen3_0_6b_layer26
XLA_TARGET=cuda12 mix trinity.hitl.adapted
XLA_TARGET=cuda12 mix run examples/qwen_router_prompt_eval.exs \
  --snapshot examples/fixtures/qwen_router_prompt_eval_logits.json \
  --determinism-runs 2

# Restart.
sudo systemctl start trinity-coordinator
```

If the eval script's default margin floors fire (Phase 11 Option D:
`--min-agent-margin 0.24`, `--min-role-margin 1.06`), the new artifact
materially changed routing confidence. Compare to the previous
snapshot before promoting.
Recommended sink today: JSONL file plus a logrotate or daily file
naming convention. Each line is one Event record.
A Telemetry bridge is on the roadmap (not landed). When it lands,
trinity_coordinator will emit Telemetry events that mirror the JSONL
event schema, so external Phoenix or Telegraf consumers can subscribe
without parsing files.
```bash
sudo systemctl stop trinity-coordinator
rsync -av --delete /path/to/previous/adapted_qwen3_0_6b_layer26/ \
  priv/sakana_trinity/adapted_qwen3_0_6b_layer26/
XLA_TARGET=cuda12 mix trinity.env.check
XLA_TARGET=cuda12 mix trinity.hitl.adapted
sudo systemctl start trinity-coordinator
```

Confirm:

```bash
XLA_TARGET=cuda12 mix run examples/qwen_router_prompt_eval.exs \
  --snapshot examples/fixtures/qwen_router_prompt_eval_logits.json \
  --determinism-runs 2
```

37/37 PASS is the expected result.