Deep dive into the system's design. If the README answers «what does this do», this document answers «how exactly is it built inside». Read top to bottom: problem → five layers → key mechanisms → state model → flow.
v0.4 (2026-06-11): the pre-gate engagement-workflow engine gains opt-in activation flags (`args.A`), each default-OFF and byte-inert when off: `consGuard`, `repoPortable`, `contracts`, `replan`, `renderEval`, `cheapTiers` (see §8.5). Plus the `acceptance-protocol` skill split into a hub + 6 references, precheck check-counts corrected (S=6 / M=13 / L=21), conductor-side `events.jsonl` emission for the pre-gate cascade, and a new `render-eval` validator (59 agents · 30 validators). See `CHANGELOG.md`.

v0.3 (2026-06-05): the pre-gate cascade (plan → deliver → validate → handoff → gate) is now the engagement-workflow Workflow, conducted by the main loop and stopping at the handoff seam; the LangGraph human-gate (consilium → directive → manager acceptance) remains the active path after the seam. Domain leads are planning-only (the `lead:plan` step); specialist coordination is structural, encoded as waves in the lead's plan. See §8.5 and `CHANGELOG.md`.

v0.2.4 (2026-05-28): Windows-compatibility bugfix for three defects: the `claude.CMD` npm-wrapper truncates multiline argv at the first newline (CMD line-parsing); `subprocess.run(text=True)` decodes UTF-8 Russian as cp1251 on Russian-locale Windows; the `consilium_synth_completed` ledger emit was passing the raw natural verdict to a schema expecting `ACCEPT/REJECT/DIRECTED`. All three are fixed across 4 scripts (`adversary_lg.py` / `validator_lg.py` / `engagement_lg.py` / `scripts/lib/precheck/common.py`): `find_claude_cmd()` resolves `.CMD` → `claude.exe` via the npm wrapper layout (Unix/macOS path unaffected); 10 `subprocess.run` sites got `encoding="utf-8", errors="replace"`; an inline `VERDICT_MAP` mirrors the per-role mapping in `_make_finalize_node`. All `--invoker mock` tests passed pre-fix; the defects stayed latent because real subscription mode had not been exercised on Windows until a real run surfaced them.

v0.2.3 (2026-05-28): `engagement_lg.py` end-to-end across all 11 nodes (Send fan-out to specialists, `validator_lg.py` + `adversary_lg.py` subprocess delegation, manager subprocess with canonical-verdict parse, REJECT_NOW short-circuit, engagement-archive on ACCEPT). NEW `--mock` execution mode (mutually exclusive with `--real`) runs real graph paths but with canned subprocess wrappers, which enables a full end-to-end smoke without the claude CLI. 7 paths verified on synthetic engagements. See `CHANGELOG.md` for the v0.2.3 delta.

v0.2.2 (2026-05-28): incremental: modular decomposition of `handoff-precheck.py` into the `scripts/lib/precheck/` package (8 topic modules); the third LangGraph engine added, `engagement_lg.py`, owning the engagement-level lifecycle with 8 node placeholders, 3 HITL pause points, and intake/plan wired to `size-detect.py --auto-promote` + `claude -p --agent {domain}-lead` subprocess. 3 new ledger payload types (`engagement_completed` / `phase_skipped` / `dryrun_marker`). See `CHANGELOG.md` for the v0.2.2 delta.

v0.2.1 (2026-05-28): this document reflects the full v0.2 + refinement state. Major shifts versus v0.1: acceptor / system-optimizer split (`*-manager` ≠ `*-director`), authority-and-conflict-resolution invariant, per-engagement reflection layer, append-only event ledger (`engagement/events.jsonl`), canonical validator envelope, tier-keyed validator dispatch (`validator_lg.py --auto` default on M/L), critical-pause HITL in `validator_lg.py`. v0.2.1 added per-role consilium events in `adversary_lg.py`, golden-set parity across all 3 domains, and hot-path / cold-path split in 3 heavily-loaded skills.
Multi-agent pipelines on a single model family suffer from three systemic pathologies:
- Framing contamination. When the same model plays multiple roles (writer, reviewer, judge), they share the same blind spots. A «second pass with the same brain» yields the same answer in different words.
- Goodhart on validators. Validators degenerate into format-gates, checking field presence instead of thinking quality. The bar drifts from «is this right?» to «does it pass the check?».
- Undifferentiated rigour. A button tweak and a landing redesign go through the same pipeline with the same review depth. Small tasks pay the cost of large ones; large ones don't get enough attention.
The system addresses these with:
- Tier-aware dispatch (S/M/L) — different review depth based on task size
- Filesystem-isolated adversary in fresh subprocesses
- Cross-family second opinion via Codex MCP (different model family = different blind spots)
- Human supreme judge at one critical transition
- Acceptor / optimizer separation: `*-manager` accepts engagements, `*-director` improves the corpus (Codex proposes edits, the director judges against a golden set)
- Event ledger records the full lifecycle in `engagement/events.jsonl` for post-hoc replay and analytics
flowchart TB
subgraph Human ["Human layer"]
H1[Trigger phrase in chat]
H2[Supreme judge on M/L acceptance]
H3[Commons-maintainer for SkillOpt promotions]
end
subgraph Agents ["Agents layer · 59 agents"]
AM[Managers · 3]
AD[Directors · 3]
AL[Leads · 3 · plan-only]
AS[Specialists · 20]
AV[Validators · 30]
end
subgraph Skills ["Skills layer · 46 skills"]
SK1[Agency protocol · 8]
SK2[Dev methodology · 16]
SK3[Design methodology · 8]
SK4[Marketing methodology · 5]
SK5[Yandex hub · 6]
SK6[Skill development · 3]
end
subgraph Orch ["Orchestration · Workflow engines + 17 main + 3 optional scripts"]
O1[Mechanical gates]
O2[engagement-workflow · pre-gate Workflow]
O3[Adversary bridge · LangGraph]
O4[Validator bridge · LangGraph]
O5[Consilium synth + present]
O6[Director-verdict check]
O7[Archival]
O8[Shared libs · lib/ledger.py + lib/precheck/]
end
subgraph State ["State layer · engagement/"]
ST1[criteria.md · plan.md · handoff.md]
ST2[validation-outputs · executor-reports]
ST3[consilium-summary · human-directive · acceptance-log]
ST4[engagement-reflections · events.jsonl]
end
Human <--> Agents
Agents <--> Skills
Agents <--> Orch
Orch <--> State
Agents <--> State
classDef human fill:#fef3c7,stroke:#d97706,color:#000
classDef agents fill:#dbeafe,stroke:#2563eb,color:#000
classDef skills fill:#dcfce7,stroke:#16a34a,color:#000
classDef orch fill:#fce7f3,stroke:#db2777,color:#000
classDef state fill:#e9d5ff,stroke:#9333ea,color:#000
class H1,H2,H3 human
class AM,AD,AL,AS,AV agents
class SK1,SK2,SK3,SK4,SK5,SK6 skills
class O1,O2,O3,O4,O5,O6,O7,O8 orch
class ST1,ST2,ST3,ST4 state
Each layer has a clear scope of responsibility:
| Layer | What it does | What it does NOT do |
|---|---|---|
| Human | Triggers intake, issues supreme verdict on M/L, approves SkillOpt promotions | Doesn't write markdown, doesn't validate routinely |
| Agents | Plan, execute, judge, run the SkillOpt loop (out-of-band) | Don't write scripts, don't determine tiers; leads plan but don't dispatch (the engagement-workflow does), managers don't dispatch |
| Skills | Loaded into agent context — methodology, protocols, tool guides | Not invoked directly from chat; some are triggers:-aware for the loader |
| Orchestration | Mechanical gates, adversary bridge, consilium, event ledger | Doesn't make judgments about content quality |
| State | Stores all engagement state on FS — every fact lives in a file | No logic — only schemas, whitelist, mutability rules |
Skills are methodology references and protocols. Loaded into agent
context via `skills:` in frontmatter. A skill has no chat trigger by
default; PROTOCOL skills now expose a `triggers:` frontmatter field
read by the loader.
| Category | # | Purpose |
|---|---|---|
| Agency protocol | 8 | Agency work contract: schemas, lifecycle, gates, acceptance, system-optimization |
| Dev methodology | 16 | TDD, code review, spec planning, deploy, security |
| Design methodology | 8 | Brand, design system, UI/UX, presentation |
| Marketing methodology | 5 | SEO, semantic drift, AI visibility, task decomposition, benchmark research (standalone entry-point) |
| Regional SEO/PPC stack | 6 | API integrations for Russian-market analytics platforms |
| Skill development | 3 | Authoring and testing of skills themselves |
---
name: code-writing
domain: dev
triggers:
  - "loaded by every lead / manager via skills frontmatter"
description: |
  [METHODOLOGY] Universal quality coding process (plan, TDD, reviews).
---

| Tag | Meaning |
|---|---|
| `[PROTOCOL]` | Mandatory contract (read by agents, not invoked) |
| `[METHODOLOGY]` | Methodology reference (how to do work properly) |
| `[TOOL]` | Description of an integration with a specific tool |
- `agency-intake`: single entry point. Classifies the domain (dev/design/marketing), creates `criteria.md`, hands the engagement to the matching domain-lead.
- `engagement-protocol`: canonical contract covering schemas of all artefacts, whitelist of paths, mutability rules, iteration budget, authority invariant.
- `acceptance-protocol`: per-engagement acceptance methodology used by all `*-manager` agents (tier-aware S/M/L verdict shape, reflection emission).
- `system-optimization-protocol`: SkillOpt loop used by all `*-director` agents (reflect → bounded edit → golden gate → promote / reject).
- `engagement-contract`: minimal specialist subset of engagement protocol; loaded by 20 specialists via frontmatter.
- `validation-pipeline`: which validator to run when, in what order, how to log to `validation-log.md`; tier-keyed dispatch matrix + `validator_lg.py --auto` default on M/L.
- `docs-pipeline`: documentation-diff artefact handling.
- `codex-bridge`: Codex CLI integration as an MCP server for cross-family adversary and image generation.
Three heavily-loaded skills have their cold sections moved into
on-demand references/{topic}.md files:
- `engagement-protocol/SKILL.md` (1128 → 963 lines) + 7 references: `cross-domain`, `dangerous-ops`, `archival`, `ux-heavy`, `abort`, `resume`, `budget`.
- `ui-ux-methodology/SKILL.md` (664 → 347 lines) + 2 references: `quick-reference` (10 priority categories), `professional-ui-rules` (Common Rules + Pre-Delivery Checklist).
- `dev-methodology/SKILL.md` (441 → 351 lines) + 2 references: `skills-ecosystem`, `agents`.
Net effect: −572 lines loaded into every engagement that pulls these skills; +515 lines parked in references that load only when the hot-path summary indicates the section is relevant.
59 Claude Code subagents in ~/.claude/agents/. Each agent has
frontmatter with name, description, model, skills:,
allowed-tools:. The mirror organises agents into
agents/{directors,leads,managers,specialists,validators}/ subdirectories.
dev-manager, design-manager, marketing-manager. Per-engagement
acceptor role: judge between producer and adversary, never plans,
never dispatches, never edits authorial artefacts, never re-runs
validators sweep-style.
| Action | Manager |
|---|---|
| Read `handoff.md`, `validation-outputs/*`, `consilium-summary.md`, `human-directive.md` | ✓ |
| Write verdict per directive in `acceptance-log.md` | ✓ (M/L only) |
| Adjudicate on each adversary↔author disagreement (UPHELD / OVERRULED / DEFERRED) | ✓ |
| Emit 0–3 actionable reflections to `engagement-reflections.md` | ✓ (M/L only) |
| Dispatch work | ✗ |
| Edit handoff content | ✗ |
| Re-run validators sweep-style | ✗ |
| Targeted single validator re-run on L-tier when adversary identifies coverage gap | ✓ (max 1 per validator per iteration) |
| Possible verdicts | ACCEPT / REJECT / ESCALATE |
On S-tier the manager isn't engaged — producer self-attests + mechanical checks + human glance.
dev-director, design-director, marketing-director. Per-domain
system-optimizer role (out-of-band, not per-engagement). Runs the
SkillOpt-style skill-evolution loop on accumulated REJECT/rework
signals from the manager's skill-evolution-log.md:
- Reflect: cluster signals by `target × class` (`rule_missing` / `rule_wrong` / `rule_ignored`); fires only at ≥3 same-class signals.
- Codex proposes bounded edits (cross-family, which kills defend-bias). Edit budget L: 4–6 patches per cycle, ≤10 lines each.
- Judge against golden set: the director (not Codex) verifies the edit doesn't regress any scenario in `skills/system-optimization-protocol/golden/{domain}/`.
- Promote or reject. Passing edits promote to the `~/.claude/` tree. Rejected edits land in `<your-memory>/skill-rejected-edits.md` (negative memory; read before next cycle).
The director never authors edits itself — Codex is the proposer, the director is the judge, the human is the commons-maintainer for cross-domain promotions.
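The threshold in the reflect step can be illustrated with a short Python sketch. It assumes each signal is a dict with hypothetical `target` and `cls` keys; the actual entry format of `skill-evolution-log.md` is not shown in this document, so treat the data shape as an assumption.

```python
from collections import Counter

THRESHOLD = 3  # reflect fires only at >=3 same-class signals


def due_clusters(signals, threshold=THRESHOLD):
    """Group signals by (target, class) and keep clusters at or above the threshold."""
    counts = Counter((s["target"], s["cls"]) for s in signals)
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical signals: three of one kind, one of another.
signals = [
    {"target": "skills/code-writing", "cls": "rule_missing"},
    {"target": "skills/code-writing", "cls": "rule_missing"},
    {"target": "skills/code-writing", "cls": "rule_missing"},
    {"target": "agents/dev-lead", "cls": "rule_wrong"},
]
clusters = due_clusters(signals)
```

Only the first cluster reaches the threshold; the single `rule_wrong` signal stays below it and would wait for more evidence.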
dev-lead, design-lead, marketing-lead — planning-only. Each is
the lead:plan step inside the engagement-workflow: it reads
criteria.md and returns the plan — specialists, validators, and tasks
grouped into waves with dependencies. The lead does NOT dispatch; the
Workflow script fans specialists out by task.owner, wave by wave
(same-wave tasks run in parallel on disjoint files; a later wave may
depend on an earlier one). The coordination a separate coordinator layer
would otherwise provide is structural here — encoded as the wave
grouping in the plan, so tier complexity scales by adding waves, not by
adding a routing tier.
Doers. Each receives an atomic task from the lead's plan (fanned out by
the engagement-workflow, not dispatched by the lead), does the work, and
hands back a report in `engagement/executor-reports/`. All 20 specialists
load the `engagement-contract` skill (a 6-bullet specialist subset of the
engagement protocol) via frontmatter.
- Dev (8): `dev-backend-engineer`, `dev-frontend-engineer`, `dev-fullstack-engineer`, `dev-devops-engineer`, `dev-qa-engineer`, `dev-tech-architect`, `dev-product-analyst`, `dev-technical-writer`
- Design (5): `design-ux-designer`, `design-ui-designer`, `design-visual-designer`, `design-brand-strategist`, `design-presentation-designer`
- Marketing (7): `marketing-copywriter`, `marketing-banner-designer`, `marketing-seo-specialist`, `marketing-ppc-specialist`, `marketing-keyword-researcher`, `marketing-web-analyst`, `marketing-ai-visibility-specialist`
Narrowly-specialized reviewers that the lead invokes for a specific
reason. Each has a skill-binding with a review methodology and returns
a structured JSON report saved to engagement/validation-outputs/.
With v0.2 every output gets a canonical envelope alongside the
raw fields (normalized verdict + severity + validator_type).
| Validator | What it checks |
|---|---|
| `code-reviewer` | Code quality after implementation |
| `security-auditor` | OWASP Top 10, secrets leakage, auth flows |
| `accessibility-validator` | WCAG AA, ARIA, keyboard nav, contrast |
| `performance-validator` | N+1, memory leaks, hot paths |
| `migration-validator` | DB migration safety (atomicity, rollback) |
| `test-reviewer` | Test quality and strategy |
| `reality-checker` | Cross-check task files against actual code state |
| `skeptic` | Mirage detection: non-existent files / functions / deps |
| `completeness-validator` | Bidirectional spec → tasks traceability |
| `task-validator` | Task file vs template compliance |
| `tech-spec-validator` | Tech-spec template compliance |
| `userspec-quality-validator` | User-spec quality (structure, edge cases) |
| `userspec-adequacy-validator` | Adequacy of solution for the stack |
| `feasibility-assessor` | Research verdict validation |
| `infrastructure-reviewer` | Setup quality (folder structure, hooks, Docker) |
| `deploy-reviewer` | CI/CD pipeline, secrets management |
| `pre-deploy-qa` | Acceptance testing before deploy |
| `post-deploy-qa` | Acceptance testing after deploy on live env |
| `interview-completeness-checker` | Interview completeness in user-spec planning |
| `documentation-reviewer` | Project-knowledge documentation quality |
| `prompt-reviewer` | LLM prompt quality |
| `anti-pattern-detector` | Hidden failure modes in diffs |
| `ux-review` | Exercised narrative on ux-heavy engagements (drive mode: re-exercises a live preview via Playwright when a URL is supplied) |
| `render-eval` | Renders artefact-mode HTML in a real browser, checks the rendered result against contract assertions / criteria |
| `code-researcher` | Codebase research for a feature (Layer 5) |
| `design-system-researcher` | Existing design system audit before redesign (Layer 5) |
| `brand-context-researcher` | Existing brand history audit before brand work (Layer 5) |
| `product-context-validator` | Cross-domain product coherence |
| `task-creator` | Task file generation from tech-spec |
| `skill-checker` | Skill compliance with skill-authoring standards |
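The canonical envelope mentioned before the table (normalized verdict + severity + validator_type, stored alongside the raw fields) might look roughly like the sketch below. The `envelope` key, the raw verdict strings, and the fail-closed default are all assumptions made for illustration; the real schema lives in the protocol skills and is not reproduced here.

```python
# Hypothetical mapping from raw validator verdicts to a normalized verdict.
NORMALIZED = {"pass": "ACCEPT", "ok": "ACCEPT", "fail": "REJECT"}


def wrap_with_envelope(raw: dict, validator_type: str) -> dict:
    """Keep every raw field and add a canonical envelope next to them."""
    raw_verdict = str(raw.get("verdict", "")).lower()
    envelope = {
        # Assumption: an unknown verdict fails closed rather than passing silently.
        "verdict": NORMALIZED.get(raw_verdict, "REJECT"),
        "severity": raw.get("severity", "info"),
        "validator_type": validator_type,
    }
    return {**raw, "envelope": envelope}


report = wrap_with_envelope({"verdict": "pass", "issues": []}, "code-reviewer")
```

Keeping the raw fields untouched means a consumer that only understands the envelope and one that reads validator-specific details can share the same file.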
17 main + 3 optional Python scripts in ~/.claude/scripts/. All scripts
are exit-code gates: a non-zero exit blocks the pipeline, no
prompts, no negotiations. No model judgment — purely deterministic
logic.
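As an illustration of the exit-code-gate pattern, here is a minimal Python sketch. The function name `run_gate` and the two file-presence checks are hypothetical and far simpler than the real scripts; the point is only that the result is a deterministic exit code with no prompt and no model call.

```python
import sys
import tempfile
from pathlib import Path


def run_gate(engagement_dir: Path) -> int:
    """Return 0 when every mechanical check passes, 1 otherwise."""
    checks = {
        "criteria.md present": (engagement_dir / "criteria.md").is_file(),
        "handoff.md present": (engagement_dir / "handoff.md").is_file(),
    }
    failed = [name for name, ok in checks.items() if not ok]
    for name in failed:
        print(f"FAIL: {name}", file=sys.stderr)
    return 1 if failed else 0


# Demo on a throwaway directory; a real gate script would end with
# sys.exit(run_gate(Path("engagement"))) so the pipeline sees the code.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    empty_code = run_gate(root)
    (root / "criteria.md").write_text("size: S\n", encoding="utf-8")
    (root / "handoff.md").write_text("# handoff\n", encoding="utf-8")
    full_code = run_gate(root)
```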
| Script | Purpose |
|---|---|
| `adversary_lg.py` | LangGraph adversary bridge: 5 reviewer roles, two-pass curated-view isolation, Send fan-out, SQLite-checkpointed `--resume`, native HITL via `interrupt()`, event ledger wired (lifecycle + per-role + early-return guards) |
| `validator_lg.py` | LangGraph atomic-validator fan-out via Send, retry edge, auto-plan from `criteria.md` predicates, `--resume`, native HITL via `--interrupt-on-critical`, canonical envelope, event ledger wired (8 emit sites) |
| `ledger-emit.py` | CLI emitter for the append-only event ledger (`engagement/events.jsonl`) |
| `skillopt-ready.py` | SkillOpt due-signal harvester: clusters manager-emitted signals + orphan reflections by `target × class` and reports when a director cycle is due |
| `check-agent-models.py` | Roster-agnostic lint: fails if any agent is missing a `model:` line or declares a tier outside {opus, sonnet, haiku}; hardcodes no agent names |
| `consilium-synth.py` | Adversary output aggregation, two-stage dedup |
| `consilium-present.py` | Chat-ready format with decision menu for the human |
| `director-verdict-check.py` | Mechanical adjudication completeness (legacy name; in v0.2 targets `*-manager` verdict) |
| `handoff-precheck.py` | Tier-aware hard gate (S=6 / M=13 / L=21 checks), event ledger wired (per-check emit) |
| `human-directive.py` | Scaffold `human-directive.md` from CLI args |
| `preflight.py` | Tools availability (Codex CLI, Python deps) |
| `danger-scan.py` | Registry of dangerous operations (DROP TABLE, force-push, prod deploy) |
| `handoff-paths-check.py` | Phantom path detection |
| `cross-val-check.py` | Verbatim quote verification in §4a table |
| `trace-schema-check.py` | Trace JSON schema + staleness check |
| `size-detect.py` | Tier detection at intake / runtime with `--auto-promote` |
| `engagement-archive.py` | Idempotent engagement archival |
- `lib/ledger.py`: append-only event ledger. Per-engagement `engagement/events.jsonl`. Schema v1 with 17 fields per event, 28 `KNOWN_PAYLOAD_TYPES`, assert guards, forward-only with synthetic `legacy_import` event for pre-ledger engagements, replay-friendly schema versioning. Helpers: `emit_authority_conflict()`, `emit_skill_snapshot()`, `snapshot_skills()`, `hash_input()`.
- `lib/precheck/`: modular precheck package (v0.2.2). 8 topic modules: `common` (constants, shared utilities), `criteria` (frontmatter validation), `handoff` (`handoff.md` structural checks), `iteration` (counter consistency), `validators` (`validation-outputs/*` presence and shape), `acceptance` (acceptance-log + director-verdict), `danger` (danger-scan registry), plus `__init__` re-exports. `handoff-precheck.py` (1264 → 423 lines, CLI/dispatcher only) imports from this package. Topic-clustered for the way managers / leads think about the system, not by mechanism. Byte-identical JSON output to the pre-refactor monolith on real engagement smoke.
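To make the append-only idea concrete, the sketch below writes and replays a JSON Lines ledger. It is not the `lib/ledger.py` API: the `emit` and `replay` functions and the four-field event are hypothetical stand-ins, and the real schema v1 carries 17 fields that are not reproduced here.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def emit(ledger: Path, payload_type: str, payload: dict) -> dict:
    """Append one event as a single JSON line; earlier lines are never rewritten."""
    event = {
        "schema_version": 1,
        "ts": datetime.now(timezone.utc).isoformat(),
        "payload_type": payload_type,
        "payload": payload,
    }
    with ledger.open("a", encoding="utf-8") as fh:  # mode "a" only ever appends
        fh.write(json.dumps(event, ensure_ascii=False) + "\n")
    return event


def replay(ledger: Path) -> list:
    """Read the ledger back in write order for post-hoc analysis."""
    with ledger.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "events.jsonl"
    emit(ledger, "engagement_started", {"tier": "M"})
    emit(ledger, "consilium_synth_completed", {"verdict": "ACCEPT"})
    events = replay(ledger)
```

One event per line keeps replay trivial: a reader can stream the file and stop at any point without parsing the whole history.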
engagement-doctor.py, engagement-migrate.py, token-budget.py —
opt-in utilities outside the core protocol.
adversary.py (legacy stdlib-only bridge) was removed in v0.2 after
adversary_lg.py absorbed its functionality and the auto-synth pipeline
landed. The 5 dead optional scripts (cross-val-template,
director-sweep, metrics-summary, secondary-init, validator-retry)
were removed in an earlier cleanup.
An engagement is a directory. State is read entirely from the filesystem. No databases, no external logs, no conversation state — if a fact matters, it lives in a file.
engagement/
├── criteria.md # secretary, semi-immutable
├── scope-sync.md # manager, optional, append-only
├── plan.md # lead, frozen after first dispatch
├── specs/ # dev: user-spec / tech-spec / research-verdict
├── tasks/ # all domains: atomic task files
├── tasks/INDEX.md # mandatory if size: L
├── brand/ # design only
├── design-system/ # design only
├── ui/ # design only
├── executor-reports/ # specialists, append-only
├── code-research.md # OPTIONAL — code-researcher output
├── design-research.md # OPTIONAL — design-system-researcher output
├── brand-research.md # OPTIONAL — brand-context-researcher output
├── validation-log.md # lead, append-only
├── validation-outputs/ # JSON proof-of-run for each validator (raw + canonical)
├── consilium-summary.md # auto-write from consilium-synth.py (M/L)
├── human-directive.md # mandatory on M/L
├── codex-outputs/ # Codex assets (optional)
├── iteration # plain-text counter
├── screens/iter-N/ # mandatory for ux_heavy
├── traces/iter-N/ # JSON traces of flow logs
├── deploy-log.md # dev only when deploy boundary crossed
├── docs-diff.md # docs pipeline only
├── handoff.md # lead, REPLACED per iteration
├── acceptance-log.md # manager, append-only
├── engagement-reflections.md # manager, append-only on M/L verdict (0–3 actionable lessons)
└── events.jsonl # append-only event ledger
Files outside the whitelist are a protocol violation. The
manager's red-flag check does ls engagement/ — any extraneous file
is grounds for REJECT with reason «out-of-whitelist artefact».
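The whitelist check can be approximated in Python as follows. The `WHITELIST` set is transcribed from the top-level entries of the directory tree above; the helper name `extraneous` is hypothetical, and the real red-flag check is performed by the manager rather than by this code.

```python
import tempfile
from pathlib import Path

# Top-level names taken from the engagement/ tree above.
WHITELIST = {
    "criteria.md", "scope-sync.md", "plan.md", "specs", "tasks", "brand",
    "design-system", "ui", "executor-reports", "code-research.md",
    "design-research.md", "brand-research.md", "validation-log.md",
    "validation-outputs", "consilium-summary.md", "human-directive.md",
    "codex-outputs", "iteration", "screens", "traces", "deploy-log.md",
    "docs-diff.md", "handoff.md", "acceptance-log.md",
    "engagement-reflections.md", "events.jsonl",
}


def extraneous(engagement_dir: Path) -> list:
    """Return top-level entries that are not on the whitelist."""
    return sorted(p.name for p in engagement_dir.iterdir() if p.name not in WHITELIST)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "criteria.md").write_text("size: S\n", encoding="utf-8")
    (root / "notes.txt").write_text("scratch\n", encoding="utf-8")
    offenders = extraneous(root)
```

A non-empty result corresponds to the «out-of-whitelist artefact» rejection reason.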
| Artefact | Mutability |
|---|---|
| `criteria.md` | semi-immutable (additions/removals only by user via scope-sync) |
| `plan.md` | mutable until first dispatch, then frozen |
| `handoff.md` | replaced wholesale per iteration |
| `validation-log.md`, `acceptance-log.md`, `executor-reports/*.md`, `engagement-reflections.md`, `events.jsonl` | append-only |
| `validation-outputs/*.json` | immutable after write |
Tier is determined at intake from criteria.md frontmatter:
---
engagement: <name>
domain: dev | design | marketing
size: S | M | L
ux_heavy: false | minor | true
tools_required: [...]
---

| Tier | Use case | Schema | Adversary | Manager | Validator dispatch | Mechanical checks |
|---|---|---|---|---|---|---|
| S | Hotfix, single deliverable | 4-section minimum | None | None | engagement-workflow validate phase | 6 |
| M | Multi-specialist, mid-stakes | Full 11-section | 1× peer-opus | Judge mode | engagement-workflow validate phase | 13 |
| L | Cross-domain, high-stakes | Full + tasks INDEX | 5× consilium | Judge + adjudication | engagement-workflow validate phase | 21 |
Tier is chosen at intake by heuristics: number of specialists involved,
cross-domain nature, ux-heavy flag, risk profile. size-detect.py --auto-promote may promote a tier (S→M, M→L; never demote) when the
engagement grows beyond its original classification.
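The tier table and the promote-only rule can be summarised in a small Python sketch. The check counts and adversary presets come from this document; the `PRESETS` structure and the `auto_promote` helper are hypothetical, and the helper assumes promotion simply takes the larger of the current and detected tiers, which may differ from how `size-detect.py` decides.

```python
TIER_ORDER = ["S", "M", "L"]

PRESETS = {
    "S": {"checks": 6, "adversary_roles": []},
    "M": {"checks": 13, "adversary_roles": ["peer-opus"]},
    "L": {
        "checks": 21,
        "adversary_roles": [
            "peer-opus", "codex-blind", "codex-informed",
            "sonnet-scoped", "haiku-scoped",
        ],
    },
}


def auto_promote(current: str, detected: str) -> str:
    """Move to the detected tier only if it is larger; never demote."""
    return max(current, detected, key=TIER_ORDER.index)


promoted = auto_promote("S", "M")   # grows beyond its classification
unchanged = auto_promote("L", "S")  # a smaller detection does not demote
```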
The adversary runs via adversary_lg.py — a LangGraph StateGraph — in
a fresh subprocess with a filesystem-curated view. Two-pass
design separates «what would I say cold?» from «what would I say after
seeing the author's reasoning?». The graph models reviewer fan-out as
Send edges and the codex-informed dependency as a conditional edge;
--resume is artefact-driven — it re-runs only the missing or failed
reviewers.
sequenceDiagram
participant L as Lead
participant SC as adversary_lg.py
participant FS as engagement/ (real)
participant CV as curated copy
participant ADV as Adversary subprocess
participant OUT as validation-outputs/
participant LD as events.jsonl
L->>SC: --consilium {M|L}
SC->>LD: engagement_started
SC->>FS: read full state
SC->>CV: copy WITHOUT handoff.md, acceptance-log,<br/>consilium-summary, human-directive,<br/>validation-outputs/
SC->>LD: consilium_started (per role, v0.2.1)
Note over CV,ADV: Pass 1 — Blind
SC->>ADV: spawn fresh subprocess with curated view
ADV-->>OUT: {role}-iter-N-preliminary-{ts}.json
Note over CV,ADV: Pass 2 — Informed
SC->>ADV: spawn fresh subprocess with FULL state +<br/>preliminary findings injected via prompt
ADV-->>OUT: {role}-iter-N-{ts}.json (final)
SC->>LD: consilium_role_completed (verdict mapped, v0.2.1)
Note over OUT: Delta preliminary→final =<br/>framing contamination signal
SC->>LD: consilium_synth_completed
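The curated copy used for the blind pass can be sketched with the standard library. This is an illustration only, not the code in `adversary_lg.py`: the helper name `make_curated_view` is hypothetical, and `shutil.ignore_patterns` matches names at any depth, whereas the real bridge may filter only the top level.

```python
import shutil
import tempfile
from pathlib import Path

# Author-side artefacts withheld from the blind pass (per the diagram above).
EXCLUDED = (
    "handoff.md", "acceptance-log.md", "consilium-summary.md",
    "human-directive.md", "validation-outputs",
)


def make_curated_view(real: Path, curated: Path) -> None:
    """Copy the engagement directory, leaving out the excluded names."""
    shutil.copytree(real, curated, ignore=shutil.ignore_patterns(*EXCLUDED))


with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "engagement"
    (real / "validation-outputs").mkdir(parents=True)
    (real / "criteria.md").write_text("size: M\n", encoding="utf-8")
    (real / "handoff.md").write_text("# handoff\n", encoding="utf-8")
    curated = Path(tmp) / "curated"
    make_curated_view(real, curated)
    visible = sorted(p.name for p in curated.iterdir())
```

Because the reviewer subprocess only ever sees the copy, the isolation does not depend on the reviewer choosing not to read a file.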
| Role | Model | Scope |
|---|---|---|
| `peer-opus` | Anthropic Opus | Peer-level adversarial review |
| `codex-blind` | OpenAI GPT-5 (Codex CLI) | Fully independent, no prior findings |
| `codex-informed` | OpenAI GPT-5 (Codex CLI) | Reads peer-opus, focuses on gaps |
| `sonnet-scoped` | Anthropic Sonnet | «Average human» common-sense scope |
| `haiku-scoped` | Anthropic Haiku | Naive obvious-miss scope, format checks |
Capability rule: the adversary is never weaker than the producer.
Tier presets:
- S = adversary doesn't run
- M = `peer-opus` only
- L = all 5 roles in parallel (cross-family disagreement detection)
Every adversary run writes to engagement/events.jsonl via the shared
lib/ledger.py shim. Lifecycle events: engagement_started,
consilium_synth_completed, interrupt_paused/resumed,
human_directive_received. Per-role events (v0.2.1):
consilium_started (before two-pass), consilium_role_completed
(after two-pass; payload includes verdict mapping
satisfied → ACCEPT, rework_required / suspicious_too_clean →
REJECT, fail → REJECT).
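The verdict mapping above fits in a lookup table. The sketch below reuses the `VERDICT_MAP` name mentioned in the changelog, but the `map_verdict` helper and its behaviour on unknown input (raising rather than guessing) are assumptions for illustration.

```python
# Natural-language adversary verdicts mapped to the ledger's canonical values.
VERDICT_MAP = {
    "satisfied": "ACCEPT",
    "rework_required": "REJECT",
    "suspicious_too_clean": "REJECT",
    "fail": "REJECT",
}


def map_verdict(natural: str) -> str:
    """Translate a reviewer's natural verdict; refuse anything unmapped."""
    try:
        return VERDICT_MAP[natural]
    except KeyError:
        raise ValueError(f"unmapped adversary verdict: {natural!r}") from None


accepted = map_verdict("satisfied")
too_clean = map_verdict("suspicious_too_clean")
```

Note that `suspicious_too_clean` maps to REJECT: a review that finds nothing at all is treated as a signal to look again, not as a pass.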
The engagement-workflow is a Claude Code Workflow-tool script
(workflows/engagement-workflow.js) that the main loop conducts after
agency-intake locks criteria.md. It owns the whole pre-gate cascade —
from planning to the handoff gate — then STOPS at the handoff seam,
where the LangGraph human-gate (consilium -> directive -> manager
acceptance, sections 10-11) takes over. The boundary is the human
decision: everything before it is the Workflow; the human-pause and
acceptance machinery stays LangGraph.
flowchart TB
START([main loop invokes])
PLAN[discovery · lead:plan<br/>reads criteria.md → tasks / waves / validators]
DEC[decompose · gated<br/>L, or M with ≥2 specialists → tasks/*.md + INDEX]
DEL[deliver · specialist waves<br/>isolated git worktrees · per-task review→rework · per-wave consolidate]
VAL[validate<br/>validators in parallel + adversarial-verify each finding]
HO[handoff · tier-aware sections]
GATE[gate · handoff-precheck.py]
SEAM([handoff seam → LangGraph human-gate])
STOP([hard-stop · readyForAcceptance = false])
START --> PLAN --> DEC --> DEL --> VAL --> HO --> GATE --> SEAM
DEL -->|blocked / review-failed / malformed plan / consolidation-failed (consGuard)| STOP
classDef node fill:#dbeafe,stroke:#2563eb,color:#000
classDef term fill:#dcfce7,stroke:#16a34a,color:#000
class PLAN,DEC,DEL,VAL,HO,GATE node
class START,SEAM,STOP term
- discovery (`lead:plan`): the domain lead runs as the planning step: reads `criteria.md`, sizes the engagement, and returns tasks, validators, and a wave grouping (with dependencies). The lead does not dispatch.
- decompose (gated: tier L, or M with >=2 specialists): writes the atomic `tasks/*.md` + `tasks/INDEX.md`.
- deliver: sequential waves; within a wave, one specialist per task in parallel, each in its OWN git worktree branched off the current integration HEAD (so a later wave sees earlier waves' merged work). Each task carries a scoped review -> bounded-rework loop (tier budget S=1 / M=2 / L=3). At the wave barrier the phase consolidates: code mode octopus-merges the disjoint branches (overlap -> sequential-merge + resolver) and runs repo-root tests; artefact mode (design / marketing deliverables not merged as code) manifest-verifies each file exists and is non-empty. With the `consGuard` flag on, the consolidation RESULT is guarded too: a null consolidator, `merge_ok:false`, or a code-mode merge that landed but failed repo tests hard-stops the run, so dependent waves never branch off a missing or broken integration HEAD.
- validate: every validator in the plan runs in parallel (by agentType), then each non-info finding is adversarially verified by an independent skeptic (refuted findings dropped). The phase writes the per-validator `validation-outputs/{validator}-iter-N-{ts}.json` with the canonical envelope, byte-identical to what `validator_lg.py` writes, so `handoff-precheck.py` and the manager consume them unchanged.
- handoff: assembles the tier-aware `handoff.md`.
- gate: runs `handoff-precheck.py --mode ready`; returns `readyForAcceptance`.
A wave hard-stops the engagement (readyForAcceptance = false, no
consolidation) if any task is blocked, crashes, or does not genuinely pass
its scoped review, or if the plan is malformed (duplicate / unknown / orphan
task ids) — there is no silent proceed past a broken wave. With the
consGuard flag on, a failed consolidation (not just a failed task) also
hard-stops. State is the Workflow run journal: re-invoking with
resumeFromRunId replays completed steps from cache and re-runs only the
changed or failed ones.
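The malformed-plan guard can be sketched in Python, although the engine itself is a JavaScript Workflow script. The data shapes are hypothetical, and so is the reading of the three error classes: «duplicate» as a repeated id, «unknown» as a wave entry with no matching task, and «orphan» as a task placed in no wave.

```python
def plan_errors(tasks, waves):
    """tasks: list of {"id": ...}; waves: list of lists of task ids."""
    errors = []
    ids = [t["id"] for t in tasks]
    known = set(ids)
    placed = {task_id for wave in waves for task_id in wave}
    for dup in sorted({i for i in ids if ids.count(i) > 1}):
        errors.append(f"duplicate task id: {dup}")
    for task_id in sorted(placed - known):
        errors.append(f"unknown task id in wave: {task_id}")
    for task_id in sorted(known - placed):
        errors.append(f"orphan task (in no wave): {task_id}")
    return errors


# Hypothetical broken plan: t2 is declared twice and never scheduled, t3 is undeclared.
tasks = [{"id": "t1"}, {"id": "t2"}, {"id": "t2"}]
waves = [["t1"], ["t3"]]
errors = plan_errors(tasks, waves)
ready_for_acceptance = not errors  # any error means a hard stop
```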
The engine ships a set of opt-in flags, passed in the Workflow's
args.A object. Each defaults OFF, and with all flags off the engine
renders byte-for-byte identically to the unflagged path — so a flag is a
per-engagement opt-in, not a global mode change. They let an operator raise
the cascade's rigour (or trade cost) for a specific engagement.
| Flag | Phase | Effect when on |
|---|---|---|
| `consGuard` | deliver | Guards the consolidation RESULT (above): a failed merge / failed repo tests hard-stops instead of being only logged. A merge that never landed is replan-compatible; a merge that landed-but-failed-tests hard-stops without auto-replan. Bug-fix class. |
| `repoPortable` | discovery | One `detect:repo` agent detects the integration branch + test runner instead of hardcoding `main` / `python -m unittest`; a non-git `repoDir` hard-stops early in code mode. |
| `contracts` | deliver (M/L) | Per-task contract handshake: owner proposes checkable assertions in-band → a neutral reviewer co-signs and writes `tasks/{id}.md` → `## Contract (co-signed)` → owner may contest. Binds the per-task review rubric only; never waives `criteria.md`. |
| `replan` | deliver | One bounded replan per run on a wave hard-stop: completed waves locked, remaining work re-planned (ids suffixed `-r{n}`), re-validated, continue. |
| `renderEval` | deliver (artefact) | After manifest-verify, renders the wave's HTML artefacts and checks OBSERVED values against the co-signed assertions / criteria (via the `render-eval` agent). |
| `cheapTiers` | engine-wide | Mechanical steps (manifest-verify + gate-runner → haiku; adversarial-verify → sonnet) run on cheaper models; judgement steps keep the inherited model. |
A backslash-repoDir guard (validation-only, no flag) rejects
Windows-style paths that would nest worktrees inside the repo.
The pre-gate cascade is deterministic fan-out — plan, parallel waves,
worktree isolation, consolidation, validate — which the Workflow tool
expresses directly (and runs out-of-band, surviving a main-loop
compaction). The human-gate needs a durable interactive pause; that is
LangGraph's interrupt() stronghold, so consilium + acceptance stay there.
The seam is exactly the line between the two. A second Workflow engine,
skillopt-workflow.js, runs the director's SkillOpt cycle out-of-band
(section 12).
Once 5 reviewers finish, outputs are aggregated via
consilium-synth.py:
Stage 1 — Per-finding dedup. Each reviewer's findings are normalized to canonical form. Findings with high textual similarity (configurable threshold) merge into clusters.
Stage 2 — Cluster-level voting. For each cluster, the script computes:
- How many reviewers raised the issue
- Severity distribution
- Consensus strength
- Cross-family disagreement flag — if Anthropic family and Codex diverge, a separate marker for manual review
Output is a ranked consilium-summary.md:
- Consensus findings (≥3 reviewers agree)
- Outliers worth noting (1 reviewer, but high severity)
- Cross-family disagreements (Anthropic vs OpenAI diverge)
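The two stages can be sketched in Python with `difflib` standing in for whatever similarity measure `consilium-synth.py` uses. The 0.8 threshold, the finding shape, and the reading of «cross-family disagreement» as «raised by only one family» are assumptions for illustration; severity handling is omitted.

```python
from difflib import SequenceMatcher

ANTHROPIC = {"peer-opus", "sonnet-scoped", "haiku-scoped"}
CODEX = {"codex-blind", "codex-informed"}


def cluster_findings(findings, threshold=0.8):
    """Stage 1: greedily merge findings whose text is similar enough."""
    clusters = []
    for f in findings:
        for c in clusters:
            ratio = SequenceMatcher(None, f["text"].lower(), c["text"].lower()).ratio()
            if ratio >= threshold:
                c["reviewers"].add(f["reviewer"])
                break
        else:  # no existing cluster matched
            clusters.append({"text": f["text"], "reviewers": {f["reviewer"]}})
    return clusters


def vote(cluster):
    """Stage 2: count reviewers and flag clusters raised by one family only."""
    reviewers = cluster["reviewers"]
    return {
        "text": cluster["text"],
        "count": len(reviewers),
        "consensus": len(reviewers) >= 3,
        "cross_family_disagreement": bool(reviewers & ANTHROPIC) != bool(reviewers & CODEX),
    }


findings = [
    {"reviewer": "peer-opus", "text": "Missing rollback for migration 0042"},
    {"reviewer": "sonnet-scoped", "text": "missing rollback for migration 0042"},
    {"reviewer": "codex-blind", "text": "Missing rollback for migration 0042."},
    {"reviewer": "haiku-scoped", "text": "Button label is truncated on mobile"},
]
summary = [vote(c) for c in cluster_findings(findings)]
```

In this made-up input the rollback finding becomes a three-reviewer consensus spanning both families, while the button-label finding is a single-reviewer outlier seen by one family only.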
consilium-present.py formats the synthesis into a chat-ready summary
with a decision menu — designed to be read in <2 minutes.
On M/L the human gets the final word at one point: after consilium synthesis, before manager verdict.
sequenceDiagram
participant SC as consilium-present.py
participant U as Human
participant HD as human-directive.py
participant FS as engagement/
SC->>U: chat summary (≤2 min to read)
U->>U: decision
alt PROCEED
U->>HD: --decision PROCEED
else REJECT
U->>HD: --decision REJECT --reasons "..."
else DIRECTED
U->>HD: --decision DIRECTED --reasons "what to change"
end
HD->>FS: human-directive.md
Note over FS: scaffold ready for manager
Human input is fast, structured, minimal — the system formats and expands. The user doesn't write markdown, doesn't structure findings, doesn't tag severity.
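As an illustration of how little the human has to supply, a directive writer could be sketched like this. Only the `--decision` and `--reasons` flags and the three decision values come from the diagram above; the requirement that REJECT and DIRECTED carry reasons, and the markdown layout of the output, are assumptions for the example and not the actual format of human-directive.md.

```python
import argparse

DECISIONS = ("PROCEED", "REJECT", "DIRECTED")


def build_parser():
    parser = argparse.ArgumentParser(prog="human-directive-sketch")
    parser.add_argument("--decision", required=True, choices=DECISIONS)
    parser.add_argument("--reasons", default="")
    return parser


def render_directive(decision, reasons=""):
    """Expand the short human input into a markdown scaffold (hypothetical layout)."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    reasons = reasons.strip()
    # Assumed rule: only PROCEED may omit reasons.
    if decision != "PROCEED" and not reasons:
        raise ValueError(f"{decision} requires --reasons")
    lines = ["# Human directive", "", f"Decision: {decision}"]
    if reasons:
        lines += ["", "Reasons:", reasons]
    return "\n".join(lines) + "\n"
```

Calling `render_directive("DIRECTED", "tighten scope")` yields a small markdown document that a later step can read without the human ever writing markdown by hand.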
On M/L the per-engagement acceptor (*-manager) works in judge mode.
It does not dispatch, edit, or re-run validators sweep-style.
It loads the acceptance-protocol skill, which encodes the tier-aware verdict
shape, the reflection-emission constraint, and the adjudication rules.
director-verdict-check.py enforces adjudication completeness
mechanically:
- Every adversary finding must have a decision marker in acceptance-log.md: UPHELD / OVERRULED / DEFERRED with rationale
- Every directive from human-directive.md must have a corresponding acceptance-log item
- Any missed finding → exit 1, the manager rewrites the verdict
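The completeness rule can be sketched as a purely mechanical scan. The line format assumed here (`<id>: <MARKER> - <rationale>`) and the finding and directive IDs are hypothetical; the real layout of acceptance-log.md is not specified in this document.

```python
import re

# Assumed line shape: "F-1: UPHELD - rationale text"
MARKER = re.compile(
    r"^(?P<id>[\w-]+):\s*(?P<decision>UPHELD|OVERRULED|DEFERRED)\b\s*(?P<rationale>.*)$"
)


def missing_adjudications(required_ids, log_text):
    """Return the IDs that lack a decision marker with a non-empty rationale."""
    decided = set()
    for line in log_text.splitlines():
        match = MARKER.match(line.strip())
        if match and match.group("rationale").strip(" -:"):
            decided.add(match.group("id"))
    return [item for item in required_ids if item not in decided]


def exit_code(required_ids, log_text):
    """Exit 1 on any missed item, 0 when adjudication is complete."""
    return 1 if missing_adjudications(required_ids, log_text) else 0
```

A marker without a rationale counts as missing in this sketch, which mirrors the "with rationale" requirement above.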
Validator re-runs are allowed only on L-tier and only if an adversary finding explicitly identifies a validator-coverage gap. Maximum 1 re-run per validator per iteration.
After the verdict, the manager appends 0–3 reflections to
engagement-reflections.md — actionable lessons each targeting a
specific skill/agent rule (target = skills/X | agents/Y + class = rule_missing | rule_wrong | rule_ignored). Zero reflections is a
valid outcome (generic observations get discarded). These feed the
director's monthly reflection sweep, surfacing sub-threshold patterns
that accumulate across engagements.
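A minimal sketch of the reflection constraint, assuming a simple record with `target`, `cls`, and `lesson` fields (these field names are hypothetical). Generic observations fail the check and are dropped, so an empty result is a normal outcome.

```python
from dataclasses import dataclass

CLASSES = {"rule_missing", "rule_wrong", "rule_ignored"}
MAX_REFLECTIONS = 3


@dataclass(frozen=True)
class Reflection:
    target: str  # "skills/X" or "agents/Y"
    cls: str     # one of CLASSES
    lesson: str


def is_actionable(reflection):
    """A reflection must name a concrete skill/agent target and a known class."""
    prefix, _, name = reflection.target.partition("/")
    return (
        prefix in ("skills", "agents")
        and bool(name)
        and reflection.cls in CLASSES
        and bool(reflection.lesson.strip())
    )


def select_reflections(candidates):
    """Keep at most three actionable reflections; zero is valid."""
    return [r for r in candidates if is_actionable(r)][:MAX_REFLECTIONS]
```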
The *-director role is not part of any engagement. It runs the
SkillOpt-style skill-evolution loop on accumulated signals from
manager-emitted skill-evolution-log.md entries:
flowchart LR
SIG["skill-evolution-log.md<br/>≥3 same-class signals"]
REF[Director reflects<br/>cluster by target × class]
REJ[Read skill-rejected-edits.md<br/>negative memory]
CDX[Codex proposes<br/>bounded edits L=4-6]
GLD[Golden-set gate<br/>per-domain scenarios]
PROM[Promote to live]
REJBUF[Append to<br/>skill-rejected-edits.md]
SIG --> REF
REJ --> CDX
REF --> CDX
CDX --> GLD
GLD -->|pass| PROM
GLD -->|fail| REJBUF
classDef proc fill:#dbeafe,stroke:#2563eb,color:#000
classDef gate fill:#fef3c7,stroke:#d97706,color:#000
classDef out fill:#dcfce7,stroke:#16a34a,color:#000
classDef neg fill:#fee2e2,stroke:#dc2626,color:#000
class SIG,REF,CDX proc
class GLD gate
class PROM out
class REJ,REJBUF neg
Golden-set parity across domains:
| Domain | Scenarios | Failure classes covered |
|---|---|---|
| golden/dev/ | 4 (spec-code-drift / flaky-test-masking / security-gap + manager-catches-mis-rendered-consilium) | rule_ignored / rule_missing / rule_wrong |
| golden/design/ | 3 (token-drift / aria-missing / dark-contrast-fail) | rule_ignored / rule_missing / rule_wrong |
| golden/marketing/ | 3 (keyword-undercount / SEO-claim-uncited / brand-voice-pronoun) | rule_ignored / rule_missing / rule_wrong |
A real cycle fires only when ≥3 real same-class signals accumulate in
<your-memory>/skill-evolution-log.md. The synthetic dry-run on the dev
domain (Codex proposed 3 edits, the judge accepted 2, and 1 entered the
rejection buffer) is documented in the project memory.
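The firing condition and the role of negative memory can be sketched as below. The signal and proposal shapes (dicts with `target`, `class`, and `edit` keys) are assumptions for illustration; the actual log format of skill-evolution-log.md and skill-rejected-edits.md is not shown in this document.

```python
from collections import Counter

SIGNAL_THRESHOLD = 3


def ready_clusters(signals, threshold=SIGNAL_THRESHOLD):
    """Cluster signals by (target, class) and keep clusters at or above the threshold."""
    counts = Counter((s["target"], s["class"]) for s in signals)
    return sorted(key for key, n in counts.items() if n >= threshold)


def drop_rejected(proposals, rejected_edits):
    """Negative memory: skip proposals already recorded in the rejection buffer."""
    rejected = {(r["target"], r["edit"]) for r in rejected_edits}
    return [p for p in proposals if (p["target"], p["edit"]) not in rejected]
```

With fewer than three signals in any (target, class) cluster, `ready_clusters` returns an empty list and no cycle would start.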
A set of hard mechanical checks runs at every transition. All deterministic, exit-code based, no model judgment involved.
| Gate | What it does | When it fires |
|---|---|---|
| preflight.py | Tools availability (Codex CLI, Python deps) | Before any adversary run |
| danger-scan.py | Registry of dangerous operations | Before specialist dispatch |
| handoff-precheck.py | Tier-aware structural verification | Lead → Manager |
| handoff-paths-check.py | Phantom path detection | As part of precheck |
| cross-val-check.py | Verbatim quote verification | precheck (M/L) |
| trace-schema-check.py | Trace JSON schema + staleness | precheck (ux_heavy) |
| director-verdict-check.py | Adjudication completeness | After manager verdict |
| size-detect.py --auto-promote | Tier promotion when engagement grows | Periodic during execution |
Non-zero exit on any gate blocks the pipeline. Fail = stop, fix = retry.
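The "fail = stop" behaviour amounts to running gates in order and halting on the first non-zero exit code. The runner below is a sketch under that assumption; the `run_gates` helper and the demo commands are hypothetical and stand in for the real gate scripts.

```python
import subprocess
import sys


def run_gates(gates):
    """Run (name, argv) gates in order; return the first failure or (None, 0)."""
    for name, argv in gates:
        result = subprocess.run(argv)
        if result.returncode != 0:
            # Block the pipeline: later gates are never started.
            return name, result.returncode
    return None, 0


def demo_gate(exit_status):
    """Stand-in command that exits with the given status (for illustration only)."""
    return [sys.executable, "-c", f"raise SystemExit({exit_status})"]
```

For example, `run_gates([("first", demo_gate(0)), ("second", demo_gate(1))])` reports that the gate named "second" failed with exit code 1.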
Iteration budget is root-cause-based, not slot-based:
| Approach | What happens |
|---|---|
| Slot-based («3 attempts, then escalate») | Budget gaming — agents try light fixes first, burn slots |
| Root-cause-based («same root cause twice → escalate») | Forces actual engagement with the underlying issue |
Implementation: validation-outputs/*.json carry a root_cause tag
(part of the canonical envelope) that survives across iterations. The
mechanical layer detects repeated root causes and triggers escalation
regardless of attempt count.
Standard limit: 2 rework rounds on M, 3 on L. Past the hard limit, escalate to the user with a scope-sync proposal.
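A sketch of the escalation rule, assuming the `root_cause` tags of failing validators have already been collected per iteration (one set per failed iteration, oldest first). Counting the first iteration as the initial attempt and every later one as a rework round is an assumption of this example.

```python
REWORK_LIMITS = {"M": 2, "L": 3}


def escalation_reason(failed_root_causes, tier):
    """Return why to escalate, or None if another rework round is allowed."""
    seen = set()
    for causes in failed_root_causes:
        repeated = seen & set(causes)
        if repeated:
            # Same root cause twice: escalate regardless of attempt count.
            return "repeated root cause: " + sorted(repeated)[0]
        seen |= set(causes)
    rework_rounds = len(failed_root_causes) - 1
    if rework_rounds >= REWORK_LIMITS[tier]:
        return "rework limit reached"
    return None
```

Because the repeated-cause check runs first, a second failure with the same tag escalates even when the tier's rework budget is far from exhausted.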
A written 7-rule precedence resolves any disagreement between sources
of behavior. Lives in engagement-protocol/SKILL.md:
- Normative precedence (highest → lowest): CLAUDE.md > explicit judge decision > criteria.md > PROTOCOL skills > METHODOLOGY skills > agent body > frontmatter.
- criteria.md may add scope / quality bars but may not waive mandatory PROTOCOL gates without an explicit judge waiver.
- Frontmatter has zero behavioral authority; it only declares what must be loaded.
- Agent body may specialize behavior only where loaded skills are silent; never overrides on the same topic.
- Between same-tier skills, the narrower scope wins unless it weakens a mandatory check; then the stricter rule wins.
- Unresolved conflicts are blocking authority_conflict events in events.jsonl; the human judge resolves before execution continues.
- Each engagement snapshots loaded skill names + versions + content hashes at start; mid-engagement edits don't apply until the next phase (or judge approval, ledgered).
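The precedence chain lends itself to a small resolver. This sketch covers only the ordering, the zero authority of frontmatter, and the blocking conflict; the narrower-scope and stricter-rule tie-breaks are deliberately left out, and the `resolve` function and its return shape are hypothetical.

```python
PRECEDENCE = [
    "CLAUDE.md",
    "explicit judge decision",
    "criteria.md",
    "PROTOCOL skills",
    "METHODOLOGY skills",
    "agent body",
    "frontmatter",
]


def resolve(claims):
    """claims: list of (source, rule) pairs that speak to the same topic."""
    # Frontmatter declares what to load and carries no behavioral authority.
    ranked = [(PRECEDENCE.index(src), src, rule) for src, rule in claims if src != "frontmatter"]
    if not ranked:
        return {"status": "silent"}
    top = min(rank for rank, _, _ in ranked)
    winners = sorted({(src, rule) for rank, src, rule in ranked if rank == top})
    if len({rule for _, rule in winners}) > 1:
        # Same-rank disagreement: block until the human judge resolves it.
        return {"status": "authority_conflict", "source": winners[0][0]}
    source, rule = winners[0]
    return {"status": "resolved", "source": source, "rule": rule}
```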
The engagement directory IS the audit trail. The picture can be
reconstructed via cat + a single ledger replay:
| Source | What it shows |
|---|---|
| iteration (plain-text counter) | How many iterations occurred |
| validation-log.md | Which validators ran, with what verdict |
| validation-outputs/*.json | JSON proof-of-runs (raw + canonical envelope) |
| consilium-summary.md | What the consilium found on M/L |
| human-directive.md | What the human decided |
| acceptance-log.md | Append-only history of all manager verdicts |
| engagement-reflections.md | Per-engagement reflections feeding SkillOpt |
| executor-reports/*.md | What each specialist did |
| events.jsonl | Append-only event ledger: phases, validators, interrupts, verdicts, reflections, authority conflicts |
The system is inspectable, debuggable, diff-able in git, and works
without additional infrastructure. The event ledger is the primary
observability surface — read via scripts/lib/ledger.py API or by
parsing the JSONL directly.
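Parsing the ledger directly needs nothing beyond the standard library. The sketch below does not reproduce the scripts/lib/ledger.py API; it assumes one JSON object per line and a hypothetical `event` key holding the event type.

```python
import json
from collections import Counter


def replay(lines):
    """Parse JSONL lines into event dicts, skipping blank lines."""
    events = []
    for line in lines:
        line = line.strip()
        if line:
            events.append(json.loads(line))
    return events


def summarize(events):
    """Count events per type; 'event' is an assumed field name."""
    return Counter(e.get("event", "unknown") for e in events)
```

In practice the lines would come from opening events.jsonl with `encoding="utf-8"` and iterating over the file object.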
| Anti-pattern | Blocking mechanism |
|---|---|
| Validator sweep theatre — sweep-style «re-run everything» | Re-runs only point-targeted, with explicit justification through an adversary finding |
| Phantom claims in handoff — references to non-existent files | handoff-paths-check.py + skeptic validator |
| Manager-rewrite of authorial artefacts | The manager only writes the verdict + reflections; handoff content is off-limits |
| Director writing edits to skills directly | Codex proposes, director judges; the director never writes edits itself |
| Out-of-whitelist files in engagement/ | ls engagement/ is checked against the whitelist; any extraneous file = REJECT |
| Format-first validation | Validators check root causes; format is split out into a separate mechanical layer; the canonical envelope enforces verdict / severity normalization |
| Vector-only communication | Findings are passed in full text; dedup is via textual similarity |
| Silent rule drift mid-engagement | Authority invariant rule 7: engagement snapshots loaded skills at start |
| Coordinator layer for 1-specialist work | No separate coordinator layer — the lead plans waves; the engagement-workflow fans specialists out directly by task.owner |
| Reflection bloat | Per-engagement reflection strict constraint: target = skill/agent rule, class ∈ {rule_missing, rule_wrong, rule_ignored}. Zero reflections is valid. |
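The whitelist row in the table above can be illustrated with a pattern check. The whitelist contents here are assembled from artefact names mentioned in this document and are an assumption; the real whitelist may differ.

```python
from fnmatch import fnmatch

# Hypothetical whitelist built from artefacts named in this document.
WHITELIST = [
    "criteria.md",
    "plan.md",
    "handoff.md",
    "iteration",
    "validation-log.md",
    "validation-outputs/*.json",
    "consilium-summary.md",
    "human-directive.md",
    "acceptance-log.md",
    "engagement-reflections.md",
    "executor-reports/*.md",
    "events.jsonl",
]


def extraneous(paths, whitelist=WHITELIST):
    """Return engagement-relative paths that match no whitelist pattern."""
    return sorted(p for p in paths if not any(fnmatch(p, pat) for pat in whitelist))


def verdict(paths):
    """Any extraneous file means REJECT."""
    return "REJECT" if extraneous(paths) else "OK"
```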
The single entry point for agency work is a trigger phrase in chat. Both English and Russian are recognized:
agency task: <description>
or
мне надо агенси задачу <description>
Standalone capabilities have separate triggers:
- мне надо провести исследование / benchmark research: invokes the benchmark-research skill (industry reverse-engineering).
- прогнать skill-evolution / skill evolution cycle: invokes the matching domain director to run the SkillOpt cycle on accumulated signals.
Add or adjust phrasings in the agency-intake skill's Use when:
list to match your team's vocabulary.
The system then runs the engagement autonomously through all layers:
sequenceDiagram
autonumber
participant U as Human
participant ML as Main loop · agency-intake
participant WF as engagement-workflow · Workflow
participant SP as Specialists
participant V as Validators
participant SC as LangGraph + scripts
participant ADV as Adversary subprocess
participant M as Manager · acceptor
participant FS as engagement/
U->>ML: trigger phrase
ML->>FS: classify → criteria.md (S/M/L)
ML->>WF: invoke engagement-workflow
rect rgb(245, 245, 250)
Note over WF,V: Pre-gate cascade (Workflow) — stops at the seam
WF->>WF: discovery · lead:plan → plan.md (tasks / waves / validators)
WF->>SP: deliver — specialist waves in git worktrees (review→rework)
SP->>FS: executor-reports/ + consolidated work
WF->>V: validate — validators in parallel + adversarial-verify
V->>FS: validation-outputs/*.json (raw + canonical envelope)
WF->>FS: handoff.md
WF->>SC: gate · handoff-precheck.py
SC-->>WF: exit 0 / fail
end
WF-->>ML: readyForAcceptance — handoff seam
alt M/L tier
rect rgb(245, 250, 245)
Note over SC,ADV: Adversary phase (LangGraph human-gate)
ML->>SC: adversary_lg.py --consilium {M|L} --interrupt
Note over ADV,FS: events.jsonl: consilium_started / consilium_role_completed per role
ADV->>FS: validation-outputs/{role}-*-preliminary.json
ADV->>FS: validation-outputs/{role}-*-final.json
SC->>FS: consilium-synth.py → consilium-summary.md
SC->>U: consilium-present.py (chat summary)
end
rect rgb(250, 248, 240)
Note over U,M: Human + Manager phase
U->>SC: PROCEED / REJECT / DIRECTED
SC->>FS: human-directive.py → human-directive.md
ML->>M: invoke {domain}-manager (judge mode)
M->>FS: acceptance-log.md per directive
M->>FS: engagement-reflections.md (0–3 actionable)
M->>SC: director-verdict-check.py
end
else S tier
Note over U: Human glance — accept / reject directly
end
ML->>FS: engagement-archive.py — idempotent archival (on ACCEPT)
S-tier skips adversary, consilium, and manager phase: producer self-attests, mechanical checks gate, the human accepts directly.
M-tier adds 1× peer-opus adversary, a manager-judge phase, and 0–3 post-verdict reflections.
L-tier adds the full 5-role consilium, cross-family adjudication in the manager verdict, and tasks INDEX as a required artefact.
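The tier differences can be summarized as a lookup table. The `TierProfile` structure is hypothetical; the values restate the three tier descriptions above and the precheck check-counts from the v0.4 note (S=6 / M=13 / L=21).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierProfile:
    adversary_roles: int            # 0 = no adversary phase
    manager_phase: bool
    max_reflections: int
    cross_family_adjudication: bool
    tasks_index_required: bool
    precheck_checks: int


TIERS = {
    "S": TierProfile(0, False, 0, False, False, 6),
    "M": TierProfile(1, True, 3, False, False, 13),
    "L": TierProfile(5, True, 3, True, True, 21),
}


def phases_after_seam(tier):
    """List the post-seam phases a tier goes through, in order."""
    profile = TIERS[tier]
    phases = []
    if profile.adversary_roles:
        phases.append("adversary")
    phases.append("human")
    if profile.manager_phase:
        phases.append("manager")
    return phases
```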
The director (*-director) is not part of this flow — it runs the
SkillOpt loop out-of-band when ≥3 same-class signals accumulate.