Every subagent must report its status in a standard format. No silent failures. No ambiguous outcomes. If it ran, it reports.
Category: Observability | Complexity: Low | Prerequisite: Task Envelope
You spawn three subagents in parallel. Two return results. The third... nothing. Did it succeed? Is it still running? Did it crash? Is it hallucinating a response? You don't know, and now you can't safely proceed because the downstream task depends on all three outputs.
Silent failures are the #1 operational problem in multi-agent systems.
Every subagent must emit a structured status event when it finishes (or when it gets stuck):
COMPLETED {task_id}: completed (output: {file_path}, {size})
FAILED {task_id}: FAILED (error: {error_message})
RUNNING {task_id}: running (step {n}/{total}: {description})
BLOCKED {task_id}: blocked (reason: {reason})
| Status | Meaning | Orchestrator Action |
|---|---|---|
completed |
Task done, output available | Validate output against acceptance criteria |
FAILED |
Task failed, no usable output | Pause downstream tasks, decide retry/escalate |
running |
Task in progress, checkpoint available | Wait (or timeout if too long) |
blocked |
Task cannot proceed, needs input | Resolve blocker or escalate |
Without structured events, this happens:
- Agent A fails to fetch API data (gets a 401 error)
- Agent A returns nothing, or returns an empty file
- Agent B reads the empty file and writes an analysis anyway -- full of hallucinated data
- The orchestrator sees a plausible-looking analysis and accepts it
With structured events:
- Agent A emits:
FAILED api-fetch: FAILED (error: 401 Unauthorized) - Orchestrator pauses Agent B (which depends on Agent A's output)
- Orchestrator decides: fix auth and retry, or escalate
The structured event broke the cascade at step 2 instead of letting bad data flow downstream.
Subagent writes its status as the last line of its output file:
## Output: research-api
[... actual analysis content ...]
---
STATUS: completed | output: artifacts/research-api.md | 2.3KBOr for failures:
## Output: research-api
STATUS: FAILED | error: API endpoint returned 401, token may be expired{
"task_id": "research-api",
"status": "FAILED",
"error": "API endpoint returned 401",
"error_type": "auth",
"attempts": 1,
"timestamp": "2026-03-15T10:30:00Z"
}When you receive an event:
| Event | Action |
|---|---|
completed |
Check acceptance criteria. If pass → accept. If fail → retry (up to 2x). |
FAILED |
Log error. Pause dependent tasks. Classify error type (see Error Handling). Decide: retry, restart, or escalate. |
running |
If within timeout → wait. If past timeout → nudge or kill. |
blocked |
Read the reason. If you can resolve it → resolve and resume. If not → escalate. |
| No event at all | This is the worst case. After timeout, assume failure. Kill the agent, write a failure event yourself, proceed with error handling. |
This is a mandatory practice, not a suggestion:
Any agent that performs an operation must send a clear completion notification immediately upon finishing. Not embedded in a long analysis. Not after a follow-up question. Immediately.
| Format | Example |
|---|---|
| Success | "Task research-api completed. Output: artifacts/research-api.md" |
| Failure | "Task research-api FAILED: API returned 401, credentials may be expired" |
In practice, agents tend to:
- Complete a task
- Start analyzing the result
- Write a long response about the analysis
- Mention completion somewhere in paragraph 4
The human (or orchestrator) then has to read the entire response to figure out if the task actually succeeded. A separate, immediate notification eliminates this.
See also: Error Event Template, Circuit Breaker for what happens after repeated failures.