For non-agent technical work, the classic shape still applies. We don't reinvent it; we add the "AI-shaped" considerations that often go missing.
What we're solving and why now.
Both.
Diagram + prose.
At least two. Each with one-line "why not."
Surfaces — RPC, REST, library, CLI, SDK.
Schema, retention, isolation.
What happens when things go wrong.
Latency, throughput, cost.
Threat model, AuthN/AuthZ, secrets, PII.
Traces, metrics, logs, alerts.
Stages, kill switch, rollback.
Honest list.
When the system uses or is used by LLMs, also include:
- Provider strategy: which provider(s), which fallbacks, why
- Cost ceilings: per-request, per-user, daily total
- Prompt versioning: where prompts live, who can change them, eval gates
- Memory model: where it lives, how it's curated, isolation
- Eval plan: regression / safety / quality thresholds
- Injection surface: where untrusted content enters, how it's sanitized
If you skip any of these, expect to retrofit them after an incident.
Aim for 2–6 pages. Longer = the design probably needs to be smaller.
/templates/DESIGN_DOC.md.templateadr_guide.md— record the decisions the design produces