Top-level docs for the sre-on-call project.
- Deployment — build images → push to ECR → terraform apply → secret hydration.
- Slack app setup — create the Slack app (manifest or manual), scopes, events, triggers.
- Testing — synthetic webhook + real Slack alert procedures.
- Architecture diagram — generated from architecture.d2.

| Agent | Purpose |
|---|---|
| Master | Orchestrates investigations across specialized agents, routes/synthesizes, enforces deadlines, posts the Incident Report. |
| Slack Scanner | Scans Slack channel history for correlated alerts within an investigation window. |
| CloudWatch Logs | Discovers real log groups, then queries AWS CloudWatch Logs Insights around the incident. |
| EKS | Gathers Kubernetes cluster state (pods, events, logs, node conditions). |
| Incident History | Finds similar past incidents via embedding similarity search. |
| Discord Scanner | Scans Discord channel history. Checked in to the repo, but not enabled in config.yaml for this deployment. |

See CONTEXT.md at the repo root for the canonical term definitions (AlertContext, Finding, AgentResult, ToolResult, WebhookAdapter, ChatPoster, ReportRenderer, ChannelMessageSource).
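
The Master agent row in the table describes fanning out to specialized agents and enforcing deadlines. A minimal sketch of that pattern, assuming the agents are async callables, might look like the following. The `AgentResult` fields, the function names `run_with_deadline` and `investigate`, and the stand-in agents are all hypothetical and are not taken from the project code; the real definitions live in CONTEXT.md and the source.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    # Hypothetical shape for illustration; see CONTEXT.md for the real one.
    agent: str
    findings: list[str] = field(default_factory=list)
    timed_out: bool = False


async def run_with_deadline(name, agent, alert, deadline_s):
    """Run one agent; report a timeout as a result instead of raising."""
    try:
        findings = await asyncio.wait_for(agent(alert), timeout=deadline_s)
        return AgentResult(agent=name, findings=findings)
    except asyncio.TimeoutError:
        return AgentResult(agent=name, timed_out=True)


async def investigate(alert, agents, deadline_s=30.0):
    """Fan out to every agent concurrently; results keep the input order."""
    tasks = [
        run_with_deadline(name, agent, alert, deadline_s)
        for name, agent in agents.items()
    ]
    return await asyncio.gather(*tasks)


# Stand-in agents (assumptions, not the project's real agents).
async def fast_agent(alert):
    return [f"saw {alert}"]


async def slow_agent(alert):
    await asyncio.sleep(1.0)
    return ["too late"]


if __name__ == "__main__":
    agents = {"eks": fast_agent, "logs": slow_agent}
    for result in asyncio.run(investigate("HighErrorRate", agents, deadline_s=0.05)):
        print(result)
```

Returning a timed-out result rather than raising lets the orchestrator still synthesize a report from the agents that did finish, which is one plausible reading of "enforces deadlines" above.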