A structured root cause analysis (RCA) framework combining the 5 Ws and 5 Whys.
Probabilistic, formally proven, and domain-agnostic — engineering incidents, medicine, security, business, legal.
SOMETHING HAPPENED → TO SOMEONE → SOMEWHERE → AT SOME POINT → FOR SOME REASON
Not a template. An algorithm.
Bayesian probability at every node. Causal inference at every edge. LLM-assisted or fully manual.
Every incident, outage, failure, or decision has the same anatomy.
The vocabulary changes. The causal structure never does.
The 5 Ws tell you what happened.
The 5 Whys tell you why it happened.
Together they tell you what to fix — and in what order.
HTSA replaces the blank postmortem doc and the sticky-note 5 Whys session
with a formal algorithm: DAG traversal + Bayesian updating + causal counterfactual tests.
Establish the full picture before drilling into cause.
| Question | What It Captures |
|---|---|
| Who | The actor, subject, or stakeholder involved |
| What | The event, problem, or incident |
| When | The timeline — before, during, and after |
| Where | The location, system, environment, or context |
| Why | The surface-level, immediately apparent reason |
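
One way to enforce "map before you drill" is to record the five answers in a small structure and refuse to start the causal chain while any are missing. The sketch below is illustrative only: the `Situation` class and its field names are hypothetical and are not the `htsa_engine` API.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Situation:
    """Hypothetical container for the 5 Ws (not the htsa_engine API)."""
    who: Optional[str] = None
    what: Optional[str] = None
    when: Optional[str] = None
    where: Optional[str] = None
    why_surface: Optional[str] = None

    def missing(self) -> list[str]:
        """Names of the Ws that are still unanswered, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_complete(self) -> bool:
        """True once every W has a non-empty answer."""
        return not self.missing()


situation = Situation(who="EU users", what="HTTP 500 errors", when="since 02:47")
print(situation.missing())      # ['where', 'why_surface']
print(situation.is_complete())  # False
```

Listing the gaps by name turns an incomplete picture into a concrete to-do list before any Why is asked.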
Start at the surface Why. Ask why again.
Keep going until you hit something you can actually change.
```
Why (surface)
└─► Why 1
    └─► Why 2
        └─► Why 3
            └─► Why 4
                └─► Why 5: ROOT CAUSE
```
Whys can and should branch. Real problems are rarely single-cause.
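
A branching Why chain is naturally a tree. The following minimal sketch, with a hypothetical `WhyNode` class and made-up incident claims, walks each branch until it reaches something actionable and collects those nodes as root causes.

```python
from dataclasses import dataclass, field


@dataclass
class WhyNode:
    """One answer to a 'why?' question; children are the deeper whys."""
    claim: str
    actionable: bool = False  # can someone actually change this?
    children: list["WhyNode"] = field(default_factory=list)

    def ask_why(self, claim: str, actionable: bool = False) -> "WhyNode":
        """Attach a deeper explanation and return it so chains can continue."""
        child = WhyNode(claim, actionable)
        self.children.append(child)
        return child


def root_causes(node: WhyNode) -> list[str]:
    """Depth-first walk that stops at the first actionable node on each path."""
    if node.actionable:
        return [node.claim]
    found: list[str] = []
    for child in node.children:
        found.extend(root_causes(child))
    return found


# Illustrative incident with two independent branches.
surface = WhyNode("API returns 500 errors")
load = surface.ask_why("Servers ran out of memory under load")
load.ask_why("Cache has no eviction policy", actionable=True)
deploy = surface.ask_why("Bad config reached production")
deploy.ask_why("Config changes skip review", actionable=True)

print(root_causes(surface))
# ['Cache has no eviction policy', 'Config changes skip review']
```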
Map each root cause to a concrete change. Apply the counterfactual test:
"If this change had existed before the problem occurred,
would the problem still have happened?"
Each root cause is either fixed, mitigated, or accepted.
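
The counterfactual test can be made mechanical once the incident is written as a function of its conditions. The sketch below assumes a toy boolean causal model. The `incident` function and the condition names are invented for illustration and do not come from the framework's own code.

```python
from typing import Callable, Dict

Conditions = Dict[str, bool]


def incident(c: Conditions) -> bool:
    """Toy causal model (an assumption): the outage needs a memory leak AND
    a missing memory limit, OR a bad deploy on its own."""
    return (c["memory_leak"] and c["no_memory_limit"]) or c["bad_deploy"]


def counterfactual_test(
    model: Callable[[Conditions], bool],
    observed: Conditions,
    fix: Conditions,
) -> bool:
    """Return True if the incident would NOT have happened with the fix in place."""
    world = {**observed, **fix}  # replay the observed world with the fix applied
    return not model(world)


observed = {"memory_leak": True, "no_memory_limit": True, "bad_deploy": False}
assert incident(observed)  # the model reproduces the real incident

# Adding a memory limit would have prevented the outage.
print(counterfactual_test(incident, observed, {"no_memory_limit": False}))  # True
# "Fixing" deploys changes nothing here, so this fix fails the test.
print(counterfactual_test(incident, observed, {"bad_deploy": False}))       # False
```

A fix that fails the test has not reached a root cause for this incident, which is the signal to go deeper.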
Confirm the fix worked. Update your priors.
The framework compounds over time — but only if learning is explicit.
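
"Update your priors" can be read literally as an application of Bayes' rule. The sketch below is a generic update over competing hypotheses. The function name, hypotheses, and numbers are illustrative assumptions, not values or calls from the library.

```python
def bayes_update(
    priors: dict[str, float], likelihoods: dict[str, float]
) -> dict[str, float]:
    """Posterior is proportional to prior * P(evidence | hypothesis),
    normalized over all hypotheses."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence is impossible under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


# Beliefs before looking at the heap graph (illustrative numbers).
priors = {"memory_leak": 0.6, "bad_deploy": 0.3, "dns_failure": 0.1}
# P(heap usage grows steadily | hypothesis), also illustrative.
likelihoods = {"memory_leak": 0.9, "bad_deploy": 0.2, "dns_failure": 0.05}

posterior = bayes_update(priors, likelihoods)
for hypothesis, p in posterior.items():
    print(f"{hypothesis}: {p:.3f}")
```

The posterior from one investigation becomes the prior for the next, which is how explicit learning compounds.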
| Domain | Who | What | When | Where | Why |
|---|---|---|---|---|---|
| ⚙️ SRE / Engineering | System / Team | Outage, incident | Incident timeline | Service / Component | Alert or error |
| 🏥 Medicine | Patient | Diagnosis | Onset | Body system | Presenting symptom |
| 🔒 Security | Threat actor | Breach | Attack window | Vulnerability | Attack vector |
| 📈 Business | Team / Process | Bottleneck | Quarter | Department | Stated reason |
| ⚖️ Legal | Defendant | Act | Date | Jurisdiction | Motive |
| 🧠 Personal | You | Decision | Moment | Context | Emotion |
Each domain uses the same algorithm. The math is domain-agnostic.
The framework is an applied graph traversal algorithm for causal inference —
with probability weighting, entropy reduction, and Bayesian evidence updating at every node.
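
Entropy reduction gives a number for investigative progress: the fewer bits of uncertainty left across the candidate causes, the closer the investigation is to done. A minimal sketch using Shannon entropy follows. The hypothesis names and probabilities are made up for illustration.

```python
import math


def entropy_bits(dist: dict[str, float]) -> float:
    """Shannon entropy in bits of a distribution over hypotheses."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Four equally likely causes before any evidence is collected.
before = {"memory_leak": 0.25, "bad_deploy": 0.25, "dns_failure": 0.25, "disk_full": 0.25}
# One piece of evidence rules out two of them.
after = {"memory_leak": 0.5, "bad_deploy": 0.5, "dns_failure": 0.0, "disk_full": 0.0}

print(entropy_bits(before))                        # 2.0
print(entropy_bits(after))                         # 1.0
print(entropy_bits(before) - entropy_bits(after))  # 1.0 bit of information gained
```
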
| # | Concept | What It Answers |
|---|---|---|
| 01 | Graph Theory | What is the structure of an investigation? |
| 02 | Exponential Problem Space | Why do investigations feel overwhelming? |
| 03 | Causal Inference | How do you prove something caused something else? |
| 04 | Information Theory | How do you measure investigative progress? |
| 05 | Bayesian Reasoning | How do you weigh competing causes? |
| 06 | Search Algorithms | How do you move through the Why tree? |
| 07 | Cognitive Biases | What corrupts the investigation? |
| 08 | Evidence Evaluation | How do you know which evidence to trust? |
| 09 | Causation Theory | How do you classify and quantify actual causes? |
| 10 | Intervention Theory | How do you find the minimal set of fixes? |
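
To make the last row concrete: finding a minimal set of fixes can be framed as a small coverage problem. The sketch below searches exhaustively by increasing set size, which is only practical for a handful of candidate fixes. The function, fix names, and weights are hypothetical, and this is not necessarily how the library computes its result.

```python
from itertools import combinations
from typing import Optional


def minimal_intervention_set(
    fixes: dict[str, set[str]],
    weights: dict[str, float],
    theta: float,
) -> Optional[tuple[set[str], float]]:
    """Smallest set of fixes whose covered root-cause weight reaches theta.

    fixes maps each fix to the root causes it addresses; weights maps each
    root cause to its share of responsibility. Returns None if even applying
    every fix falls short of theta.
    """
    names = sorted(fixes)
    for size in range(len(names) + 1):
        best = None
        for combo in combinations(names, size):
            covered = set().union(*(fixes[f] for f in combo))
            coverage = sum(weights[c] for c in covered)
            # Among sets of the same size, prefer the highest coverage.
            if coverage >= theta and (best is None or coverage > best[1]):
                best = (set(combo), coverage)
        if best is not None:
            return best
    return None


weights = {"no_eviction": 0.5, "no_review": 0.3, "no_alert": 0.2}
fixes = {
    "add_lru_eviction": {"no_eviction"},
    "require_config_review": {"no_review"},
    "add_memory_alert": {"no_alert"},
    "canary_deploys": {"no_review", "no_alert"},
}

chosen, coverage = minimal_intervention_set(fixes, weights, theta=0.90)
print(sorted(chosen), round(coverage, 2))
# ['add_lru_eviction', 'canary_deploys'] 1.0
```

For larger sets of candidate fixes, a greedy set-cover heuristic would be the usual trade of optimality for speed.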
Map before you drill. Complete the 5 Ws before starting the 5 Whys.
Evidence at every node. An assertion without evidence is a guess — tier your evidence.
Branch when reality branches. Real incidents have multiple root causes. Follow all of them.
5 is a heuristic, not a rule. Stop when you reach something you can actually change.
The counterfactual test closes the loop. If the fix had existed, would the incident still have happened? If yes, go deeper.
The framework is recursive. A root cause can become a new incident. Run it again.
The framework is also available as a Python library (v2.0.0) with built-in LLM integration.
Works with any provider — OpenAI, Anthropic, Groq, Mistral, Ollama, or any OpenAI-compatible endpoint.
```shell
cd engine && uv run python -c "from htsa_engine import Investigation; print('ready')"
```

Auto-investigate with any LLM (one call, all 4 layers):

```python
from htsa_engine.llm import LLMAdvisor

advisor = LLMAdvisor("https://api.openai.com/v1", api_key="sk-...", model="gpt-4o")
inv = advisor.run("API returning 500 errors since 2:47 AM, EU region only")
print(inv.root_causes)
inv.save("investigation.json")
```

Or drive it manually, with full control over every decision:
```python
from htsa_engine import Investigation, Evidence, EvidenceTier, EvidenceDirection

inv = Investigation(title="API 500 errors", pruning_threshold=0.05)
inv.set_situation(
    who_affected="Users",
    what="500 errors",
    when_during="2:47 AM",
    where="EU-west",
    why_surface="Load spike",
)
inv.complete_situation()

origin = inv.start_causal_chain("Server errors under load")
branch = inv.add_hypothesis(origin, "Memory leak", probability=0.6)
# ... add evidence, mark root cause, resolve, verify
inv.save("investigation.json")
```

New in v2, causation analysis quantifies and prioritizes root causes:
```python
# HP2015 + NESS three-stage counterfactual test
result = inv.run_hp2015_test(branch, origin)
print(result.is_root_cause, result.w_partition)

# Probability of Necessity and Sufficiency
pns = inv.compute_pns(branch, pn=0.8, ps=0.7)
print(pns.causation_type)  # "single_root_cause" | "and_node" | "or_node"

# Find the smallest set of fixes that achieves 90% coverage
intervention = inv.compute_minimal_intervention_set(theta=0.90)
print(intervention.minimal_set, intervention.coverage)

# Evidence budget: how many Tier-1 evidence items are needed?
budget = inv.evidence_budget(branch, alternative_posteriors={"other_node": 0.3})
print(budget.n_required, budget.is_indistinguishable)
```

| | HTSA | PyRCA / BARO | DoWhy | Postmortem templates |
|---|---|---|---|---|
| Approach | Structured algorithm | ML / metrics | Statistical | Blank form |
| Input | Any problem | Prometheus metrics | Data frames | Text |
| Causal proof | HP2015 + NESS + PNS | Correlation-based | do-calculus | None |
| Works without data | Yes | No | No | Yes |
| Cross-domain | Yes | AIOps only | Research only | Yes |
| LLM integration | Built-in | No | No | No |