damionrashford/htsa

How to Solve Anything

A structured root cause analysis (RCA) framework combining the 5 Ws and 5 Whys.
Probabilistic, formally proven, and domain-agnostic — engineering incidents, medicine, security, business, legal.

SOMETHING HAPPENED → TO SOMEONE → SOMEWHERE → AT SOME POINT → FOR SOME REASON

Not a template. An algorithm.
Bayesian probability at every node. Causal inference at every edge. LLM-assisted or fully manual.

📋 Framework   ·   🧮 The Math   ·   🔬 Proofs   ·   📊 Diagrams   ·   📖 Docs   ·   📚 Research


💡 The Core Insight

Every incident, outage, failure, or decision has the same anatomy.
The vocabulary changes. The causal structure never does.

The 5 Ws tell you what happened.
The 5 Whys tell you why it happened.
Together they tell you what to fix — and in what order.

HTSA replaces the blank postmortem doc and the sticky-note 5 Whys session
with a formal algorithm: DAG traversal + Bayesian updating + causal counterfactual tests.


🔍 The Four Layers

Layer 1 — Situation Map (5 Ws)

Establish the full picture before drilling into cause.

| Question | What It Captures |
| --- | --- |
| Who | The actor, subject, or stakeholder involved |
| What | The event, problem, or incident |
| When | The timeline: before, during, and after |
| Where | The location, system, environment, or context |
| Why | The surface-level, immediately apparent reason |
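
As a rough illustration, the situation map can be held in a small record that reports which of the five questions are still unanswered. This is a minimal sketch, not the `htsa_engine` API: the `SituationMap` class and its methods are hypothetical names, and the field values are borrowed from the engine example later in this README.

```python
from dataclasses import dataclass, fields


@dataclass
class SituationMap:
    """Hypothetical container for the 5 Ws (not part of htsa_engine)."""
    who: str = ""
    what: str = ""
    when: str = ""
    where: str = ""
    why: str = ""

    def missing(self) -> list[str]:
        """Names of the questions that have no answer yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_complete(self) -> bool:
        """True once every one of the 5 Ws has been filled in."""
        return not self.missing()


situation = SituationMap(who="Users", what="500 errors", when="2:47 AM")
print(situation.missing())      # questions still open

situation.where = "EU-west"
situation.why = "Load spike"
print(situation.is_complete())  # safe to start the 5 Whys
```

Checking completeness before moving on is one way to enforce "map before you drill" mechanically rather than by convention.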

Layer 2 — Causal Chain (5 Whys)

Start at the surface Why. Ask why again.
Keep going until you hit something you can actually change.

Why (surface)
  └─► Why 1
        └─► Why 2
              └─► Why 3
                    └─► Why 4
                          └─► Why 5: ROOT CAUSE

Whys can and should branch. Real problems are rarely single-cause.
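
A branching chain is naturally a tree (or, when causes are shared, a DAG). The sketch below assumes a plain tree and uses hypothetical names (`WhyNode`, `root_causes`) and made-up causes; it only illustrates how branching Whys lead to several root causes, each with its own path back to the surface.

```python
from dataclasses import dataclass, field


@dataclass
class WhyNode:
    """One answer to a "why?" question; children are deeper answers."""
    statement: str
    actionable: bool = False  # True once we reach something we can change
    children: list["WhyNode"] = field(default_factory=list)

    def add(self, statement: str, actionable: bool = False) -> "WhyNode":
        """Attach a deeper Why beneath this node and return it."""
        child = WhyNode(statement, actionable)
        self.children.append(child)
        return child


def root_causes(node: WhyNode, path: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    """Depth-first walk; return the full path to every actionable leaf."""
    path = path + (node.statement,)
    if node.actionable and not node.children:
        return [path]
    found: list[tuple[str, ...]] = []
    for child in node.children:
        found.extend(root_causes(child, path))
    return found


# Illustrative incident with two independent branches.
surface = WhyNode("API returns 500 errors")
load = surface.add("Servers run out of memory under load")
load.add("Cache never evicts entries", actionable=True)
deploy = surface.add("Bad config reached production")
deploy.add("No config validation in CI", actionable=True)

paths = root_causes(surface)
for p in paths:
    print(" -> ".join(p))
```

Note that the walk stops at actionable leaves rather than at a fixed depth, which matches the rule that 5 is a heuristic.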

Layer 3 — Resolution

Map each root cause to a concrete change. Apply the counterfactual test:

"If this change had existed before the problem occurred,
would the problem still have happened?"

Each root cause is either fixed, mitigated, or accepted.
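
The counterfactual test can be made concrete by replaying the observed facts with the fix forced in. The following is a toy sketch under stated assumptions: the structural rule in `incident_occurs` and the variable names are invented for illustration, and this is not the HP2015 test implemented by the engine.

```python
def incident_occurs(state: dict[str, bool]) -> bool:
    """Toy structural model (an assumption for this example):
    the outage needs a load spike AND (a memory leak OR no autoscaling)."""
    return state["load_spike"] and (state["memory_leak"] or not state["autoscaling"])


def counterfactual_test(state: dict[str, bool], fix: dict[str, bool]) -> bool:
    """Return True if the incident would NOT have happened with the fix in place."""
    world = {**state, **fix}  # same facts, with the fix overriding what it changes
    return not incident_occurs(world)


observed = {"load_spike": True, "memory_leak": True, "autoscaling": False}

leak_only = counterfactual_test(observed, {"memory_leak": False})
both = counterfactual_test(observed, {"memory_leak": False, "autoscaling": True})
print(leak_only, both)
```

In this toy model, patching the leak alone fails the test because missing autoscaling would still have caused the outage, which is the signal to go deeper or follow the other branch.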

Layer 4 — Verification and Learning

Confirm the fix worked. Update your priors.
The framework compounds over time — but only if learning is explicit.
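
One way to make "update your priors" explicit is a direct application of Bayes' rule over the competing causes. The function name, hypotheses, and all numbers below are illustrative assumptions, not values or calls from `htsa_engine`.

```python
def bayes_update(priors: dict[str, float], likelihoods: dict[str, float]) -> dict[str, float]:
    """Posterior P(h | e) is proportional to P(e | h) * P(h),
    normalized over all hypotheses."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence is impossible under every hypothesis")
    return {h: v / total for h, v in unnormalized.items()}


# Beliefs before the new evidence (made-up numbers).
priors = {"memory_leak": 0.6, "bad_config": 0.3, "network": 0.1}

# P(evidence | hypothesis) for one observation, e.g. steady heap growth.
likelihoods = {"memory_leak": 0.9, "bad_config": 0.2, "network": 0.1}

posterior = bayes_update(priors, likelihoods)
for hypothesis, p in sorted(posterior.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{hypothesis}: {p:.3f}")
```

Feeding the posterior from one investigation in as the prior for the next is what lets the learning compound.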


🌐 Works Everywhere

| Domain | Who | What | When | Where | Why |
| --- | --- | --- | --- | --- | --- |
| ⚙️ SRE / Engineering | System / Team | Outage, incident | Incident timeline | Service / Component | Alert or error |
| 🏥 Medicine | Patient | Diagnosis | Onset | Body system | Presenting symptom |
| 🔒 Security | Threat actor | Breach | Attack window | Vulnerability | Attack vector |
| 📈 Business | Team / Process | Bottleneck | Quarter | Department | Stated reason |
| ⚖️ Legal | Defendant | Act | Date | Jurisdiction | Motive |
| 🧠 Personal | You | Decision | Moment | Context | Emotion |

Each domain uses the same algorithm. The math is domain-agnostic.


🧮 The Math

The framework is an applied graph traversal algorithm for causal inference —
with probability weighting, entropy reduction, and Bayesian evidence updating at every node.

| # | Concept | What It Answers |
| --- | --- | --- |
| 01 | Graph Theory | What is the structure of an investigation? |
| 02 | Exponential Problem Space | Why do investigations feel overwhelming? |
| 03 | Causal Inference | How do you prove something caused something else? |
| 04 | Information Theory | How do you measure investigative progress? |
| 05 | Bayesian Reasoning | How do you weigh competing causes? |
| 06 | Search Algorithms | How do you move through the Why tree? |
| 07 | Cognitive Biases | What corrupts the investigation? |
| 08 | Evidence Evaluation | How do you know which evidence to trust? |
| 09 | Causation Theory | How do you classify and quantify actual causes? |
| 10 | Intervention Theory | How do you find the minimal set of fixes? |
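
To give a feel for the information-theory row, investigative progress can be measured as the drop in Shannon entropy over the competing hypotheses. This is a minimal sketch with invented distributions; `entropy_bits` is a hypothetical helper, not an engine function.

```python
import math


def entropy_bits(dist: dict[str, float]) -> float:
    """Shannon entropy in bits; zero-probability hypotheses contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Four equally likely causes: maximum uncertainty for four options (2 bits).
before = {"memory_leak": 0.25, "bad_config": 0.25, "network": 0.25, "dns": 0.25}

# After some evidence, belief concentrates on one cause (made-up numbers).
after = {"memory_leak": 0.7, "bad_config": 0.1, "network": 0.1, "dns": 0.1}

gain = entropy_bits(before) - entropy_bits(after)
print(f"uncertainty before: {entropy_bits(before):.2f} bits")
print(f"information gained: {gain:.2f} bits")
```

A piece of evidence that leaves the distribution unchanged yields zero gain, which is a simple signal that the investigation is not progressing.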

📏 Rules

Map before you drill. Complete the 5 Ws before starting the 5 Whys.

Evidence at every node. An assertion without evidence is a guess — tier your evidence.

Branch when reality branches. Real incidents have multiple root causes. Follow all of them.

5 is a heuristic, not a rule. Stop when you reach something you can actually change.

The counterfactual test closes the loop. If the fix had existed, would the incident still have happened? If yes, go deeper.

The framework is recursive. A root cause can become a new incident. Run it again.


⚙️ Engine

The framework is also available as a Python library (v2.0.0) with built-in LLM integration.
Works with any provider — OpenAI, Anthropic, Groq, Mistral, Ollama, or any OpenAI-compatible endpoint.

```shell
cd engine && uv run python -c "from htsa_engine import Investigation; print('ready')"
```

Auto-investigate with any LLM — one call, all 4 layers:

```python
from htsa_engine.llm import LLMAdvisor

advisor = LLMAdvisor("https://api.openai.com/v1", api_key="sk-...", model="gpt-4o")
inv = advisor.run("API returning 500 errors since 2:47 AM, EU region only")

print(inv.root_causes)
inv.save("investigation.json")
```

Or drive it manually — full control over every decision:

```python
from htsa_engine import Investigation, Evidence, EvidenceTier, EvidenceDirection

inv = Investigation(title="API 500 errors", pruning_threshold=0.05)
inv.set_situation(
    who_affected="Users",
    what="500 errors",
    when_during="2:47 AM",
    where="EU-west",
    why_surface="Load spike",
)
inv.complete_situation()
origin = inv.start_causal_chain("Server errors under load")
branch = inv.add_hypothesis(origin, "Memory leak", probability=0.6)
# ... add evidence, mark root cause, resolve, verify
inv.save("investigation.json")
```

New in v2: causation analysis, to quantify and prioritize root causes:

```python
# HP2015 + NESS three-stage counterfactual test
result = inv.run_hp2015_test(branch, origin)
print(result.is_root_cause, result.w_partition)

# Probability of Necessity and Sufficiency
pns = inv.compute_pns(branch, pn=0.8, ps=0.7)
print(pns.causation_type)  # "single_root_cause" | "and_node" | "or_node"

# Find the smallest set of fixes that achieves 90% coverage
intervention = inv.compute_minimal_intervention_set(theta=0.90)
print(intervention.minimal_set, intervention.coverage)

# Evidence budget — how many Tier-1 evidence items are needed?
budget = inv.evidence_budget(branch, alternative_posteriors={"other_node": 0.3})
print(budget.n_required, budget.is_indistinguishable)
```
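
For intuition about the minimal intervention set, here is a standalone sketch that does not call the engine. It assumes each fix addresses exactly one root cause and that contributions are independent and additive, in which case taking the largest contributions first yields the smallest set that reaches the threshold. The function, cause names, and weights are all hypothetical, and the engine's own algorithm may differ.

```python
def minimal_intervention_set(
    contributions: dict[str, float], theta: float
) -> tuple[list[str], float]:
    """Pick causes by descending contribution until coverage reaches theta.

    Assumes additive, non-overlapping contributions (an assumption of this
    sketch). Returns every cause if theta cannot be reached.
    """
    chosen: list[str] = []
    coverage = 0.0
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    for cause, weight in ranked:
        if coverage >= theta:
            break
        chosen.append(cause)
        coverage += weight
    return chosen, coverage


# Share of the incident attributed to each root cause (made-up numbers).
contributions = {"noisy_alert": 0.05, "memory_leak": 0.55, "no_autoscaling": 0.40}

fixes, coverage = minimal_intervention_set(contributions, theta=0.90)
print(fixes, round(coverage, 2))
```

When contributions overlap or interact, this greedy shortcut no longer guarantees a minimal set, so treat it as an approximation in that case.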

Full engine documentation →



🔍 How HTSA Differs

| | HTSA | PyRCA / BARO | DoWhy | Postmortem templates |
| --- | --- | --- | --- | --- |
| Approach | Structured algorithm | ML / metrics | Statistical | Blank form |
| Input | Any problem | Prometheus metrics | Data frames | Text |
| Causal proof | HP2015 + NESS + PNS | Correlation-based | do-calculus | None |
| Works without data | Yes | No | No | Yes |
| Cross-domain | Yes | AIOps only | Research only | Yes |
| LLM integration | Built-in | No | No | No |

MIT Licensed · Contributions welcome

About

Structured investigation method: 5 Ws + 5 Whys with Bayesian reasoning, causal DAG traversal, and formal proofs. For any domain — engineering incidents, business failures, medical diagnosis, personal decisions.
