Commit 1eb1cd9

final CLR discovery docs
1 parent 1582950 commit 1eb1cd9

2 files changed

Lines changed: 352 additions & 0 deletions

Lines changed: 266 additions & 0 deletions
@@ -0,0 +1,266 @@
# Closed-Loop Resolution (CLR) for Test Automation

Automated bug detection, triage, RCA, and fix verification for CI and hardware validation infrastructure.

## Scope

**In scope (MVP):** Automated test failures from CI / hardware validation
**Deferred:** Manual debugging with Claude Code / other agents

### Why Automated Tests First

| Automated | Manual |
|-----------|--------|
| Structured output (test name, assertion, stack trace, exit code) | Vague symptoms ("feels slow") |
| High signal: a failure usually points to a real bug | Requires triage |
| Repeatable: a re-run verifies the fix | Human verifies |
| Already captured in CI logs | New capture mechanism needed |

## MVP Requirements

### 1. Triage

- **Trigger on every failure** from the scaled test automation platform
- **Search Jira for existing issues:** match each new failure against open bugs
- **Cluster and deduplicate** related failures (same root cause, different symptoms)
- **Identify failure level:**
  - Test code itself
  - Internal test support Python wheel
  - Platform/OS
  - DUT firmware
- **Every unique bug needs a Jira issue:** create one if no match is found

### 2. RCA

- Generate ranked hypotheses for each failure
- Query graph for matching symptoms / code areas / known root causes
- Skip known-bad approaches recorded in prior `Fix` nodes with `result: failed`

### 3. Test Fix Hypotheses

- Execute fixes via **new MCP test harness** (to be built)
- Re-run tests to verify
- Record hypothesis status (confirmed/disproved) with evidence

### 4. Integration

- Update Jira issue with findings (link to graph, hypotheses tried, evidence)
- Raise PRs for passing fixes; **always human review** (no auto-merge)

## CLR Flow

```
Test Fails (scaled test automation platform)

Extract symptom (test name, assertion, stack trace, exit code, env)

Search Jira for existing issues (JQL + LLM rerank)

If match → link failure to existing issue
If no match → create new Jira issue

Triage: cluster with related failures, identify failure level

Query graph for matches (symptom, code area, known root causes)

Generate ranked hypotheses (not fixes yet)

For each hypothesis (priority order):
├─ Record Hypothesis node with reasoning
├─ Generate Fix based on hypothesis
├─ Apply fix, re-run test via MCP harness
├─ If pass → mark Hypothesis CONFIRMED, record Fix
└─ If fail → mark Hypothesis DISPROVED with evidence, continue

All hypotheses exhausted? → escalate, record as Unresolved

Update Jira with findings, raise PR for confirmed fix (human review)
```

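The per-hypothesis loop in the flow above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `run_hypotheses`, `Outcome`, and the `apply_and_rerun` callable (standing in for the MCP harness, which does not exist yet) are hypothetical names, and ranking by `confidence` is an assumption about how priority is derived.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Hypothesis:
    reasoning: str
    confidence: float          # 0-1, mirrors the Hypothesis node property
    priority: int = 0          # order tried, filled in by the loop
    status: str = "pending"    # pending | confirmed | disproved
    evidence: str = ""


@dataclass
class Outcome:
    confirmed: Optional[Hypothesis] = None
    tried: List[Hypothesis] = field(default_factory=list)

    @property
    def unresolved(self) -> bool:
        # True when every hypothesis was disproved (escalation case).
        return self.confirmed is None


def run_hypotheses(
    hypotheses: List[Hypothesis],
    apply_and_rerun: Callable[[Hypothesis], Tuple[bool, str]],
) -> Outcome:
    """Try hypotheses in priority order until one fix makes the test pass.

    apply_and_rerun is a hypothetical stand-in for "generate fix, apply it,
    re-run the test"; it returns (test_passed, evidence).
    """
    # Assumption: higher confidence is tried first.
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    outcome = Outcome()
    for order, hyp in enumerate(ranked, start=1):
        hyp.priority = order
        passed, evidence = apply_and_rerun(hyp)
        hyp.evidence = evidence
        hyp.status = "confirmed" if passed else "disproved"
        outcome.tried.append(hyp)  # disproved hypotheses are kept, not dropped
        if passed:
            outcome.confirmed = hyp
            break
    return outcome
```

Disproved hypotheses stay in `tried` with their evidence, so they can be written to the graph alongside the confirmed one; an `Outcome` with `unresolved == True` corresponds to the escalation step.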
## Why Hypotheses Matter

When a test fails, the agent reasons through possibilities:

1. "Could be X because of evidence A"
2. "Or Y because of evidence B"
3. Tries X → fails → now knows X was wrong *for this symptom pattern*
4. Tries Y → works → now knows Y was right

Recording only the final fix loses:
- Why X seemed plausible (might help for a different symptom)
- Why X failed (avoid repeating for similar symptoms)
- The reasoning that led to Y (replicate for similar bugs)

## Graph Structure

### Node Types

```
TestFailure     - Captured failure from CI/hardware validation
FailureCluster  - Group of related TestFailures (same root cause)
Hypothesis      - "I think X causes this because Y"
Fix             - Attempted solution (passed or failed)
RootCause       - Confirmed underlying cause
FailureLevel    - Where the bug lives (test/wheel/platform/firmware)
CodeArea        - File/module/function where bugs cluster
```

### Relationships

```
TestFailure -[:CLUSTERED_IN]-> FailureCluster
TestFailure -[:AT_LEVEL]-> FailureLevel
TestFailure -[:HAS_HYPOTHESIS]-> Hypothesis
Hypothesis -[:PRODUCED]-> Fix
Hypothesis -[:INFORMED_BY]-> Hypothesis (prior disproved → led to this)
Fix -[:RESOLVES]-> TestFailure
Fix -[:VALIDATES]-> Hypothesis
Fix -[:CONFIRMS]-> RootCause
RootCause -[:AFFECTS]-> CodeArea
FailureCluster -[:SHARES_ROOT_CAUSE]-> RootCause
```

### Example

```
TestFailure (test_auth_login_empty_email)
├─ AT_LEVEL → FailureLevel (test-support-wheel)
├─ CLUSTERED_IN → FailureCluster (auth-validation-nulls)

├─ HAS_HYPOTHESIS → Hypothesis (priority: 1)
│   ├─ reasoning: "Stack shows null in validate_email, likely missing null check"
│   ├─ confidence: 0.8
│   ├─ PRODUCED → Fix (attempt 1)
│   │   ├─ result: failed
│   │   └─ evidence: "Same assertion, different line"
│   └─ status: disproved

├─ HAS_HYPOTHESIS → Hypothesis (priority: 2)
│   ├─ reasoning: "Empty string passes null check but fails downstream regex"
│   ├─ confidence: 0.6
│   ├─ INFORMED_BY → Hypothesis (priority: 1)
│   ├─ PRODUCED → Fix (attempt 2)
│   │   ├─ result: passed
│   │   └─ commit: <sha>
│   └─ status: confirmed

└─ RESOLVED_BY → Fix (attempt 2)   (inverse view of Fix -[:RESOLVES]-> TestFailure)
    ├─ VALIDATES → Hypothesis (priority: 2)
    └─ CONFIRMS → RootCause (empty-string-vs-null)
```

### Node Properties

```
TestFailure
  testName: string
  assertion: string
  stackFingerprint: hash (normalized, line-numbers stripped)
  commit: sha
  branch: ref
  env: ci-runner-id | hardware-rig-id
  timestamp: datetime

Hypothesis
  reasoning: text
  confidence: float (0-1)
  priority: int (order tried)
  status: pending | confirmed | disproved
  evidence: text (why confirmed/disproved)

Fix
  diff: patch
  result: passed | failed
  commit: sha (if merged)
  prUrl: url (if raised)

RootCause
  description: text
  codeArea: path

FailureLevel
  level: enum (test-code | test-support-wheel | platform-os | dut-firmware)
```

## What This Enables

**Learning from failures:**
- "Which hypotheses were disproved for null-related errors in auth/?" → "Empty string vs null is a common miss"

**Improving reasoning:**
- "What is the confirmation rate for high-confidence hypotheses?" → calibration feedback

**Pattern recognition:**
- "When hypothesis A fails, what usually works?" → "Regex issues follow null checks 70% of the time in this area"

**Triage acceleration:**
- "New failure in auth/ at test-support-wheel level" → auto-cluster with similar failures

199+
## Symptom Matching
200+
201+
### Approaches (Hybrid)
202+
203+
1. **Structured extraction at capture**`errorType`, `codeArea`, `failureLevel` as indexed properties
204+
2. **Stack fingerprinting** — Normalized hash (strip line numbers, paths) for exact-match fast path
205+
3. **Embedding on description** — Semantic similarity for behavioral symptoms
206+
4. **Reranker** — LLM filters false positives from top-N candidates
207+
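A stack fingerprint along the lines of approach 2 might look like the sketch below. It assumes CPython-style tracebacks (`File "...", line N, in func`); the function name `stack_fingerprint`, the choice of SHA-256, and keeping only the file basename plus function name are illustrative assumptions, not decisions recorded in this document.

```python
import hashlib
import re

# Matches one frame line of a CPython traceback.
_FRAME = re.compile(r'File "(?P<path>[^"]+)", line \d+, in (?P<func>\S+)')


def stack_fingerprint(trace: str) -> str:
    """Hash a traceback after dropping line numbers and directory prefixes."""
    frames = []
    for match in _FRAME.finditer(trace):
        # Keep only the file's basename so runner-specific paths do not matter.
        filename = re.split(r"[\\/]", match.group("path"))[-1]
        frames.append(f"{filename}:{match.group('func')}")
    stripped = trace.strip()
    last_line = stripped.splitlines()[-1] if stripped else ""
    # Keep the exception type but not its message, which often embeds values.
    exc_type = last_line.split(":", 1)[0].strip()
    normalized = "\n".join(frames + [exc_type])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two failures that differ only in checkout path or shifted line numbers hash to the same value, which gives the exact-match fast path; anything that misses falls through to the embedding and reranker steps.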
### Design Decision

**Capture raw, extract automatically, allow correction.**

- Raw symptom stored immediately (zero friction)
- Background job extracts structured facets (best effort)
- Dashboard shows extractions with edit option (human refinement when needed)

Rationale:
- Capture friction kills adoption
- Extraction improves over time; raw data doesn't degrade
- The 80% case (stack traces) extracts well; the remaining 20% (vague descriptions) may need human help

## Jira Integration

### The Problem

Jira's native API searches with JQL, which is keyword-based rather than semantic. Finding "similar" issues requires custom work.

### Approaches in the Wild

1. **Marketplace apps:** [Find Duplicates](https://marketplace.atlassian.com/apps/1212706/find-duplicates-detect-similar-issues-find-related-issues), [Duplicate AI](https://marketplace.atlassian.com/apps/1224971/duplicate-ai-find-merge-duplicate-issues) use ML for similarity scoring
2. **Custom NLP:** [JIRA-Similar-Issue-Finder-App](https://github.com/bhavul/JIRA-Similar-Issue-Finder-App) trains an ML model and comments related IDs
3. **Vector databases:** Milvus, Pinecone for semantic search + duplicate detection
4. **Research:** [GitBugs dataset](https://arxiv.org/html/2504.09651) (150k+ bug reports), [recent work](https://arxiv.org/abs/2504.14797) on automated duplicate detection

### CLR Approach

**JQL broad search + LLM rerank:**

1. Query Jira via `/rest/api/3/search/jql` with loose JQL (project, component, date range, status)
2. Fetch bulk JSON results (summary, description, labels, components)
3. Pass candidates + new failure symptom to LLM for similarity scoring
4. Threshold determines match vs new issue

This avoids Jira plugin dependencies and uses our own LLM for consistency with the rest of CLR.

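Steps 3 and 4 reduce to "score every candidate, keep the best one if it clears a threshold". The sketch below shows that decision only; `pick_match` is a hypothetical helper, the `0.75` default threshold is a placeholder rather than a tuned value, and `token_overlap` is a deliberately crude stand-in for the LLM scorer so the example runs without a model.

```python
from typing import Callable, Iterable, Optional, Tuple


def token_overlap(a: str, b: str) -> float:
    """Stand-in scorer (Jaccard overlap of lowercase tokens), NOT the LLM."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pick_match(
    symptom: str,
    candidates: Iterable[Tuple[str, str]],   # (issue key, issue text)
    score: Callable[[str, str], float],      # similarity in [0, 1]
    threshold: float = 0.75,                 # placeholder, needs tuning
) -> Optional[str]:
    """Return the best-scoring issue key, or None if nothing clears the threshold."""
    best_key, best_score = None, -1.0
    for key, text in candidates:
        current = score(symptom, text)
        if current > best_score:
            best_key, best_score = key, current
    return best_key if best_score >= threshold else None
```

A return value of `None` corresponds to the "create new Jira issue" branch; any other value is the existing issue to link the failure to.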
### Jira API Notes

- Legacy `/rest/api/3/search` deprecated → use `/rest/api/3/search/jql`
- Pagination via `nextPageToken` (not `startAt`)
- POST for large JQL queries (JSON body)
- Requires API token auth

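Token-based pagination as described in the notes above can be sketched as a generator. The HTTP call is injected as a hypothetical `post(path, body)` callable so the example stays free of network and auth details; the `jql`, `fields`, `maxResults`, and `nextPageToken` body keys and the `issues` response key reflect our reading of the endpoint and should be checked against the current Jira Cloud API reference before use.

```python
from typing import Callable, Iterator, List, Optional

SEARCH_PATH = "/rest/api/3/search/jql"


def iter_issues(
    post: Callable[[str, dict], dict],  # hypothetical: sends POST, returns parsed JSON
    jql: str,
    fields: List[str],
    page_size: int = 50,
) -> Iterator[dict]:
    """Yield issues page by page, following nextPageToken until it is absent."""
    token: Optional[str] = None
    while True:
        body = {"jql": jql, "fields": fields, "maxResults": page_size}
        if token:
            body["nextPageToken"] = token
        page = post(SEARCH_PATH, body)
        yield from page.get("issues", [])
        token = page.get("nextPageToken")
        if not token:
            break
```

Because the generator is lazy, a caller that only needs the top-N candidates for reranking can stop iterating early and avoid fetching further pages.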
## Integration Points

- **Scaled test automation platform** → receives failure, triggers CLR
- **MCP test harness** → execute fixes, re-run tests (new, to be built)
- **Git** → apply fix, push branch
- **Jira API** → search existing issues, create/update issues
- **PR API** → raise PRs for confirmed fixes (human review required)
- **Graph (Neo4j)** → store all nodes/relationships

## Open Questions

1. **Hardware validation specifics:** What does a "test failure" look like? Serial logs? Exit codes?
2. **Flaky tests:** Require N consecutive failures before triggering CLR?
3. **Jira field mapping:** Which fields to search/populate? Labels, components, custom fields?
4. **MCP harness scope:** Which test frameworks/runners to support initially?

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
# When a Graph Doesn't Help

A graph adds overhead. When does it not pay off?

## Anti-Patterns

### 1. Low-Volume, High-Turnover Bugs

If bugs:
- Are fixed quickly (hours, not days)
- Rarely recur
- Don't share root causes

There is no relationship structure to exploit. A flat Jira list works fine. Graph value comes from traversing connections: no connections, no value.

**Example:** Typos, obvious null checks, config mistakes. Fix once, never see again.

### 2. Isolated, Unrelated Failures

If each bug is truly independent:
- Different code areas
- Different root causes
- No clustering pattern

Then `SHARES_ROOT_CAUSE`, `CLUSTERED_IN`, and `INFORMED_BY` edges never form. The graph just stores nodes with no edges, which makes it a worse database.

**Example:** Random integration failures across unrelated services with no common dependency.

### 3. Deterministic, Single-Cause Failures

If the failure → cause → fix chain is 1:1:1 with no ambiguity:
- One symptom maps to one cause
- One cause has one fix
- No hypothesis exploration needed

Hypothesis tracking overhead is wasted. Just log "test X failed, applied fix Y, done."

**Example:** Version mismatch errors. Fix is always "update version." No reasoning chain to capture.

### 4. High Noise-to-Signal Ratio

If most failures are:
- Flaky tests (environment, timing)
- Infrastructure issues (network, disk)
- Not real bugs

The graph fills with noise. Every flaky test creates nodes that pollute similarity searches, and more time goes into filtering garbage than finding patterns.

**Example:** Test suite with 30% flake rate. Graph becomes a flaky test cemetery.

### 5. Short-Lived Codebases

If:
- The code changes radically every few months
- Historical patterns don't predict future bugs
- "What worked before" is irrelevant

The graph's historical knowledge decays faster than it accumulates. By the time you have useful patterns, the code has moved on.

**Example:** Rapid prototyping, throwaway projects, major rewrites.

## When the Graph Adds Value

The inverse:

| Graph Helps | Graph Doesn't Help |
|-------------|-------------------|
| Recurring bugs | Fix-once bugs |
| Shared root causes | Isolated failures |
| Multi-hypothesis RCA | Obvious single cause |
| Stable codebase | Rapid churn |
| Low flake rate | High noise |
| Long-lived projects | Throwaway code |

## Implication for MVP

**Filter what enters the graph.** Not every `TestFailure` deserves a node.

Criteria for graph-worthy failures:
- Passed flake detection (N consecutive failures, or deterministic repro)
- Not infrastructure/environment (or tagged as such, kept separate)
- In stable code areas (not actively being rewritten)

Options:
1. Add `graphWorthy: boolean` property during triage
2. Defer graph insertion until failure proves interesting (recurs, shares cause, etc.)
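
Option 1 could be computed during triage with a predicate like the one below. This is a sketch only: `TriageResult`, its field names, and the default of three consecutive failures are assumptions made for illustration, since N and the way "stable code area" is determined are still open.

```python
from dataclasses import dataclass


@dataclass
class TriageResult:
    consecutive_failures: int     # how many runs in a row have failed
    deterministic_repro: bool     # failure reproduces on demand
    is_infrastructure: bool       # network, disk, runner, or environment issue
    code_area_in_rewrite: bool    # area flagged as actively being rewritten


def is_graph_worthy(result: TriageResult, min_consecutive: int = 3) -> bool:
    """Apply the three criteria above; min_consecutive is a placeholder for N."""
    passed_flake_detection = (
        result.deterministic_repro
        or result.consecutive_failures >= min_consecutive
    )
    return (
        passed_flake_detection
        and not result.is_infrastructure
        and not result.code_area_in_rewrite
    )
```

The same predicate also fits option 2: instead of setting a property at triage time, it can be re-evaluated when a failure recurs, and the node inserted only once it returns `True`.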
