| 1 | +# Closed-Loop Resolution (CLR) for Test Automation |
| 2 | + |
| 3 | +Automated bug detection, triage, RCA, and fix verification for CI and hardware validation infrastructure. |
| 4 | + |
| 5 | +## Scope |
| 6 | + |
| 7 | +- **In scope (MVP):** Automated test failures from CI / hardware validation |
| 8 | +- **Deferred:** Manual debugging with Claude Code / other agents |
| 9 | + |
| 10 | +### Why Automated Tests First |
| 11 | + |
| 12 | +| Automated | Manual | |
| 13 | +|-----------|--------| |
| 14 | +| Structured output (test name, assertion, stack trace, exit code) | Vague symptoms ("feels slow") | |
| 15 | +| High signal: a failure usually indicates a real defect (flaky tests aside) | Requires triage |
| 16 | +| Repeatable — re-run verifies fix | Human verifies | |
| 17 | +| Already captured in CI logs | New capture mechanism needed | |
| 18 | + |
| 19 | +## MVP Requirements |
| 20 | + |
| 21 | +### 1. Triage |
| 22 | + |
| 23 | +- **Trigger on every failure** from the scaled test automation platform |
| 24 | +- **Search Jira for existing issues** — match new failure against open bugs |
| 25 | +- **Cluster & deduplicate** related failures (same root cause, different symptoms) |
| 26 | +- **Identify failure level:** |
| 27 | + - Test code itself |
| 28 | + - Internal test support Python wheel |
| 29 | + - Platform/OS |
| 30 | + - DUT firmware |
| 31 | +- **All unique bugs need a Jira issue** — create if no match found |
| 32 | + |
| 33 | +### 2. RCA |
| 34 | + |
| 35 | +- Generate ranked hypotheses for each failure |
| 36 | +- Query graph for matching symptoms / code areas / known root causes |
| 37 | +- Skip known-bad approaches recorded in prior `Fix` nodes with `result: failed` |
| 38 | + |
| 39 | +### 3. Test Fix Hypotheses |
| 40 | + |
| 41 | +- Execute fixes via **new MCP test harness** (to be built) |
| 42 | +- Re-run tests to verify |
| 43 | +- Record hypothesis status (confirmed/disproved) with evidence |
| 44 | + |
| 45 | +### 4. Integration |
| 46 | + |
| 47 | +- Update Jira issue with findings (link to graph, hypotheses tried, evidence) |
| 48 | +- Raise PRs for passing fixes — **always human review** (no auto-merge) |
| 49 | + |
| 50 | +## CLR Flow |
| 51 | + |
| 52 | +``` |
| 53 | +Test Fails (scaled test automation platform) |
| 54 | + ↓ |
| 55 | +Extract symptom (test name, assertion, stack trace, exit code, env) |
| 56 | + ↓ |
| 57 | +Search Jira for existing issues (JQL + LLM rerank) |
| 58 | + ↓ |
| 59 | +If match → link failure to existing issue |
| 60 | +If no match → create new Jira issue |
| 61 | + ↓ |
| 62 | +Triage: cluster with related failures, identify failure level |
| 63 | + ↓ |
| 64 | +Query graph for matches (symptom, code area, known root causes) |
| 65 | + ↓ |
| 66 | +Generate ranked hypotheses (not fixes yet) |
| 67 | + ↓ |
| 68 | +For each hypothesis (priority order): |
| 69 | + ├─ Record Hypothesis node with reasoning |
| 70 | + ├─ Generate Fix based on hypothesis |
| 71 | + ├─ Apply fix, re-run test via MCP harness |
| 72 | + ├─ If pass → mark Hypothesis CONFIRMED, record Fix |
| 73 | + └─ If fail → mark Hypothesis DISPROVED with evidence, continue |
| 74 | + ↓ |
| 75 | +All hypotheses exhausted with none confirmed? → escalate, record as Unresolved |
| 76 | + ↓ |
| 77 | +Update Jira with findings, raise PR for confirmed fix (human review) |
| 78 | +``` |
| 79 | + |
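The per-hypothesis loop in the flow above can be sketched as follows. This is a minimal illustration, not the CLR implementation: `try_fix` is a hypothetical stand-in for "generate fix, apply it, re-run the test via the MCP harness" (which is still to be built), and ranking by `confidence` is an assumption about how priority order is derived.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Hypothesis:
    reasoning: str
    confidence: float          # 0-1
    priority: int = 0          # order tried, assigned by the loop
    status: str = "pending"    # pending | confirmed | disproved
    evidence: str = ""


def run_hypothesis_loop(
    hypotheses: List[Hypothesis],
    try_fix: Callable[[Hypothesis], Tuple[bool, str]],
) -> Optional[Hypothesis]:
    """Try hypotheses in descending confidence; stop at the first confirmed one."""
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    for order, hyp in enumerate(ranked, start=1):
        hyp.priority = order
        passed, evidence = try_fix(hyp)  # hypothetical: apply fix + re-run test
        hyp.evidence = evidence
        if passed:
            hyp.status = "confirmed"
            return hyp
        hyp.status = "disproved"
    return None  # exhausted: caller escalates and records Unresolved


def fake_try_fix(hyp: Hypothesis) -> Tuple[bool, str]:
    """Stand-in for the harness: only the regex hypothesis makes the test pass."""
    if "regex" in hyp.reasoning:
        return True, "test passed after fix"
    return False, "Same assertion, different line"


hyps = [
    Hypothesis("Stack shows null in validate_email, likely missing null check", 0.8),
    Hypothesis("Empty string passes null check but fails downstream regex", 0.6),
]
winner = run_hypothesis_loop(hyps, fake_try_fix)
```

Disproved hypotheses keep their evidence, so the reasoning trail is preserved even when only the second attempt succeeds.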
| 80 | +## Why Hypotheses Matter |
| 81 | + |
| 82 | +When a test fails, the agent reasons through possibilities: |
| 83 | + |
| 84 | +1. "Could be X because of evidence A" |
| 85 | +2. "Or Y because of evidence B" |
| 86 | +3. Tries X → fails → now knows X was wrong *for this symptom pattern* |
| 87 | +4. Tries Y → works → now knows Y was right |
| 88 | + |
| 89 | +Recording only the final fix loses: |
| 90 | +- Why X seemed plausible (might help for different symptom) |
| 91 | +- Why X failed (avoid repeating for similar symptoms) |
| 92 | +- The reasoning that led to Y (replicate for similar bugs) |
| 93 | + |
| 94 | +## Graph Structure |
| 95 | + |
| 96 | +### Node Types |
| 97 | + |
| 98 | +``` |
| 99 | +TestFailure — Captured failure from CI/hardware validation |
| 100 | +FailureCluster — Group of related TestFailures (same root cause) |
| 101 | +Hypothesis — "I think X causes this because Y" |
| 102 | +Fix — Attempted solution (passed or failed) |
| 103 | +RootCause — Confirmed underlying cause |
| 104 | +FailureLevel — Where the bug lives (test/wheel/platform/firmware) |
| 105 | +CodeArea — File/module/function where bugs cluster |
| 106 | +``` |
| 107 | + |
| 108 | +### Relationships |
| 109 | + |
| 110 | +``` |
| 111 | +TestFailure -[:CLUSTERED_IN]-> FailureCluster |
| 112 | +TestFailure -[:AT_LEVEL]-> FailureLevel |
| 113 | +TestFailure -[:HAS_HYPOTHESIS]-> Hypothesis |
| 114 | +Hypothesis -[:PRODUCED]-> Fix |
| 115 | +Hypothesis -[:INFORMED_BY]-> Hypothesis (prior disproved → led to this) |
| 116 | +Fix -[:RESOLVES]-> TestFailure |
| 117 | +Fix -[:VALIDATES]-> Hypothesis |
| 118 | +Fix -[:CONFIRMS]-> RootCause |
| 119 | +RootCause -[:AFFECTS]-> CodeArea |
| 120 | +FailureCluster -[:SHARES_ROOT_CAUSE]-> RootCause |
| 121 | +``` |
| 122 | + |
| 123 | +### Example |
| 124 | + |
| 125 | +``` |
| 126 | +TestFailure (test_auth_login_empty_email) |
| 127 | + ├─ AT_LEVEL → FailureLevel (test-support-wheel) |
| 128 | + ├─ CLUSTERED_IN → FailureCluster (auth-validation-nulls) |
| 129 | + │ |
| 130 | + ├─ HAS_HYPOTHESIS → Hypothesis (priority: 1) |
| 131 | + │ ├─ reasoning: "Stack shows null in validate_email, likely missing null check" |
| 132 | + │ ├─ confidence: 0.8 |
| 133 | + │ ├─ PRODUCED → Fix (attempt 1) |
| 134 | + │ │ └─ result: failed |
| 135 | + │ │ └─ evidence: "Same assertion, different line" |
| 136 | + │ └─ status: disproved |
| 137 | + │ |
| 138 | + ├─ HAS_HYPOTHESIS → Hypothesis (priority: 2) |
| 139 | + │ ├─ reasoning: "Empty string passes null check but fails downstream regex" |
| 140 | + │ ├─ confidence: 0.6 |
| 141 | + │ ├─ INFORMED_BY → Hypothesis (priority: 1) |
| 142 | + │ ├─ PRODUCED → Fix (attempt 2) |
| 143 | + │ │ └─ result: passed |
| 144 | + │ │ └─ commit: <sha> |
| 145 | + │ └─ status: confirmed |
| 146 | + │ |
| 147 | + └─ RESOLVES ← Fix (attempt 2) |
| 148 | + └─ VALIDATES → Hypothesis (priority: 2) |
| 149 | + └─ CONFIRMS → RootCause (empty-string-vs-null) |
| 150 | +``` |
| 151 | + |
| 152 | +### Node Properties |
| 153 | + |
| 154 | +``` |
| 155 | +TestFailure |
| 156 | + testName: string |
| 157 | + assertion: string |
| 158 | + stackFingerprint: hash (normalized, line-numbers stripped) |
| 159 | + commit: sha |
| 160 | + branch: ref |
| 161 | + env: ci-runner-id | hardware-rig-id |
| 162 | + timestamp: datetime |
| 163 | + |
| 164 | +Hypothesis |
| 165 | + reasoning: text |
| 166 | + confidence: float (0-1) |
| 167 | + priority: int (order tried) |
| 168 | + status: pending | confirmed | disproved |
| 169 | + evidence: text (why confirmed/disproved) |
| 170 | + |
| 171 | +Fix |
| 172 | + diff: patch |
| 173 | + result: passed | failed |
| 174 | + commit: sha (if merged) |
| 175 | + prUrl: url (if raised) |
| 176 | + |
| 177 | +RootCause |
| 178 | + description: text |
| 179 | + codeArea: path |
| 180 | + |
| 181 | +FailureLevel |
| 182 | + level: enum (test-code | test-support-wheel | platform-os | dut-firmware) |
| 183 | +``` |
| 184 | + |
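As a rough sketch of how these nodes and properties might be read and written, the helpers below build parameterized Cypher for two operations: attaching a pending `Hypothesis` to a `TestFailure`, and listing failed fixes for a stack fingerprint (the "skip known-bad approaches" lookup). The function names are hypothetical, and the queries assume Neo4j labels and property names match the node types above exactly.

```python
from typing import Any, Dict, Tuple

Query = Tuple[str, Dict[str, Any]]


def record_hypothesis_query(test_name: str, commit: str, reasoning: str,
                            confidence: float, priority: int) -> Query:
    """Attach a pending Hypothesis to the TestFailure for this test and commit."""
    cypher = (
        "MATCH (f:TestFailure {testName: $testName, commit: $commit}) "
        "CREATE (f)-[:HAS_HYPOTHESIS]->(h:Hypothesis {"
        "reasoning: $reasoning, confidence: $confidence, "
        "priority: $priority, status: 'pending'}) "
        "RETURN h"
    )
    params = {
        "testName": test_name,
        "commit": commit,
        "reasoning": reasoning,
        "confidence": confidence,
        "priority": priority,
    }
    return cypher, params


def known_bad_fixes_query(stack_fingerprint: str) -> Query:
    """List failed fixes already tried for failures with the same fingerprint."""
    cypher = (
        "MATCH (f:TestFailure {stackFingerprint: $fingerprint})"
        "-[:HAS_HYPOTHESIS]->(h:Hypothesis {status: 'disproved'})"
        "-[:PRODUCED]->(x:Fix {result: 'failed'}) "
        "RETURN h.reasoning AS reasoning, h.evidence AS evidence, x.diff AS diff"
    )
    return cypher, {"fingerprint": stack_fingerprint}


query, params = known_bad_fixes_query("abc123")
```

With the official Neo4j Python driver, a `(query, params)` pair like this would typically be passed to `session.run(query, params)`. Keeping values in parameters rather than string-formatting them into the query avoids injection and lets the server reuse query plans.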
| 185 | +## What This Enables |
| 186 | + |
| 187 | +**Learning from failures:** |
| 188 | +- "Which hypotheses were disproved for null-related errors in auth/?" → "Empty string vs null is a common miss" |
| 189 | + |
| 190 | +**Improving reasoning:** |
| 191 | +- "Confirmation rate for high-confidence hypotheses?" → calibration feedback |
| 192 | + |
| 193 | +**Pattern recognition:** |
| 194 | +- "When hypothesis A fails, what usually works?" → e.g. "Regex issues follow null checks 70% of the time in this area" |
| 195 | + |
| 196 | +**Triage acceleration:** |
| 197 | +- "New failure in auth/ at test-support-wheel level" → auto-cluster with similar failures |
| 198 | + |
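The calibration question ("confirmation rate for high-confidence hypotheses?") reduces to bucketing resolved hypotheses by stated confidence and comparing against the observed confirmation rate. A minimal sketch, assuming the records have already been pulled from the graph as `(confidence, status)` pairs; the function name and the five-bucket split are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def confirmation_rate_by_bucket(
    records: Iterable[Tuple[float, str]], n_buckets: int = 5
) -> Dict[str, float]:
    """Map each confidence bucket to the fraction of hypotheses confirmed."""
    counts = defaultdict(lambda: [0, 0])  # bucket index -> [confirmed, total]
    for confidence, status in records:
        if status == "pending":
            continue  # only resolved hypotheses say anything about calibration
        idx = min(int(confidence * n_buckets), n_buckets - 1)
        counts[idx][1] += 1
        if status == "confirmed":
            counts[idx][0] += 1
    rates = {}
    for idx in sorted(counts):
        confirmed, total = counts[idx]
        label = f"{idx / n_buckets:.1f}-{(idx + 1) / n_buckets:.1f}"
        rates[label] = confirmed / total
    return rates


records = [
    (0.9, "confirmed"), (0.85, "disproved"), (0.95, "confirmed"),
    (0.3, "disproved"), (0.25, "confirmed"),
    (0.5, "pending"),
]
rates = confirmation_rate_by_bucket(records)
```

A well-calibrated agent would show confirmation rates that rise with the bucket; a high-confidence bucket with a low rate is a signal to adjust how confidence is assigned.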
| 199 | +## Symptom Matching |
| 200 | + |
| 201 | +### Approaches (Hybrid) |
| 202 | + |
| 203 | +1. **Structured extraction at capture** — `errorType`, `codeArea`, `failureLevel` as indexed properties |
| 204 | +2. **Stack fingerprinting** — Normalized hash (strip line numbers, paths) for exact-match fast path |
| 205 | +3. **Embedding on description** — Semantic similarity for behavioral symptoms |
| 206 | +4. **Reranker** — LLM filters false positives from top-N candidates |
| 207 | + |
| 208 | +### Design Decision |
| 209 | + |
| 210 | +**Capture raw, extract automatically, allow correction.** |
| 211 | + |
| 212 | +- Raw symptom stored immediately (zero friction) |
| 213 | +- Background job extracts structured facets (best effort) |
| 214 | +- Dashboard shows extractions with edit option (human refinement when needed) |
| 215 | + |
| 216 | +Rationale: |
| 217 | +- Capture friction kills adoption |
| 218 | +- Extraction improves over time; raw data doesn't degrade |
| 219 | +- 80% case (stack traces) extracts well; 20% (vague descriptions) may need human help |
| 220 | + |
| 221 | +## Jira Integration |
| 222 | + |
| 223 | +### The Problem |
| 224 | + |
| 225 | +Jira's native API uses JQL (keyword-based), not semantic search. Finding "similar" issues requires custom work. |
| 226 | + |
| 227 | +### Approaches in the Wild |
| 228 | + |
| 229 | +1. **Marketplace apps** — [Find Duplicates](https://marketplace.atlassian.com/apps/1212706/find-duplicates-detect-similar-issues-find-related-issues), [Duplicate AI](https://marketplace.atlassian.com/apps/1224971/duplicate-ai-find-merge-duplicate-issues) use ML for similarity scoring |
| 230 | +2. **Custom NLP** — [JIRA-Similar-Issue-Finder-App](https://github.com/bhavul/JIRA-Similar-Issue-Finder-App) trains ML model, comments related IDs |
| 231 | +3. **Vector databases** — Milvus, Pinecone for semantic search + duplicate detection |
| 232 | +4. **Research** — [GitBugs dataset](https://arxiv.org/html/2504.09651) (150k+ bug reports), [recent work](https://arxiv.org/abs/2504.14797) on automated duplicate detection |
| 233 | + |
| 234 | +### CLR Approach |
| 235 | + |
| 236 | +**JQL broad search + LLM rerank:** |
| 237 | + |
| 238 | +1. Query Jira via `/rest/api/3/search/jql` with loose JQL (project, component, date range, status) |
| 239 | +2. Fetch bulk JSON results (summary, description, labels, components) |
| 240 | +3. Pass candidates + new failure symptom to LLM for similarity scoring |
| 241 | +4. A similarity threshold decides whether the failure links to an existing issue or creates a new one |
| 242 | + |
| 243 | +This avoids Jira plugin dependencies and uses our own LLM for consistency with the rest of CLR. |
| 244 | + |
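Step 4 can be sketched as a small decision function over the reranker's output. The `(issue_key, score)` shape, the 0.75 default threshold, and the issue keys are assumptions for illustration; the actual threshold would need tuning against labeled duplicates.

```python
from typing import List, Optional, Tuple


def pick_match(scored: List[Tuple[str, float]], threshold: float = 0.75) -> Optional[str]:
    """Return the best-scoring issue key if it clears the threshold, else None."""
    if not scored:
        return None
    key, score = max(scored, key=lambda pair: pair[1])
    return key if score >= threshold else None


# Hypothetical similarity scores from the LLM rerank step
match = pick_match([("HWVAL-12", 0.91), ("HWVAL-7", 0.40)])
no_match = pick_match([("HWVAL-7", 0.40)])
```

A `None` result means no candidate was similar enough, so the flow creates a new Jira issue instead of linking to an existing one.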
| 245 | +### Jira API Notes |
| 246 | + |
| 247 | +- Legacy `/rest/api/3/search` deprecated → use `/rest/api/3/search/jql` |
| 248 | +- Pagination via `nextPageToken` (not `startAt`) |
| 249 | +- POST for large JQL queries (JSON body) |
| 250 | +- Requires authentication (on Jira Cloud, typically an API token sent via basic auth with the account email) |
| 251 | + |
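The token-based pagination noted above might be driven as in this sketch. The HTTP call is injected as a `post(path, body)` callable so the loop runs without a network; the request and response shapes (`jql`, `fields`, `maxResults`, `nextPageToken`, `issues`) follow the endpoint as described here and should be checked against the current Atlassian documentation. The project key `HWVAL` is hypothetical.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

SEARCH_PATH = "/rest/api/3/search/jql"
FIELDS = ["summary", "description", "labels", "components"]
JQL = "project = HWVAL AND statusCategory != Done AND created >= -30d"


def build_search_body(jql: str, fields: List[str],
                      next_page_token: Optional[str] = None,
                      max_results: int = 100) -> Dict[str, Any]:
    """Build the JSON body for a POST search request."""
    body: Dict[str, Any] = {"jql": jql, "fields": fields, "maxResults": max_results}
    if next_page_token:
        body["nextPageToken"] = next_page_token
    return body


def iter_candidates(post: Callable[[str, Dict[str, Any]], Dict[str, Any]],
                    jql: str, fields: List[str]) -> Iterator[Dict[str, Any]]:
    """Yield issues page by page until no nextPageToken is returned."""
    token = None
    while True:
        page = post(SEARCH_PATH, build_search_body(jql, fields, token))
        yield from page.get("issues", [])
        token = page.get("nextPageToken")
        if not token:
            break


# Fake transport standing in for an authenticated HTTP client
pages = {
    None: {"issues": [{"key": "HWVAL-1"}, {"key": "HWVAL-2"}], "nextPageToken": "t1"},
    "t1": {"issues": [{"key": "HWVAL-3"}]},
}
sent = []


def fake_post(path: str, body: Dict[str, Any]) -> Dict[str, Any]:
    sent.append((path, body))
    return pages[body.get("nextPageToken")]


keys = [issue["key"] for issue in iter_candidates(fake_post, JQL, FIELDS)]
```

Injecting the transport keeps auth and retry policy out of the pagination logic and makes the loop easy to test against recorded responses.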
| 252 | +## Integration Points |
| 253 | + |
| 254 | +- **Scaled test automation platform** → receives failure, triggers CLR |
| 255 | +- **MCP test harness** → execute fixes, re-run tests (new, to be built) |
| 256 | +- **Git** → apply fix, push branch |
| 257 | +- **Jira API** → search existing issues, create/update issues |
| 258 | +- **PR API** → raise PRs for confirmed fixes (human review required) |
| 259 | +- **Graph (Neo4j)** → store all nodes/relationships |
| 260 | + |
| 261 | +## Open Questions |
| 262 | + |
| 263 | +1. **Hardware validation specifics** — What does "test failure" look like? Serial logs? Exit codes? |
| 264 | +2. **Flaky tests** — Require N consecutive failures before triggering CLR? |
| 265 | +3. **Jira field mapping** — Which fields to search/populate? Labels, components, custom fields? |
| 266 | +4. **MCP harness scope** — What test frameworks/runners to support initially? |