Current state
What this is
We ran opensre against the Cloud-OpsBench test suite (#2074) and read the agent code. This issue tracks the changes we want to make to opensre as a result.
The full analysis is here: link to the #2074 summary comment
The rule
Every change is to opensre itself — the product. We are not changing the test to look better. If a change only helps the score and not a real user, it doesn't belong here.
What the benchmark showed (short version)
- opensre often does extra work: repeats the same lookups, runs more laps than needed, and only stops when it gives up or hits the 20-lap limit.
- On big cases it throws away its best evidence first when space runs low.
- For the service category, more thinking doesn't help; it needs memory of past incidents.
- The right answer is often its second guess.
A few terms:
- lap = one round-trip to the AI model.
- a1 = how often it gets the whole answer right.
- Seed evidence = the first batch of trustworthy data it pulls automatically.
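To make the first two findings concrete, here is a minimal sketch of what "don't repeat the same lookup" and "don't throw away the best evidence first" could look like. It is an illustration only, not the opensre code: `LookupCache`, `Evidence`, `trim_evidence`, the `score` and `size` fields, and the `fetch` callback are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Evidence:
    text: str
    score: float  # hypothetical relevance score; higher means more valuable
    size: int     # hypothetical cost against the space budget (e.g. tokens)


@dataclass
class LookupCache:
    """Remembers lookup results within one run so identical calls are not repeated."""
    fetch: Callable[[str, str], str]  # hypothetical: (tool, query) -> result
    calls: int = 0                    # number of lookups that actually ran
    _seen: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def lookup(self, tool: str, query: str) -> str:
        key = (tool, query)
        if key not in self._seen:
            # First time we see this exact lookup: run it and remember the result.
            self.calls += 1
            self._seen[key] = self.fetch(tool, query)
        return self._seen[key]


def trim_evidence(items: List[Evidence], budget: int) -> List[Evidence]:
    """Keep the highest-scoring evidence that fits the budget, in original order."""
    kept = set()
    used = 0
    # Visit items from most to least valuable so the best evidence is kept first.
    for idx in sorted(range(len(items)), key=lambda i: items[i].score, reverse=True):
        if used + items[idx].size <= budget:
            kept.add(idx)
            used += items[idx].size
    return [item for i, item in enumerate(items) if i in kept]
```

With this shape, a second identical `lookup("logs", "pod-a")` costs nothing, and when space runs low the lowest-scoring items are dropped first. How evidence should actually be scored is an open question this sketch does not answer.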
How we prove each change helped
Two gates, every time:
- On vs. off. Run the benchmark with the change on and off, on the same cases. Keep it only if the target improved (fewer laps, lower cost, or higher accuracy) and nothing important got worse — especially accuracy.
- Held-out for anything that learns (memory, reranking, wording). Test on cases it hasn't seen, and write down the expected result before running — so we measure real improvement, not memorization, and can't fool ourselves after.
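The on vs. off gate can be written down as a small check. This is a sketch under assumptions, not an existing opensre or Cloud-OpsBench API: `RunStats`, `passes_on_off_gate`, and `a1_tolerance` are hypothetical names, and "nothing important got worse" is simplified here to "a1 did not drop".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    """Summary of one benchmark run over a fixed set of cases (hypothetical shape)."""
    a1: float         # fraction of cases where the whole answer was right
    mean_laps: float  # average round-trips to the AI model per case
    cost: float       # total cost of the run, in any consistent unit


def passes_on_off_gate(off: RunStats, on: RunStats, target: str,
                       a1_tolerance: float = 0.0) -> bool:
    """Return True if the change improved its target and accuracy did not get worse.

    `off` and `on` must come from the same cases. `target` names what the change
    was meant to improve: "laps", "cost", or "accuracy".
    """
    if target not in {"laps", "cost", "accuracy"}:
        raise ValueError(f"unknown target: {target!r}")
    improved = {
        "laps": on.mean_laps < off.mean_laps,
        "cost": on.cost < off.cost,
        "accuracy": on.a1 > off.a1,
    }[target]
    # Assumption: accuracy is the guard metric; allow an optional noise tolerance.
    accuracy_held = on.a1 >= off.a1 - a1_tolerance
    return improved and accuracy_held
```

For the held-out gate, the same check would be run on cases the change has not seen, with the expected result written down before the run.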
The work (in order)
Phase 0 — prerequisite
Phase 1 — quick, safe, high value
Phase 2 — next
Phase 3 — bigger, later
Bottom line: most wasted time is doing the same work twice and not knowing when
to stop. Most wrong answers are deleted evidence or a wording / second-guess
problem. None of these need a smarter model — just a smarter loop.
Desired state
Improved opensre