[IMPROVEMENT] Improve opensre based on benchmark results #2903

@YauhenBichel

Description

Current state

What this is

We ran opensre against the Cloud-OpsBench test suite (#2074) and read the agent code. This issue tracks the changes we want to make to opensre as a result.

The full analysis is here: link to the #2074 summary comment

The rule

Every change is to opensre itself — the product. We are not changing the test to look better. If a change only helps the score and not a real user, it doesn't belong here.

What the benchmark showed (short version)

  • opensre often does extra work: repeats the same lookups, runs more laps than needed, and only stops when it gives up or hits the 20-lap limit.
  • On big cases it throws away its best evidence first when space runs low.
  • For the service category, more thinking doesn't help — it needs memory of past incidents.
  • The right answer is often its second guess.

A few terms:

  • lap = one round-trip to the AI model.
  • a1 = how often it gets the whole answer right.
  • Seed evidence = the first batch of trustworthy data it pulls automatically.

How we prove each change helped

Two gates, every time:

  1. On vs. off. Run the benchmark with the change on and off, on the same cases. Keep it only if the target improved (fewer laps, lower cost, or higher accuracy) and nothing important got worse — especially accuracy.
  2. Held-out for anything that learns (memory, reranking, wording). Test on cases it hasn't seen, and write down the expected result before running — so we measure real improvement, not memorization, and can't fool ourselves after.
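As a rough illustration of the first gate, the on vs. off comparison could be reduced to a small check like the one below. This is a sketch under assumptions: `RunSummary`, `passes_on_off_gate`, and the `accuracy_tolerance` parameter are hypothetical names, not part of opensre or Cloud-OpsBench, and the only "nothing got worse" guard it applies is accuracy (a1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Aggregate benchmark result for one configuration (hypothetical shape)."""
    accuracy: float   # a1: fraction of cases with the whole answer right
    mean_laps: float  # average model round-trips per case
    mean_cost: float  # average cost per case, in any consistent unit


def passes_on_off_gate(off: RunSummary, on: RunSummary, target: str,
                       accuracy_tolerance: float = 0.0) -> bool:
    """Keep a change only if its target improved and accuracy did not drop."""
    improved = {
        "laps": on.mean_laps < off.mean_laps,
        "cost": on.mean_cost < off.mean_cost,
        "accuracy": on.accuracy > off.accuracy,
    }[target]
    # Accuracy is the guard metric for every change, whatever its target.
    accuracy_held = on.accuracy >= off.accuracy - accuracy_tolerance
    return improved and accuracy_held
```

Both runs must cover the same cases for the comparison to mean anything. For the second gate, the expected result would be written down before the held-out run and compared afterwards.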

The work (in order)

Phase 0 — prerequisite

  • 0 Record what the agent decided at each step
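
One possible shape for this record is an append-only log with one entry per lap, serialized as JSON lines so that on and off runs can be diffed. This is a minimal sketch, not the opensre implementation: `StepRecord`, `DecisionLog`, and every field name are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    """One decision the agent made (hypothetical fields)."""
    lap: int
    action: str  # e.g. "call_tool", "answer", "give_up"
    target: str  # tool name or candidate cause
    reason: str  # short explanation of why this step was chosen


@dataclass
class DecisionLog:
    records: list[StepRecord] = field(default_factory=list)

    def record(self, lap: int, action: str, target: str, reason: str) -> None:
        self.records.append(StepRecord(lap, action, target, reason))

    def to_jsonl(self) -> str:
        # One JSON object per line, with stable key order for easy diffing.
        return "\n".join(
            json.dumps(asdict(r), sort_keys=True) for r in self.records
        )
```

With such a log in place, the later phases can be measured directly, for example by counting repeated `target` values or laps taken before the final answer.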

Phase 1 — quick, safe, high value

  • 1 Stop fetching the same data twice
  • 2 Reuse the unchanging part of the prompt (caching)
  • 3 Stop deleting the best evidence
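
Items 1 and 3 can be sketched together: a per-investigation cache that answers repeated lookups without refetching, and a trim step that drops the lowest-value evidence first when space runs low. This is a hedged sketch. `LookupCache`, `trim_evidence`, the numeric evidence score, and the character-count budget are all assumptions for illustration, and the cache assumes a lookup's result does not change within one investigation.

```python
import json
from typing import Any, Callable


class LookupCache:
    """Memoize tool calls within one investigation, keyed by tool and args."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}
        self.hits = 0

    def call(self, tool: str, args: dict,
             fetch: Callable[[str, dict], Any]) -> Any:
        # Canonical key: same tool and same arguments give the same string.
        key = json.dumps([tool, args], sort_keys=True)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        result = fetch(tool, args)
        self._results[key] = result
        return result


def trim_evidence(items: list[tuple[float, str]], budget: int) -> list[str]:
    """Keep the highest-scored evidence that fits, in its original order.

    `items` holds (score, text) pairs; `budget` is a size limit in characters.
    """
    ranked = sorted(range(len(items)), key=lambda i: items[i][0], reverse=True)
    kept: set[int] = set()
    used = 0
    for i in ranked:
        size = len(items[i][1])
        if used + size <= budget:
            kept.add(i)
            used += size
    return [items[i][1] for i in sorted(kept)]
```

Giving seed evidence the highest score would keep it in context longest, which is the opposite of the behavior the benchmark showed.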

Phase 2 — next

  • 4 Know when to stop
  • 5 Reconsider before answering (use the second guess)
  • 6 Use clear, consistent words for the cause

Phase 3 — bigger, later

  • 7 Give opensre memory of past incidents
  • 8 Decide the approach per incident (a small planner)

Bottom line: most wasted time is doing the same work twice and not knowing when
to stop. Most wrong answers are deleted evidence or a wording / second-guess
problem. None of these need a smarter model — just a smarter loop.

Desired state

Improved opensre

Labels

enhancement (New feature or request)