[IMPROVEMENT] Improve opensre based on benchmark results #2903

### Current state

## What this is

We ran opensre against the Cloud-OpsBench test suite (#2074) and read the agent code. This issue tracks the changes we want to make to opensre as a result.

The full analysis is in the [#2074 summary comment](https://github.com/Tracer-Cloud/opensre/issues/2074).

## The rule

Every change is to **opensre itself** — the product. We are **not** changing the test to look better. If a change only helps the score and not a real user, it doesn't belong here.

## What the benchmark showed (short version)

- opensre often does extra work: repeats the same lookups, runs more laps than needed, and only stops when it gives up or hits the 20-lap limit.
- On big cases it throws away its best evidence first when space runs low.
- For the `service` category, more thinking doesn't help — it needs memory of past incidents.
- The right answer is often its second guess.

A few terms: **lap** = one round-trip to the AI model. **a1** = how often it gets the whole answer right. **Seed evidence** = the first batch of trustworthy data it pulls automatically.
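
To make **a1** and the second-guess finding concrete, here is a minimal sketch of how the two numbers could be compared. It simplifies "the whole answer" to an exact match on a single cause label. The record shape (`expected`, `guesses`), the function name `hit_rate`, and the sample cases are all made up for illustration; they are not the Cloud-OpsBench or opensre format, and the numbers are not benchmark results.

```python
from typing import Sequence


def hit_rate(cases: Sequence[dict], k: int = 1) -> float:
    """Fraction of cases whose expected cause appears in the first k guesses.

    k=1 plays the role of a1 here; k=2 also counts the second guess.
    Each case is a hypothetical dict: {"expected": str, "guesses": [str, ...]}.
    """
    if not cases:
        return 0.0
    hits = sum(1 for c in cases if c["expected"] in c["guesses"][:k])
    return hits / len(cases)


# Made-up cases, only to show the shape of the comparison.
cases = [
    {"expected": "oom_kill", "guesses": ["oom_kill", "bad_deploy"]},
    {"expected": "dns_failure", "guesses": ["bad_deploy", "dns_failure"]},
    {"expected": "disk_full", "guesses": ["bad_deploy", "oom_kill"]},
    {"expected": "bad_config", "guesses": ["bad_config"]},
]

a1 = hit_rate(cases, k=1)    # right on the first guess
top2 = hit_rate(cases, k=2)  # right within the first two guesses
print(f"a1={a1:.2f} top2={top2:.2f} gap={top2 - a1:.2f}")
```

A large gap between the two numbers is the signal that reconsidering before answering could pay off.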

## How we prove each change helped

Two gates, every time:

1. **On vs. off.** Run the benchmark with the change on and off, on the same cases. Keep it only if the target improved (fewer laps, lower cost, or higher accuracy) **and** nothing important got worse — especially accuracy.
2. **Held-out for anything that learns** (memory, reranking, wording). Test on cases it hasn't seen, and write down the expected result *before* running — so we measure real improvement, not memorization, and can't fool ourselves after.
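
The first gate can be written down as a small check so that every change is judged the same way. This is a sketch under assumptions: `RunStats`, `keep_change`, and `accuracy_tolerance` are hypothetical names, the sample values are placeholders rather than measured results, and it only guards accuracy, so any other "important" metric would need its own check.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate results of one benchmark run (hypothetical shape)."""
    accuracy: float  # a1, between 0 and 1
    avg_laps: float  # mean laps per case
    cost_usd: float  # total model cost for the run


def keep_change(off: RunStats, on: RunStats, target: str,
                accuracy_tolerance: float = 0.0) -> bool:
    """Gate 1: keep a change only if the target improved and accuracy held.

    target is one of "accuracy", "avg_laps", "cost_usd".
    accuracy_tolerance is an assumed allowance for run-to-run noise.
    """
    if target == "accuracy":
        improved = on.accuracy > off.accuracy
    elif target in ("avg_laps", "cost_usd"):
        improved = getattr(on, target) < getattr(off, target)
    else:
        raise ValueError(f"unknown target: {target}")
    accuracy_held = on.accuracy >= off.accuracy - accuracy_tolerance
    return improved and accuracy_held


# Placeholder numbers: same cases, change off vs. change on.
off = RunStats(accuracy=0.60, avg_laps=14.0, cost_usd=50.0)
on = RunStats(accuracy=0.60, avg_laps=9.0, cost_usd=41.0)
regressed = RunStats(accuracy=0.55, avg_laps=8.0, cost_usd=40.0)

print(keep_change(off, on, target="avg_laps"))         # fewer laps, accuracy held
print(keep_change(off, regressed, target="avg_laps"))  # fewer laps, accuracy dropped
```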

## The work (in order)

**Phase 0 — prerequisite**
- [ ] 0 Record what the agent decided at each step
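
One possible shape for item 0 is an append-only log with one JSON line per lap. The sketch below is an assumption about what such a record might hold; `LapRecord`, `record_lap`, and every field name are hypothetical and not taken from the opensre code.

```python
import io
import json
import time
from dataclasses import asdict, dataclass, field
from typing import TextIO


@dataclass
class LapRecord:
    """One decision made by the agent in one lap (hypothetical fields)."""
    case_id: str
    lap: int
    action: str  # e.g. "lookup", "answer", "give_up"
    reason: str  # the agent's stated reason for the action
    detail: dict = field(default_factory=dict)
    ts: float = field(default_factory=time.time)


def record_lap(out: TextIO, rec: LapRecord) -> None:
    """Append one record as a single JSON line so runs can be compared later."""
    out.write(json.dumps(asdict(rec), sort_keys=True) + "\n")


log = io.StringIO()  # a file opened in append mode would work the same way
record_lap(log, LapRecord("case-001", 1, "lookup", "need pod status",
                          {"tool": "get_pods", "args": {"namespace": "shop"}}))
record_lap(log, LapRecord("case-001", 2, "answer", "evidence is sufficient"))

records = [json.loads(line) for line in log.getvalue().splitlines()]
print(len(records), records[0]["action"])
```

With a log like this, questions such as "how many laps repeated an earlier lookup" become a simple pass over the records.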

**Phase 1 — quick, safe, high value**
- [ ] 1 Stop fetching the same data twice
- [ ] 2 Reuse the unchanging part of the prompt (caching)
- [ ] 3 Stop deleting the best evidence
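
Items 1 and 3 can be sketched in a few lines each. The code below is illustrative only: `LookupCache`, `trim_evidence`, the `priority` field, and the use of character counts as a stand-in for tokens are assumptions, not the current opensre design. The cache also assumes a lookup's result does not change during a single investigation, which would need checking for live data.

```python
import json
from typing import Any, Callable


class LookupCache:
    """Memoizes tool calls within one incident so identical lookups run once."""

    def __init__(self) -> None:
        self._results: dict = {}
        self.hits = 0

    def call(self, tool: str, args: dict,
             fetch: Callable[[str, dict], Any]) -> Any:
        # Canonical key: same tool and same args (in any key order) match.
        key = json.dumps([tool, args], sort_keys=True)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        result = fetch(tool, args)
        self._results[key] = result
        return result


def trim_evidence(items: list, budget: int) -> list:
    """Drop the lowest-priority evidence first until the total size fits.

    Each item is a hypothetical dict: {"text": str, "priority": int}.
    Higher priority means more trustworthy; seed evidence would rank highest.
    The result is ordered by priority, not by original position.
    """
    kept = sorted(items, key=lambda e: e["priority"], reverse=True)
    total = sum(len(e["text"]) for e in kept)
    while kept and total > budget:
        total -= len(kept.pop()["text"])  # last element has the lowest priority
    return kept


calls = []

def fake_fetch(tool: str, args: dict) -> dict:
    calls.append(tool)  # counts how often the real lookup would run
    return {"tool": tool, "ok": True}

cache = LookupCache()
cache.call("get_pods", {"namespace": "shop", "label": "api"}, fake_fetch)
cache.call("get_pods", {"label": "api", "namespace": "shop"}, fake_fetch)
print(len(calls), cache.hits)

evidence = [
    {"text": "s" * 40, "priority": 3},  # stands in for seed evidence
    {"text": "a" * 40, "priority": 1},
    {"text": "b" * 40, "priority": 2},
]
kept = trim_evidence(evidence, budget=80)
print([e["priority"] for e in kept])
```

Item 2 (prompt caching) depends on the model provider's API, so it is left out of this sketch.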

**Phase 2 — next**
- [ ] 4 Know when to stop
- [ ] 5 Reconsider before answering (use the second guess)
- [ ] 6 Use clear, consistent words for the cause
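
For item 4, one simple form a stop rule could take is "the leading guess has not changed and recent laps found nothing new". This is a sketch, not a proposal for the final rule: `should_stop`, `patience`, and the two history lists are hypothetical, and only the 20-lap limit comes from the findings above.

```python
from typing import Optional

MAX_LAPS = 20  # the lap limit mentioned in the benchmark findings


def should_stop(lap: int, top_guess_history: list,
                new_evidence_counts: list, patience: int = 2) -> Optional[str]:
    """Return a stop reason, or None to keep investigating.

    top_guess_history: the leading hypothesis after each lap so far.
    new_evidence_counts: how many new evidence items each lap produced.
    patience: assumed number of consecutive quiet laps required to stop.
    """
    if lap >= MAX_LAPS:
        return "lap_limit"
    enough_history = (len(top_guess_history) >= patience
                      and len(new_evidence_counts) >= patience)
    if enough_history:
        stable = len(set(top_guess_history[-patience:])) == 1
        quiet = sum(new_evidence_counts[-patience:]) == 0
        if stable and quiet:
            return "converged"
    return None


print(should_stop(3, ["oom_kill", "oom_kill", "oom_kill"], [2, 0, 0]))
print(should_stop(3, ["dns_failure", "oom_kill", "oom_kill"], [2, 1, 0]))
print(should_stop(20, [], []))
```

Returning a named reason rather than a boolean lets the decision log show why each run ended, which makes the on vs. off comparison easier to read.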

**Phase 3 — bigger, later**
- [ ] 7 Give opensre memory of past incidents
- [ ] 8 Decide the approach per incident (a small planner)

> Bottom line: most wasted time is doing the same work twice and not knowing when
> to stop. Most wrong answers are deleted evidence or a wording / second-guess
> problem. None of these need a smarter model — just a smarter loop.


### Desired state

Improved opensre
