[Feature] Span-level context confidence scores in multi-hop RAG retrieval: per-span quality signals for observability #3256

## Background

Phoenix is excellent for LLM/RAG tracing and observability. I've been using it for my [`finrag-eval`](https://github.com/Ruthwik-Data/finrag-eval) project, which evaluates multi-hop RAG pipelines over financial documents.

Phoenix traces capture the full span tree for retrieval chains, but the **span-level data doesn't expose context quality signals** — only the final retrieved context and output.

## The Gap

For multi-hop RAG pipelines, the most important observability signal isn't the final output quality — it's **which retrieval span degraded the chain.** 

Currently, Phoenix shows:
```
Span 1: retrieval (0.8s) ✓
Span 2: reranking (0.3s) ✓
Span 3: synthesis (1.2s) ✓
Final answer: [text]
```

What's missing:
```
Span 1: retrieval — context_confidence: 0.92, retrieved_chunks: 5
Span 2: reranking — context_confidence: 0.65, chunks_retained: 2  ← DEGRADATION HERE
Span 3: synthesis — context_confidence: 0.65 (inherited from upstream)
Final answer quality: 0.61 (directly caused by Span 2 degradation)
```

## Feature Request

Add **span-level context confidence** as a first-class attribute in Phoenix's RAG span schema:

```python
# Proposed: Phoenix span attributes for RAG spans (values are illustrative)
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("retrieval") as span:
    span.set_attribute("retrieval.context_confidence", 0.92)  # Per-span quality score
    span.set_attribute("retrieval.chunks_retrieved", 5)
    span.set_attribute("retrieval.chunks_above_threshold", 5)

with tracer.start_as_current_span("reranking") as span:
    span.set_attribute("reranking.context_confidence", 0.65)  # Post-rerank quality
    span.set_attribute("reranking.confidence_delta", -0.27)   # Signal: quality dropped here
```

This would enable Phoenix's existing evaluation framework to **attribute final answer quality degradation back to specific retrieval spans** — closing the loop between traces and evals.
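To make the attribution step concrete, here is a minimal sketch in plain Python, independent of Phoenix. It assumes each span in a chain already carries a `context_confidence` score and that `confidence_delta` is the span's score minus the previous span's score. The `SpanScore` class, the helper names, and the `min_drop` threshold are hypothetical and exist only for illustration; the scores are the example values from the trace above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanScore:
    name: str
    context_confidence: float
    confidence_delta: Optional[float] = None  # None for the first span in the chain


def annotate_deltas(chain: list[SpanScore]) -> list[SpanScore]:
    """Set confidence_delta on each span as (own score - previous span's score)."""
    for prev, cur in zip(chain, chain[1:]):
        cur.confidence_delta = round(cur.context_confidence - prev.context_confidence, 2)
    return chain


def first_degradation(chain: list[SpanScore], min_drop: float = 0.1) -> Optional[SpanScore]:
    """Return the earliest span whose confidence fell by at least min_drop."""
    for span in chain:
        if span.confidence_delta is not None and span.confidence_delta <= -min_drop:
            return span
    return None


chain = annotate_deltas([
    SpanScore("retrieval", 0.92),
    SpanScore("reranking", 0.65),
    SpanScore("synthesis", 0.65),
])
culprit = first_degradation(chain)
print(culprit.name, culprit.confidence_delta)  # reranking -0.27
```

Returning the earliest drop rather than the largest one is a design choice: downstream spans inherit low confidence, so the first span that crosses the threshold is the most likely origin of the degradation.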

## Why This Matters

For production RAG in financial/regulated domains, debugging "why did this answer fail?" requires span-level attribution, not just final output quality. Phoenix already has the trace infrastructure, so adding `context_confidence` as a standard span attribute would let users pinpoint the failing hop directly from the trace.

## Minimal Implementation Path

1. Add `context_confidence` to the OpenInference RAG span spec
2. Expose it in Phoenix's trace detail view as a per-span quality indicator
3. Allow filtering/sorting spans by `context_confidence` in the UI
4. Add `confidence_delta` between parent/child retrieval spans (child confidence minus parent confidence, so a negative value marks a drop) to surface degradation points automatically
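As a rough illustration of what step 3 would do, the sketch below filters and sorts spans by `context_confidence` using plain dictionaries. It is not Phoenix's API: the span records, the `low_confidence_spans` helper, and the `0.7` threshold are all assumptions made for the example.

```python
# Hypothetical span records carrying the proposed attribute.
spans = [
    {"name": "retrieval", "context_confidence": 0.92},
    {"name": "reranking", "context_confidence": 0.65},
    {"name": "synthesis", "context_confidence": 0.65},
]


def low_confidence_spans(spans, threshold=0.7):
    """Keep spans scoring below threshold, lowest confidence first."""
    flagged = [s for s in spans if s["context_confidence"] < threshold]
    # sorted() is stable, so spans with equal scores keep their chain order.
    return sorted(flagged, key=lambda s: s["context_confidence"])


for s in low_confidence_spans(spans):
    print(f'{s["name"]}: {s["context_confidence"]:.2f}')
```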

## Related Work

I explored this problem in my [`finrag-eval`](https://github.com/Ruthwik-Data/finrag-eval) project and had to instrument span confidence manually using custom attributes. Standardizing this in Phoenix/OpenInference would make it available to the whole ecosystem.
