
[Feature] Span-level context confidence scores in multi-hop RAG retrieval — per-span quality signals for observability #3256

Description

@Ruthwik-Data

Background

Phoenix is excellent for LLM/RAG tracing and observability. I've been using it for my finrag-eval project, which evaluates multi-hop RAG pipelines over financial documents.

Phoenix traces capture the full span tree for a retrieval chain, but individual spans carry no context quality signal. Only the final retrieved context and the output are visible.

The Gap

For multi-hop RAG pipelines, the most important observability signal isn't the final output quality — it's which retrieval span degraded the chain.

Currently, Phoenix shows:

Span 1: retrieval (0.8s) ✓
Span 2: reranking (0.3s) ✓
Span 3: synthesis (1.2s) ✓
Final answer: [text]

What's missing:

Span 1: retrieval — context_confidence: 0.92, retrieved_chunks: 5
Span 2: reranking — context_confidence: 0.65, chunks_retained: 2  ← DEGRADATION HERE
Span 3: synthesis — context_confidence: 0.65 (inherited from upstream)
Final answer quality: 0.61 (directly caused by Span 2 degradation)

Feature Request

Add span-level context confidence as a first-class attribute in Phoenix's RAG span schema:

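# Proposed: Phoenix span attributes for RAG spans (attribute names are not yet part of any spec)
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("retrieval") as span:
    span.set_attribute("retrieval.context_confidence", 0.92)  # Per-span quality score
    span.set_attribute("retrieval.chunks_retrieved", 5)
    span.set_attribute("retrieval.chunks_above_threshold", 5)
with tracer.start_as_current_span("reranking") as span:
    span.set_attribute("reranking.context_confidence", 0.65)  # Post-rerank quality
    span.set_attribute("reranking.confidence_delta", -0.27)   # Signal: quality dropped here (0.65 - 0.92)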

This would enable Phoenix's existing evaluation framework to attribute final answer quality degradation back to specific retrieval spans — closing the loop between traces and evals.
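
The request leaves open how a span would compute its score. As a minimal sketch, assuming each chunk carries a relevance score in [0, 1] and taking a span's confidence to be the mean of those scores, the proposed attribute values could be derived as below. The metric, the helper names `context_confidence` and `chunks_above_threshold`, the 0.5 threshold, and the sample scores are all assumptions for illustration; none of them come from Phoenix or OpenInference.

```python
from statistics import fmean

def context_confidence(chunk_scores):
    """Hypothetical metric: mean relevance score of the chunks a span emits."""
    return round(fmean(chunk_scores), 2) if chunk_scores else 0.0

def chunks_above_threshold(chunk_scores, threshold=0.5):
    """Count chunks whose score meets an (assumed) relevance threshold."""
    return sum(score >= threshold for score in chunk_scores)

# Made-up scores: five chunks from retrieval, two kept and re-scored by the reranker.
retrieved = [0.97, 0.95, 0.92, 0.90, 0.86]
reranked = [0.70, 0.60]

retrieval_attrs = {
    "retrieval.context_confidence": context_confidence(retrieved),
    "retrieval.chunks_retrieved": len(retrieved),
    "retrieval.chunks_above_threshold": chunks_above_threshold(retrieved),
}
reranking_attrs = {
    "reranking.context_confidence": context_confidence(reranked),
    # Negative delta means quality dropped relative to the upstream span.
    "reranking.confidence_delta": round(
        context_confidence(reranked) - context_confidence(retrieved), 2
    ),
}
```

Each dict could then be written onto its span with `span.set_attribute(key, value)`. Any scoring function that yields a comparable number per span would work equally well here.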

Why This Matters

For production RAG in financial/regulated domains, debugging "why did this answer fail?" requires span-level attribution, not just final output quality. Phoenix already has the trace infrastructure — adding context_confidence as a standard span attribute would make it the best tool for retrieval chain debugging.

Minimal Implementation Path

  1. Add context_confidence to the OpenInference RAG span spec
  2. Expose it in Phoenix's trace detail view as a per-span quality indicator
  3. Allow filtering/sorting spans by context_confidence in the UI
  4. Add confidence_delta between parent/child retrieval spans — surfaces degradation points automatically
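
Steps 3 and 4 can be approximated today as post-processing over exported span data. The following is a rough sketch, assuming spans arrive as plain dicts in execution order with hypothetical `name` and `context_confidence` keys. It compares each span with the one before it rather than following a true parent/child link, and the 0.1 drop cutoff is an arbitrary choice.

```python
def annotate_confidence_deltas(spans):
    """Return copies of spans with confidence_delta relative to the previous span."""
    annotated, previous = [], None
    for span in spans:
        current = span["context_confidence"]
        delta = None if previous is None else round(current - previous, 2)
        annotated.append({**span, "confidence_delta": delta})
        previous = current
    return annotated

def degradation_points(spans, min_drop=0.1):
    """Names of spans whose confidence fell by at least min_drop."""
    return [
        s["name"]
        for s in annotate_confidence_deltas(spans)
        if s["confidence_delta"] is not None and s["confidence_delta"] <= -min_drop
    ]

# Values taken from the example trace earlier in this issue.
chain = [
    {"name": "retrieval", "context_confidence": 0.92},
    {"name": "reranking", "context_confidence": 0.65},
    {"name": "synthesis", "context_confidence": 0.65},
]

flagged = degradation_points(chain)  # spans where quality dropped
lowest_first = sorted(chain, key=lambda s: s["context_confidence"])  # sort for triage
```

With the example values, only the reranking span is flagged, since synthesis inherits the same confidence and has a delta of zero.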

Related Work

I explored this problem in my finrag-eval project and had to instrument span confidence manually using custom attributes. Standardizing this in Phoenix/OpenInference would make it available to the whole ecosystem.


Labels: enhancement (New feature or request), triage (Issues that require triage)
