Background
Phoenix is excellent for LLM/RAG tracing and observability. I've been using it for my finrag-eval project, which evaluates multi-hop RAG pipelines over financial documents.
Phoenix traces capture the full span tree for retrieval chains, but the span-level data doesn't expose context quality signals — only the final retrieved context and output.
The Gap
For multi-hop RAG pipelines, the most important observability signal isn't the final output quality — it's which retrieval span degraded the chain.
Currently, Phoenix shows:
Span 1: retrieval (0.8s) ✓
Span 2: reranking (0.3s) ✓
Span 3: synthesis (1.2s) ✓
Final answer: [text]
What's missing:
Span 1: retrieval — context_confidence: 0.92, retrieved_chunks: 5
Span 2: reranking — context_confidence: 0.65, chunks_retained: 2 ← DEGRADATION HERE
Span 3: synthesis — context_confidence: 0.65 (inherited from upstream)
Final answer quality: 0.61 (directly caused by Span 2 degradation)
Feature Request
Add span-level context confidence as a first-class attribute in Phoenix's RAG span schema:
# Proposed: Phoenix span attributes for RAG spans
span.set_attribute("retrieval.context_confidence", 0.92) # Per-span quality score
span.set_attribute("retrieval.chunks_retrieved", 5)
span.set_attribute("retrieval.chunks_above_threshold", 5)
span.set_attribute("reranking.context_confidence", 0.65) # Post-rerank quality
span.set_attribute("reranking.confidence_delta", -0.27) # Signal: quality dropped here
This would enable Phoenix's existing evaluation framework to attribute final answer quality degradation back to specific retrieval spans — closing the loop between traces and evals.
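To make the proposal concrete, here is a minimal sketch of how the attribute values above could be derived and attached. It rests on assumptions that are not part of Phoenix or OpenInference: context_confidence is taken to be the mean relevance score of the chunks a span holds, the 0.5 threshold is arbitrary, and retrieval_attributes, reranking_attributes, annotate, and RecordingSpan are hypothetical names. RecordingSpan only stands in for a tracing span that offers set_attribute(key, value).

```python
from statistics import mean


def _confidence(chunk_scores):
    # Assumption: confidence is the mean chunk relevance score (0.0 if no chunks).
    return round(mean(chunk_scores), 4) if chunk_scores else 0.0


def retrieval_attributes(chunk_scores, threshold=0.5):
    """Build the proposed attributes for a retrieval span."""
    return {
        "retrieval.context_confidence": _confidence(chunk_scores),
        "retrieval.chunks_retrieved": len(chunk_scores),
        "retrieval.chunks_above_threshold": sum(s >= threshold for s in chunk_scores),
    }


def reranking_attributes(kept_scores, upstream_confidence):
    """Build the proposed attributes for a reranking span."""
    confidence = _confidence(kept_scores)
    return {
        "reranking.context_confidence": confidence,
        # Negative delta means quality dropped relative to the upstream span.
        "reranking.confidence_delta": round(confidence - upstream_confidence, 4),
    }


class RecordingSpan:
    """Hypothetical stand-in for a tracing span; stores attributes in a dict."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


def annotate(span, attrs):
    """Copy every attribute onto any object exposing set_attribute(key, value)."""
    for key, value in attrs.items():
        span.set_attribute(key, value)


retrieval_span = RecordingSpan()
annotate(retrieval_span, retrieval_attributes([0.95, 0.93, 0.92, 0.90, 0.90]))

reranking_span = RecordingSpan()
annotate(
    reranking_span,
    reranking_attributes(
        [0.70, 0.60],
        upstream_confidence=retrieval_span.attributes["retrieval.context_confidence"],
    ),
)
```

Because annotate only relies on set_attribute, the same helper should work with a real tracing span in place of RecordingSpan. How context_confidence is actually scored would be up to the instrumentation; the mean is just one simple choice.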
Why This Matters
For production RAG in financial/regulated domains, debugging "why did this answer fail?" requires span-level attribution, not just final output quality. Phoenix already has the trace infrastructure — adding context_confidence as a standard span attribute would make it the best tool for retrieval chain debugging.
Minimal Implementation Path
- Add context_confidence to the OpenInference RAG span spec
- Expose it in Phoenix's trace detail view as a per-span quality indicator
- Allow filtering/sorting spans by context_confidence in the UI
- Add confidence_delta between parent/child retrieval spans to surface degradation points automatically
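The last two items could be prototyped outside Phoenix along these lines. This is only a sketch under stated assumptions: SpanRecord, confidence_deltas, and degradation_points are hypothetical names, each stage is modeled as the child of the previous one, and the 0.1 minimum drop is an arbitrary cutoff. The confidence values are the ones from the example in The Gap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical flattened view of a span carrying context_confidence."""
    span_id: str
    name: str
    context_confidence: float
    parent_id: Optional[str] = None


def confidence_deltas(spans):
    """Map span_id -> (child confidence - parent confidence) for spans with a known parent."""
    by_id = {s.span_id: s for s in spans}
    deltas = {}
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id is not None else None
        if parent is not None:
            deltas[s.span_id] = round(s.context_confidence - parent.context_confidence, 4)
    return deltas


def degradation_points(spans, min_drop=0.1):
    """Return (span, delta) pairs whose confidence fell by at least min_drop, worst first."""
    by_id = {s.span_id: s for s in spans}
    flagged = [
        (by_id[span_id], delta)
        for span_id, delta in confidence_deltas(spans).items()
        if delta <= -min_drop
    ]
    return sorted(flagged, key=lambda pair: pair[1])


spans = [
    SpanRecord("1", "retrieval", 0.92),
    SpanRecord("2", "reranking", 0.65, parent_id="1"),
    SpanRecord("3", "synthesis", 0.65, parent_id="2"),
]

# Sorting by confidence, lowest first, mirrors the proposed UI sort.
lowest_first = sorted(spans, key=lambda s: s.context_confidence)

flagged = degradation_points(spans)
for span, delta in flagged:
    print(f"{span.name}: confidence_delta={delta}")
```

Here only the reranking span is flagged, since synthesis inherits its confidence unchanged. Computing the delta at query time like this would avoid storing it, while emitting it as a span attribute (as proposed) would let the UI filter on it directly.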
Related Work
I explored this problem in my finrag-eval project and had to instrument span confidence manually using custom attributes. Standardizing this in Phoenix/OpenInference would make it available to the whole ecosystem.