Roadmap — RAG Evaluation Lab

Milestone 1: Evaluation harness (Done)

Real embedding-backed retrieval over shared_core.vectorstore (in-memory + pgvector).
Ingestion pipeline: chunk (shared_core.docparse) → SHA-256 dedup → embed → index.
Three registered judges (retrieval hit rate, answer groundedness, citation coverage) wrapping preserved golden baselines.
Golden-question runner with aggregate metrics and shared_core.llmmetrics cost/latency.
Strategy comparison (A/B diff of the same goldens under two configs).
CI regression gate (rag-eval-gate, non-zero exit on metric drop) + saved baseline.
Run persistence (DB-by-default with in-memory fallback) + alembic migration.
Full API (/ingest, /query, /eval/run[/async], /eval/results[/{id}], /eval/compare), Celery worker, comprehensive offline tests, and docs.

LLM-as-judge groundedness — opt-in, behind a fixed rubric prompt, treated as untrusted output; consensus across providers for bias control.
MRR / NDCG — rank-aware retrieval metrics beyond binary hit rate.
Faithfulness / relevance — sentence-level claim support scoring.
Per-question drift — wire shared_core.evaljudge.DriftDetector for per-result regression alongside the aggregate gate.

Hybrid retrieval (RRF) — combine BM25 full-text with vector search via reciprocal rank fusion.
Re-ranking — second-stage cross-encoder re-ranker (opt-in, keyed).
More chunkers — recursive/character splitters, overlap sweeps as first-class comparison axes.

Synthetic golden bootstrapping — generate Q/gold-context pairs from raw corpora.
Cost dashboards — per-provider spend over time from the persisted telemetry.
Portfolio integration — ingest from document-intelligence-pipeline; feed failed production queries (from the LLM monitor) back into the golden set.
Dashboard UI — a later stage consumes the already-exposed API (no UI bundled here by design).