Roadmap — RAG Evaluation Lab

Milestone 1: Evaluation harness (Done)

  • Real embedding-backed retrieval over shared_core.vectorstore (in-memory + pgvector).
  • Ingestion pipeline: chunk (shared_core.docparse) → SHA-256 dedup → embed → index.
  • Three registered judges (retrieval hit rate, answer groundedness, citation coverage) wrapping preserved golden baselines.
  • Golden-question runner with aggregate metrics and shared_core.llmmetrics cost/latency.
  • Strategy comparison (A/B diff of the same goldens under two configs).
  • CI regression gate (rag-eval-gate, non-zero exit on metric drop) + saved baseline.
  • Run persistence (DB-by-default with in-memory fallback) + Alembic migration.
  • Full API (/ingest, /query, /eval/run[/async], /eval/results[/{id}], /eval/compare), Celery worker, comprehensive offline tests, and docs.
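
As an illustration of the regression gate described above, here is a minimal sketch of how a "non-zero exit on metric drop" check could work. It is not the actual rag-eval-gate implementation. The function names, the metric keys, and the `tolerance` parameter are hypothetical, and the sketch assumes every metric is "higher is better".

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the names of metrics that fell below the saved baseline.

    Assumption: all metrics are higher-is-better. A metric that is missing
    from the current run is also reported as a regression.
    """
    regressions = []
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions.append(name)
    return sorted(regressions)


def gate_exit_code(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.0,
) -> int:
    """Map the comparison to a process exit code: 0 passes, 1 fails the CI job."""
    return 1 if find_regressions(baseline, current, tolerance) else 0
```

A CI entry point could load both metric dictionaries from disk and finish with `sys.exit(gate_exit_code(baseline, current))`, so that any drop fails the pipeline.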

Milestone 2: Richer metrics & judging (Next)

  • LLM-as-judge groundedness — opt-in, behind a fixed rubric prompt, treated as untrusted output; consensus across providers for bias control.
  • MRR / NDCG — rank-aware retrieval metrics beyond binary hit rate.
  • Faithfulness / relevance — sentence-level claim support scoring.
  • Per-question drift — wire shared_core.evaljudge.DriftDetector for per-result regression alongside the aggregate gate.
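
For reference, the rank-aware metrics named above can be sketched in a few lines. This is a standalone illustration rather than the planned implementation: the function names are hypothetical, relevance is passed as a list of graded labels in retrieved order, and NDCG here uses linear gain with the ideal ordering computed from the retrieved list only (an assumption; some definitions use all known relevant documents).

```python
import math
from typing import Sequence


def reciprocal_rank(relevance: Sequence[float]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is relevant."""
    for position, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[float]]) -> float:
    """Average reciprocal rank over all questions."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r) for r in runs) / len(runs)


def dcg(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevance[:k], start=1)
    )


def ndcg(relevance: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the best possible ordering of the same labels."""
    ideal = dcg(sorted(relevance, reverse=True), k)
    return dcg(relevance, k) / ideal if ideal > 0 else 0.0
```

Unlike binary hit rate, both metrics reward placing the relevant chunk earlier: a hit at rank 3 scores 1/3 under MRR and 0.5 under NDCG@3, while a hit at rank 1 scores 1.0 under both.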

Milestone 3: Advanced retrieval (Future)

  • Hybrid retrieval (RRF) — combine BM25 full-text with vector search via reciprocal rank fusion.
  • Re-ranking — second-stage cross-encoder re-ranker (opt-in, keyed).
  • More chunkers — recursive/character splitters, overlap sweeps as first-class comparison axes.
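
Reciprocal rank fusion itself is small enough to sketch here. The snippet below is an illustration under stated assumptions, not the planned implementation: the function name and document IDs are hypothetical, each input is a ranked list of IDs (for example one from BM25 and one from vector search), and `k = 60` is the smoothing constant commonly used with RRF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from the two retrievers.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Because RRF works only on ranks, it does not need BM25 scores and cosine similarities to be on a comparable scale, which is the main reason it is a common choice for hybrid retrieval.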

Milestone 4: Scale & ecosystem (Future)

  • Synthetic golden bootstrapping — generate Q/gold-context pairs from raw corpora.
  • Cost dashboards — per-provider spend over time from the persisted telemetry.
  • Portfolio integration — ingest from document-intelligence-pipeline; feed failed production queries (from the LLM monitor) back into the golden set.
  • Dashboard UI — a later stage consumes the already-exposed API (no UI bundled here by design).