- Real embedding-backed retrieval over
shared_core.vectorstore(in-memory + pgvector). - Ingestion pipeline: chunk (
shared_core.docparse) → SHA-256 dedup → embed → index. - Three registered judges (retrieval hit rate, answer groundedness, citation coverage) wrapping preserved golden baselines.
- Golden-question runner with aggregate metrics and
shared_core.llmmetricscost/latency. - Strategy comparison (A/B diff of the same goldens under two configs).
- CI regression gate (
rag-eval-gate, non-zero exit on metric drop) + saved baseline. - Run persistence (DB-by-default with in-memory fallback) + alembic migration.
- Full API (
/ingest,/query,/eval/run[/async],/eval/results[/{id}],/eval/compare), Celery worker, comprehensive offline tests, and docs.
- LLM-as-judge groundedness — opt-in, behind a fixed rubric prompt, treated as untrusted output; consensus across providers for bias control.
- MRR / NDCG — rank-aware retrieval metrics beyond binary hit rate.
- Faithfulness / relevance — sentence-level claim support scoring.
- Per-question drift — wire
shared_core.evaljudge.DriftDetectorfor per-result regression alongside the aggregate gate.
- Hybrid retrieval (RRF) — combine BM25 full-text with vector search via reciprocal rank fusion.
- Re-ranking — second-stage cross-encoder re-ranker (opt-in, keyed).
- More chunkers — recursive/character splitters, overlap sweeps as first-class comparison axes.
- Synthetic golden bootstrapping — generate Q/gold-context pairs from raw corpora.
- Cost dashboards — per-provider spend over time from the persisted telemetry.
- Portfolio integration — ingest from document-intelligence-pipeline; feed failed production queries (from the LLM monitor) back into the golden set.
- Dashboard UI — a later stage consumes the already-exposed API (no UI bundled here by design).