A measurement harness for Retrieval-Augmented Generation: it scores retrieval hit rate, answer groundedness, and citation coverage over a golden-question set, compares chunking/retrieval strategies head-to-head, and fails CI when quality regresses — all runnable offline with no database, no API keys, and no network.
Every RAG tutorial shows how to stuff documents into a vector store and get answers out. Almost none answer the harder questions: Is the retriever returning the right chunks? Is the generated answer grounded in those chunks or hallucinated? Do the citations actually trace to real sources? Did my last change make any of that worse? Without numbers, you're shipping on vibes.
RAG Evaluation Lab closes that gap. It runs a real retrieval pipeline (embed → store → cosine search), scores each golden question with composable judges, persists every run, diffs strategies, and exposes a CI regression gate that exits non-zero when a metric drops below a saved baseline. It is offline-first: the default embedding provider is a deterministic hash fallback (no sentence-transformers, no torch), the default store is in-memory, and the default generator is a deterministic extractive answerer — so the whole thing is reproducible in tests and CI. When API keys / a database are present, the same code path uses real OpenAI/Anthropic generation and pgvector persistence.
- Real retrieval, not a mock: embeds the corpus with `shared_core.embeddings` and stores vectors in `shared_core.vectorstore` (`InMemoryVectorStore` offline, `PgVectorStore` when a DB is configured); retrieval is genuine cosine-similarity search.
- An ingestion pipeline: corpus → `chunk_text` (fixed / semantic / structural) → SHA-256 dedup → embed → store, with provenance back to source documents.
- Evaluation as composable judges: `retrieval_hit_rate`, `answer_groundedness`, and `citation_coverage` are `shared_core.evaljudge.Judge` subclasses registered in the shared registry, reusing `CitationJudge` / `SemanticMatchJudge`. The original scoring functions are preserved verbatim as golden-output baselines so scores can't silently drift.
- Strategy comparison: run the same goldens under two retrieval configs (e.g. semantic vs fixed chunking) and diff the per-metric scores to see which retrieves better.
- A CI regression gate: compares a run against a saved baseline and exits non-zero on a regression (`rag-eval-gate`), so quality is enforced like a test.
- Cost & latency reporting: generation telemetry flows through `shared_core.llmmetrics` (tokens, USD cost, p50/p95 latency).
- Persistence with graceful fallback: eval runs persist to PostgreSQL by default via a 2-second DB-availability probe, falling back to an in-memory store so tests and the demo need no database.
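
The ingestion step (chunk → SHA-256 dedup → provenance) can be illustrated with a minimal sketch. This is not the `shared_core` implementation: `Chunk`, `fixed_chunks`, and `ingest` are hypothetical names, only fixed-size chunking is shown, and normalizing whitespace and case before hashing is an assumption.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str   # provenance: the source document this chunk came from
    index: int    # position of the chunk within that document
    text: str
    digest: str   # SHA-256 of the normalized text, used as the dedup key


def fixed_chunks(text: str, size: int = 200, overlap: int = 0) -> list[str]:
    """Fixed-size character windows (hypothetical stand-in for chunk_text)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = (text[i:i + size] for i in range(0, len(text), step))
    return [w for w in windows if w.strip()]


def ingest(corpus: dict[str, str], size: int = 200) -> list[Chunk]:
    """Chunk every document and drop chunks whose normalized text was already seen."""
    seen: set[str] = set()
    chunks: list[Chunk] = []
    for doc_id, text in corpus.items():
        for index, piece in enumerate(fixed_chunks(text, size)):
            # Assumption: collapse whitespace and lowercase before hashing.
            normalized = " ".join(piece.split()).lower()
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # duplicate content; keep the first occurrence only
            seen.add(digest)
            chunks.append(Chunk(doc_id, index, piece, digest))
    return chunks
```

Because the digest is computed on normalized text, two documents that differ only in spacing or case collapse to a single chunk, and the surviving chunk still records which document it came from.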
graph TB
subgraph Ingestion
DOCS["Documents / corpus"] --> CHUNK["chunk_text<br/>(fixed / semantic / structural)"]
CHUNK --> DEDUP["SHA-256 dedup"]
DEDUP --> EMBED["get_embedding_provider<br/>(hash fallback offline · OpenAI when keyed)"]
EMBED --> STORE["VectorStore<br/>(InMemory offline · pgvector when DB set)"]
end
subgraph "Query / Generation"
Q["Query"] --> RET["VectorSearchEngine<br/>cosine top-k"]
STORE --> RET
RET --> GEN["RAGAnswerGenerator<br/>(extractive offline · LLM when keyed)"]
GEN --> ANS["Answer + citations + telemetry"]
end
subgraph "Evaluation"
GOLD["Golden questions (.jsonl)"] --> RUN["GoldenRunner"]
RET --> RUN
ANS --> RUN
RUN --> J1["RetrievalHitRateJudge"]
RUN --> J2["AnswerGroundednessJudge"]
RUN --> J3["CitationCoverageJudge"]
J1 & J2 & J3 --> AGG["EvalRun metrics + llmmetrics cost/latency"]
AGG --> CMP["compare_strategies (A/B diff)"]
AGG --> GATE["CI regression gate"]
end
subgraph "Persistence & API"
AGG --> EVALSTORE["EvalStore<br/>(InMemory offline · SQLAlchemy when DB set)"]
API["FastAPI:<br/>/ingest /query /eval/run<br/>/eval/results /eval/compare"] --> RUN
API --> EVALSTORE
WORKER["Celery worker<br/>run_evaluation_task"] --> RUN
end
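
To make the offline query path in the diagram concrete, here is a small sketch of a deterministic hash embedding followed by cosine top-k search. It is an illustration under assumptions, not the `HashFallbackProvider` or `VectorSearchEngine` code: `hash_embed` and `top_k` are hypothetical names, and the bucket-count scheme is one of many ways to build a hash-based vector.

```python
import hashlib
import math


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Deterministic embedding: hash each token into one of `dim` buckets, then L2-normalize."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def top_k(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Rank chunks by cosine similarity to the query (dot product of unit vectors)."""
    q = hash_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, hash_embed(chunk))), chunk)
        for chunk in chunks
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

A vector built this way only captures shared tokens, so ranking follows lexical overlap rather than meaning. Swapping `hash_embed` for a real embedding model would leave `top_k` unchanged.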
| Component | Choice | Rationale |
|---|---|---|
| API Framework | FastAPI + Uvicorn | Async-ready, auto OpenAPI, Pydantic validation |
| Retrieval | `shared_core.vectorstore` + `shared_core.embeddings` | One interface for in-memory (offline) and pgvector (prod) |
| Chunking | `shared_core.docparse.chunk_text` | Fixed / semantic / structural strategies, shared across the workspace |
| Judges | `shared_core.evaljudge` | Uniform `JudgeResult`; reuses `CitationJudge` / `SemanticMatchJudge` |
| Cost / latency | `shared_core.llmmetrics` | Token, USD, and p50/p95/p99 latency aggregation |
| Vector DB | PostgreSQL 16 + pgvector | Embeddings beside relational metadata (optional) |
| Broker / Queue | Redis 7 + Celery 5.3 | Async evaluation runs |
| Persistence | SQLAlchemy 2.0 + Alembic | eval_runs table; migrations in alembic/ |
| Config | pydantic-settings | BaseAppConfig → AppConfig |
| Lint / Test | ruff + pytest | E,W,F,I,C,B; 88-char; FastAPI TestClient |
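
The cost / latency row refers to token, USD, and percentile aggregation. The sketch below shows one way such a summary could be computed. It is not the `shared_core.llmmetrics` API: `CallRecord`, `percentile`, and `summarize` are hypothetical, the nearest-rank percentile definition is an assumption, and the per-1k-token prices are placeholders rather than real provider rates.

```python
import math
from dataclasses import dataclass


@dataclass
class CallRecord:
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(
    calls: list[CallRecord],
    usd_per_1k_prompt: float,      # placeholder price, not a real rate
    usd_per_1k_completion: float,  # placeholder price, not a real rate
) -> dict[str, float]:
    """Aggregate token counts, estimated USD cost, and p50/p95 latency over a run."""
    prompt = sum(c.prompt_tokens for c in calls)
    completion = sum(c.completion_tokens for c in calls)
    latencies = [c.latency_ms for c in calls]
    return {
        "total_tokens": prompt + completion,
        "usd_cost": prompt / 1000 * usd_per_1k_prompt
        + completion / 1000 * usd_per_1k_completion,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }
```

Nearest-rank always returns an observed sample, which keeps small offline runs easy to reason about. An interpolating percentile would give slightly different values on the same data.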
cd rag-evaluation-lab
# Offline-first: no DB, no keys, no network needed.
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ../shared-core[docparse] numpy
pip install -e ".[dev]"
# (optional) real persistence + providers
docker compose up -d # PostgreSQL (pgvector) + Redis
pip install -e ".[db,llm]"
cp .env.example .env # set DATABASE_URL / OPENAI_API_KEY when you want them
alembic upgrade head # provision the eval_runs table
# run the API
make dev # uvicorn on 0.0.0.0:8000
curl http://localhost:8000/health

make demo # python examples/run_demo.py (offline, deterministic, exits 0)

The demo walks the entire flow: ingest the sample corpus, run an ad-hoc RAG query, evaluate all 10 golden questions (hit rate / groundedness / citation coverage), compare semantic vs fixed chunking, and run the CI regression gate against the shipped baseline. Sample output:
--- 3. Golden-question evaluation (semantic chunking) ---
# RAG Evaluation Run: semantic
- Avg hit rate: 0.400
- Avg groundedness: 0.807
- Avg citation coverage: 1.000
--- 4. Strategy comparison: semantic vs fixed ---
winner: semantic
--- 5. CI regression gate (vs saved baseline) ---
gate passed: True (exit_code=0)
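
Steps 4 and 5 of the sample output can be sketched as a per-metric diff followed by a baseline check. This is a hedged illustration only: `diff_metrics`, `pick_winner`, and `gate` are hypothetical names, the "most metrics won" rule and the `tolerance` parameter are assumptions, and the project's real comparison and gate logic may differ.

```python
def diff_metrics(a: dict[str, float], b: dict[str, float]) -> dict[str, float]:
    """Per-metric delta (a minus b) over the metrics both runs report."""
    return {name: a[name] - b[name] for name in a.keys() & b.keys()}


def pick_winner(name_a: str, a: dict[str, float], name_b: str, b: dict[str, float]) -> str:
    """Assumed rule: the strategy that is better on more metrics wins."""
    deltas = diff_metrics(a, b)
    wins_a = sum(d > 0 for d in deltas.values())
    wins_b = sum(d < 0 for d in deltas.values())
    if wins_a == wins_b:
        return "tie"
    return name_a if wins_a > wins_b else name_b


def gate(
    run: dict[str, float], baseline: dict[str, float], tolerance: float = 0.01
) -> tuple[bool, list[str]]:
    """Pass only if no baseline metric dropped by more than `tolerance` (assumed slack)."""
    regressions = sorted(
        name for name, base in baseline.items()
        if run.get(name, 0.0) < base - tolerance
    )
    return (not regressions, regressions)


# Semantic scores are taken from the sample output above;
# the "fixed" scores are made up purely for illustration.
semantic = {"hit_rate": 0.400, "groundedness": 0.807, "citation_coverage": 1.000}
fixed = {"hit_rate": 0.300, "groundedness": 0.750, "citation_coverage": 1.000}

winner = pick_winner("semantic", semantic, "fixed", fixed)
passed, regressions = gate(run=semantic, baseline=semantic)
exit_code = 0 if passed else 1  # a CI wrapper would pass this to sys.exit
```

Returning the list of regressed metric names alongside the boolean lets a CI log say which metric dropped instead of only reporting a failure.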
make test # pytest: 117 tests, all offline (no DB / Redis / keys / network)

Coverage spans every core module:

- chunking + ingestion (`test_ingestion.py`)
- embedding-backed retrieval over the vector store (`test_retrieval.py`)
- each judge with golden-output regression cases (`test_evals.py`)
- the golden runner + strategy comparison (`test_runner.py`)
- the CI gate (`test_gate.py`)
- in-memory and SQLite-backed run persistence (`test_store.py`)
- the Celery worker task with no broker (`test_worker.py`)
- generation, including the mocked-LLM telemetry path (`test_generation.py`)
- reporting (`test_reports.py`)
- every API endpoint, success + error (`test_api.py`)
- the CLI gate (`test_cli.py`)
- an end-to-end integration flow (`test_core.py`)
| Method | Path | Description |
|---|---|---|
| POST | `/ingest` | Chunk, embed, and index documents into the retrieval engine |
| POST | `/query` | Retrieve top-k chunks and generate a grounded answer (`use_llm` / `mocked_response` optional) |
| POST | `/eval/run` | Run the golden questions through the full pipeline; persists by default |
| POST | `/eval/run/async` | Dispatch an evaluation run to the Celery worker (requires a broker) |
| GET | `/eval/results` | List persisted run summaries (newest first) |
| GET | `/eval/results/{id}` | Full payload of one persisted run (404 if unknown) |
| GET | `/eval/compare` | Compare the bundled goldens under two built-in chunk strategies |
| POST | `/eval/compare` | Compare two caller-supplied configs over supplied goldens/corpus |
| GET | `/health` | Service health; reports DB/Redis status (degraded + DB offline in offline mode) |

Everything a dashboard needs (run history, per-question judge breakdowns, deltas, cost/latency) is returned as JSON — no UI is bundled by design.
rag-eval-gate # run goldens over the sample corpus, gate vs baseline (exit 1 on regression)
rag-eval-gate --strategy fixed # evaluate a different chunking strategy
rag-eval-gate --update-baseline # overwrite datasets/baseline_metrics.json with this run
rag-eval-gate --json # machine-readable output for CI

| Variable | Default | Description |
|---|---|---|
| `APP_NAME` | `rag-evaluation-lab` | Service identifier in logs / health |
| `LOG_LEVEL` | `INFO` | Loguru level |
| `DATABASE_URL` | `postgresql+psycopg://...` | Postgres (pgvector); probed with a 2s timeout, falls back to in-memory |
| `REDIS_URL` | `redis://localhost:6379/0` | Celery broker + result backend |
| `OPENAI_API_KEY` | (unset) | Enables real OpenAI embeddings + generation when set |
| `ANTHROPIC_API_KEY` | (unset) | Enables real Anthropic generation when set |

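
The `DATABASE_URL` fallback (a probe with a 2s timeout, then in-memory) might look roughly like the sketch below. It assumes a plain TCP reachability check, which is weaker than opening a real database session; `db_reachable` and `choose_store` are hypothetical names, and the project's actual probe may work differently.

```python
import socket
from typing import Callable, Optional
from urllib.parse import urlsplit


def db_reachable(database_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host/port opens within `timeout` seconds."""
    parts = urlsplit(database_url)
    host = parts.hostname or "localhost"
    port = parts.port or 5432  # default PostgreSQL port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timed out
        return False


def choose_store(
    database_url: Optional[str],
    probe: Callable[[str], bool] = db_reachable,
) -> str:
    """Pick the persistent store only when a URL is set and the probe succeeds."""
    if database_url and probe(database_url):
        return "sqlalchemy"
    return "in-memory"
```

Passing the probe in as a parameter keeps the selection logic testable without a socket, which matches the offline-first goal of running everything with no database.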
- Offline embeddings are deterministic, not semantic. `HashFallbackProvider` produces stable but near-orthogonal vectors, so offline retrieval ranks correctly on lexical signal more than meaning. With `OPENAI_API_KEY` set the pipeline uses real embeddings and hit rate rises accordingly. Offline hit rate over the bundled corpus (~0.40) is the honest, reproducible baseline, not a ceiling.
- Groundedness is a lexical heuristic. `calculate_answer_groundedness` counts answer/source word overlap (short words always count). It is a cheap proxy, kept as a stable golden baseline; an LLM-as-judge groundedness scorer is on the roadmap.
- Async runs need a broker. `/eval/run/async` requires Redis; the synchronous `/eval/run` and the worker's importable task helper work with no broker.
- No frontend. By design: the API exposes all data a dashboard would render.
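
To show why lexical groundedness is only a proxy, here is a minimal sketch of a word-overlap scorer in the spirit of the description above. It is not the preserved `calculate_answer_groundedness` function: `lexical_groundedness` is a hypothetical name, and both the tokenizer and the 3-character cutoff for "short words" are assumptions.

```python
import re


def _words(text: str) -> list[str]:
    """Lowercased alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_groundedness(answer: str, sources: list[str], short_len: int = 3) -> float:
    """Fraction of answer words found in the sources; words of <= short_len chars always count."""
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    source_vocab = {word for source in sources for word in _words(source)}
    supported = sum(
        1 for word in answer_words
        if len(word) <= short_len or word in source_vocab
    )
    return supported / len(answer_words)
```

With this sketch, an answer such as "the moon is cheese" scored against an unrelated source still gets 0.5, because "the" and "is" count automatically. That is the kind of inflation an LLM-as-judge scorer would be expected to avoid.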
See docs/roadmap.md and docs/EXECUTION_PLAN.md. Highlights: LLM-as-judge groundedness, hybrid (BM25 + vector) retrieval, MRR/NDCG metrics, per-provider cost dashboards, and export of eval runs to the broader portfolio.
MIT — see LICENSE.