RAG Evaluation Lab

A measurement harness for Retrieval-Augmented Generation: it scores retrieval hit rate, answer groundedness, and citation coverage over a golden-question set, compares chunking/retrieval strategies head-to-head, and fails CI when quality regresses — all runnable offline with no database, no API keys, and no network.

Why This Exists

Every RAG tutorial shows how to stuff documents into a vector store and get answers out. Almost none answer the harder questions: Is the retriever returning the right chunks? Is the generated answer grounded in those chunks or hallucinated? Do the citations actually trace to real sources? Did my last change make any of that worse? Without numbers, you're shipping on vibes.

RAG Evaluation Lab closes that gap. It runs a real retrieval pipeline (embed → store → cosine search), scores each golden question with composable judges, persists every run, diffs strategies, and exposes a CI regression gate that exits non-zero when a metric drops below a saved baseline. It is offline-first: the default embedding provider is a deterministic hash fallback (no sentence-transformers, no torch), the default store is in-memory, and the default generator is a deterministic extractive answerer, so every run is reproducible in tests and CI. When API keys or a database are configured, the same code path uses real OpenAI/Anthropic generation and pgvector persistence.
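To make the embed → store → cosine search path concrete, here is a minimal, standard-library sketch of a deterministic hash embedding with brute-force cosine ranking. It is an illustration only: `hash_embed` and `cosine_top_k` are hypothetical names, and the hashing scheme is an assumption, not the algorithm used by shared_core.embeddings.

```python
import hashlib
import math
from typing import List, Tuple


def hash_embed(text: str, dim: int = 64) -> List[float]:
    """Deterministic bag-of-words hash embedding (illustrative assumption)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim  # which dimension to touch
        sign = 1.0 if digest[4] % 2 == 0 else -1.0        # signed hashing trick
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0       # avoid division by zero
    return [v / norm for v in vec]


def cosine_top_k(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Rank chunks by cosine similarity; vectors are unit length, so dot == cosine."""
    q = hash_embed(query)
    scored = [(sum(a * b for a, b in zip(q, hash_embed(c))), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Because the same text always hashes to the same vector, repeated runs produce identical rankings, which is the property that makes an offline default reproducible in CI.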

What It Demonstrates

Real retrieval, not a mock — embeds the corpus with shared_core.embeddings and stores vectors in shared_core.vectorstore (InMemoryVectorStore offline, PgVectorStore when a DB is configured); retrieval is genuine cosine-similarity search.
An ingestion pipeline — corpus → chunk_text (fixed / semantic / structural) → SHA-256 dedup → embed → store, with provenance back to source documents.
Evaluation as composable judges — retrieval_hit_rate, answer_groundedness, and citation_coverage are shared_core.evaljudge.Judge subclasses registered in the shared registry, reusing CitationJudge/SemanticMatchJudge. The original scoring functions are preserved verbatim as golden-output baselines so scores can't silently drift.
Strategy comparison — run the same goldens under two retrieval configs (e.g. semantic vs fixed chunking) and diff the per-metric scores to see which retrieves better.
A CI regression gate — compares a run against a saved baseline and exits non-zero on a regression (rag-eval-gate), so quality is enforced like a test.
Cost & latency reporting — generation telemetry flows through shared_core.llmmetrics (tokens, USD cost, p50/p95 latency).
Persistence with graceful fallback — eval runs persist to PostgreSQL by default via a 2-second DB-availability probe, falling back to an in-memory store so tests and the demo need no database.
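The chunk → SHA-256 dedup step of the ingestion pipeline can be sketched roughly as follows. This is a simplified stand-in under stated assumptions: `fixed_chunks` is a hypothetical character-window chunker (not shared_core's chunk_text), and the record fields (`doc_id`, `chunk_index`, `sha256`) are illustrative names for the provenance the pipeline keeps.

```python
import hashlib
from typing import Dict, List, Set


def fixed_chunks(text: str, size: int = 200) -> List[str]:
    """Split text into fixed-size character windows, skipping blank ones."""
    windows = (text[i:i + size] for i in range(0, len(text), size))
    return [w for w in windows if w.strip()]


def dedup_chunks(docs: Dict[str, str], size: int = 200) -> List[dict]:
    """Keep the first occurrence of each chunk, keyed by its SHA-256 digest."""
    seen: Set[str] = set()
    unique: List[dict] = []
    for doc_id, text in docs.items():
        for index, chunk in enumerate(fixed_chunks(text, size)):
            digest = hashlib.sha256(chunk.strip().encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # identical content was already ingested
            seen.add(digest)
            unique.append(
                {"doc_id": doc_id, "chunk_index": index, "sha256": digest, "text": chunk}
            )
    return unique
```

Hashing the chunk content rather than the document means that two files sharing a boilerplate paragraph contribute that paragraph to the store only once, while each surviving record still points back to the document it came from.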

Architecture

graph TB
    subgraph Ingestion
        DOCS["Documents / corpus"] --> CHUNK["chunk_text<br/>(fixed / semantic / structural)"]
        CHUNK --> DEDUP["SHA-256 dedup"]
        DEDUP --> EMBED["get_embedding_provider<br/>(hash fallback offline · OpenAI when keyed)"]
        EMBED --> STORE["VectorStore<br/>(InMemory offline · pgvector when DB set)"]
    end

    subgraph "Query / Generation"
        Q["Query"] --> RET["VectorSearchEngine<br/>cosine top-k"]
        STORE --> RET
        RET --> GEN["RAGAnswerGenerator<br/>(extractive offline · LLM when keyed)"]
        GEN --> ANS["Answer + citations + telemetry"]
    end

    subgraph "Evaluation"
        GOLD["Golden questions (.jsonl)"] --> RUN["GoldenRunner"]
        RET --> RUN
        ANS --> RUN
        RUN --> J1["RetrievalHitRateJudge"]
        RUN --> J2["AnswerGroundednessJudge"]
        RUN --> J3["CitationCoverageJudge"]
        J1 & J2 & J3 --> AGG["EvalRun metrics + llmmetrics cost/latency"]
        AGG --> CMP["compare_strategies (A/B diff)"]
        AGG --> GATE["CI regression gate"]
    end

    subgraph "Persistence & API"
        AGG --> EVALSTORE["EvalStore<br/>(InMemory offline · SQLAlchemy when DB set)"]
        API["FastAPI:<br/>/ingest /query /eval/run<br/>/eval/results /eval/compare"] --> RUN
        API --> EVALSTORE
        WORKER["Celery worker<br/>run_evaluation_task"] --> RUN
    end
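The evaluation subgraph (runner → judges → aggregated metrics) might be sketched like this. Everything here is an assumption for illustration: the `Sample` shape, the plain-dict registry, and the metric definitions (share of expected sources retrieved, share of citations that trace to a retrieved source) are not taken from shared_core.evaljudge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Sample:
    """One golden question after retrieval and generation (hypothetical shape)."""
    expected_sources: Set[str]
    retrieved_sources: List[str]
    cited_sources: List[str]


def retrieval_hit_rate(sample: Sample) -> float:
    # Assumed definition: share of expected sources found among retrieved ones.
    if not sample.expected_sources:
        return 0.0
    hits = sample.expected_sources & set(sample.retrieved_sources)
    return len(hits) / len(sample.expected_sources)


def citation_coverage(sample: Sample) -> float:
    # Assumed definition: share of citations that point at a retrieved source.
    if not sample.cited_sources:
        return 0.0
    retrieved = set(sample.retrieved_sources)
    traced = sum(1 for source in sample.cited_sources if source in retrieved)
    return traced / len(sample.cited_sources)


# A plain dict stands in for the shared judge registry.
JUDGES: Dict[str, Callable[[Sample], float]] = {
    "retrieval_hit_rate": retrieval_hit_rate,
    "citation_coverage": citation_coverage,
}


def average_scores(samples: List[Sample]) -> Dict[str, float]:
    """Run every registered judge over every sample and average per judge."""
    if not samples:
        return {name: 0.0 for name in JUDGES}
    return {
        name: sum(judge(s) for s in samples) / len(samples)
        for name, judge in JUDGES.items()
    }
```

Keeping each judge a small function over the same sample type is what makes them composable: adding a metric means registering one more entry, and the aggregation step does not change.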

Tech Stack

| Component | Choice | Rationale |
| --- | --- | --- |
| API Framework | FastAPI + Uvicorn | Async-ready, auto OpenAPI, Pydantic validation |
| Retrieval | `shared_core.vectorstore` + `shared_core.embeddings` | One interface for in-memory (offline) and pgvector (prod) |
| Chunking | `shared_core.docparse.chunk_text` | Fixed / semantic / structural strategies, shared across the workspace |
| Judges | `shared_core.evaljudge` | Uniform `JudgeResult`; reuses `CitationJudge`/`SemanticMatchJudge` |
| Cost / latency | `shared_core.llmmetrics` | Token, USD, and p50/p95/p99 latency aggregation |
| Vector DB | PostgreSQL 16 + pgvector | Embeddings beside relational metadata (optional) |
| Broker / Queue | Redis 7 + Celery 5.3 | Async evaluation runs |
| Persistence | SQLAlchemy 2.0 + Alembic | `eval_runs` table; migrations in `alembic/` |
| Config | pydantic-settings | `BaseAppConfig` → `AppConfig` |
| Lint / Test | ruff + pytest | E,W,F,I,C,B; 88-char; FastAPI TestClient |

Local Setup

cd rag-evaluation-lab

# Offline-first: no DB, no keys, no network needed.
python -m venv .venv && source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install -e ../shared-core[docparse] numpy
pip install -e ".[dev]"

# (optional) real persistence + providers
docker compose up -d          # PostgreSQL (pgvector) + Redis
pip install -e ".[db,llm]"
cp .env.example .env          # set DATABASE_URL / OPENAI_API_KEY when you want them
alembic upgrade head          # provision the eval_runs table

# run the API
make dev                      # uvicorn on 0.0.0.0:8000
curl http://localhost:8000/health

Demo

make demo            # python examples/run_demo.py — offline, deterministic, exits 0

The demo walks the entire flow: ingest the sample corpus, run an ad-hoc RAG query, evaluate all 10 golden questions (hit rate / groundedness / citation coverage), compare semantic vs fixed chunking, and run the CI regression gate against the shipped baseline. Sample output:

--- 3. Golden-question evaluation (semantic chunking) ---
# RAG Evaluation Run: semantic
- Avg hit rate: 0.400
- Avg groundedness: 0.807
- Avg citation coverage: 1.000

--- 4. Strategy comparison: semantic vs fixed ---
winner: semantic

--- 5. CI regression gate (vs saved baseline) ---
gate passed: True (exit_code=0)
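The strategy comparison in step 4 boils down to diffing two metric dictionaries. A minimal sketch follows; `compare_metrics` is a hypothetical helper, and the winner rule (the strategy that improves more metrics wins) is an assumption rather than the rule compare_strategies actually applies.

```python
from typing import Dict


def compare_metrics(
    a: Dict[str, float],
    b: Dict[str, float],
    name_a: str = "semantic",
    name_b: str = "fixed",
) -> dict:
    """Diff shared metrics (a minus b) and pick a winner by count of wins."""
    deltas = {m: round(a[m] - b[m], 6) for m in sorted(a.keys() & b.keys())}
    wins_a = sum(1 for d in deltas.values() if d > 0)
    wins_b = sum(1 for d in deltas.values() if d < 0)
    if wins_a > wins_b:
        winner = name_a
    elif wins_b > wins_a:
        winner = name_b
    else:
        winner = "tie"
    return {"deltas": deltas, "winner": winner}
```

Calling `compare_metrics(semantic_metrics, fixed_metrics)` with the averaged metrics from two real runs returns the per-metric deltas alongside the winner label, which is the information a head-to-head report needs.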

Tests

make test            # pytest — 117 tests, all offline (no DB / Redis / keys / network)

Coverage spans every core module: chunking and ingestion (test_ingestion.py), embedding-backed retrieval over the vector store (test_retrieval.py), each judge with golden-output regression cases (test_evals.py), the golden runner and strategy comparison (test_runner.py), the CI gate (test_gate.py), in-memory and SQLite-backed run persistence (test_store.py), the Celery worker task with no broker (test_worker.py), generation, including the mocked-LLM telemetry path (test_generation.py), reporting (test_reports.py), success and error paths for every API endpoint (test_api.py), the CLI gate (test_cli.py), and an end-to-end integration flow (test_core.py).

API Reference

| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/ingest` | Chunk, embed, and index documents into the retrieval engine |
| `POST` | `/query` | Retrieve top-k chunks and generate a grounded answer (`use_llm` / `mocked_response` optional) |
| `POST` | `/eval/run` | Run the golden questions through the full pipeline; persists by default |
| `POST` | `/eval/run/async` | Dispatch an evaluation run to the Celery worker (requires a broker) |
| `GET` | `/eval/results` | List persisted run summaries (newest first) |
| `GET` | `/eval/results/{id}` | Full payload of one persisted run (404 if unknown) |
| `GET` | `/eval/compare` | Compare the bundled goldens under two built-in chunk strategies |
| `POST` | `/eval/compare` | Compare two caller-supplied configs over supplied goldens/corpus |
| `GET` | `/health` | Service health; reports DB/Redis status (degraded, with the DB offline, when running in offline mode) |

Everything a dashboard needs (run history, per-question judge breakdowns, deltas, cost/latency) is returned as JSON — no UI is bundled by design.
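As a rough illustration of calling the API from a script, the sketch below builds a `POST /query` request with the standard library. The body field names `query` and `top_k` are assumptions (only `use_llm` and `mocked_response` are named above), so check the generated OpenAPI schema for the real request model before relying on them.

```python
import json
import urllib.request


def build_query_request(
    question: str,
    base_url: str = "http://localhost:8000",
    use_llm: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /query endpoint."""
    # "query" and "top_k" are assumed field names, not confirmed by this README.
    payload = {"query": question, "top_k": 3, "use_llm": use_llm}
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it against a running server (for example after `make dev`):
#   with urllib.request.urlopen(build_query_request("..."), timeout=10) as resp:
#       print(json.load(resp))
```

Separating request construction from sending keeps the example testable without a server and makes it easy to swap in another HTTP client.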

CLI

rag-eval-gate                    # run goldens over the sample corpus, gate vs baseline (exit 1 on regression)
rag-eval-gate --strategy fixed   # evaluate a different chunking strategy
rag-eval-gate --update-baseline  # overwrite datasets/baseline_metrics.json with this run
rag-eval-gate --json             # machine-readable output for CI
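The gate's core decision can be sketched in a few lines. This is a simplified model under assumptions: it treats the baseline as a flat mapping of metric name to score (the real layout of datasets/baseline_metrics.json may differ), and `find_regressions`, `gate_exit_code`, and the `tolerance` parameter are hypothetical names, not the CLI's internals.

```python
from typing import Dict, Tuple


def find_regressions(
    current: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (baseline, current)} for every metric that dropped."""
    regressions = {}
    for name, floor in baseline.items():
        value = current.get(name, 0.0)  # a missing metric counts as a regression
        if value < floor - tolerance:
            regressions[name] = (floor, value)
    return regressions


def gate_exit_code(
    current: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.0,
) -> int:
    """0 when nothing regressed, 1 otherwise, mirroring a failing test."""
    return 1 if find_regressions(current, baseline, tolerance) else 0
```

In a CI job the integer would be passed to `sys.exit`, so a drop in any tracked metric fails the build the same way a failing unit test does.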

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `APP_NAME` | `rag-evaluation-lab` | Service identifier in logs / health |
| `LOG_LEVEL` | `INFO` | Loguru level |
| `DATABASE_URL` | `postgresql+psycopg://...` | Postgres (pgvector); probed with a 2s timeout, falls back to in-memory |
| `REDIS_URL` | `redis://localhost:6379/0` | Celery broker + result backend |
| `OPENAI_API_KEY` | (unset) | Enables real OpenAI embeddings + generation when set |
| `ANTHROPIC_API_KEY` | (unset) | Enables real Anthropic generation when set |
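The probe-then-fall-back behaviour described for `DATABASE_URL` could look roughly like the sketch below. It is an assumption-laden illustration: `tcp_probe` and `choose_store` are hypothetical names, and a bare TCP connect is only one way to test availability; the project may probe through SQLAlchemy instead.

```python
import socket
from typing import Callable, Optional
from urllib.parse import urlsplit


def tcp_probe(database_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the database host opens in time."""
    parts = urlsplit(database_url)
    host = parts.hostname or "localhost"
    port = parts.port or 5432  # default PostgreSQL port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timed out
        return False


def choose_store(
    database_url: Optional[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> str:
    """Pick the persistent store only when a URL is set and the probe succeeds."""
    if database_url and probe(database_url):
        return "sqlalchemy"
    return "in-memory"
```

Passing the probe in as a parameter lets tests substitute a fake, so the fallback path can be exercised without a database or any network access.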

Known Limitations

Offline embeddings are deterministic, not semantic. HashFallbackProvider produces stable but near-orthogonal vectors, so offline retrieval ranks chunks on lexical signal rather than meaning. With OPENAI_API_KEY set, the pipeline uses real embeddings and hit rate rises accordingly. The offline hit rate over the bundled corpus (~0.40) is the honest, reproducible baseline, not a ceiling.
Groundedness is a lexical heuristic. calculate_answer_groundedness counts answer/source word overlap (short words always count). It is a cheap proxy, kept as a stable golden baseline; an LLM-as-judge groundedness scorer is on the roadmap.
Async runs need a broker. /eval/run/async requires Redis; the synchronous /eval/run and the worker's importable task helper work with no broker.
No frontend. By design — the API exposes all data a dashboard would render.
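To show why lexical groundedness is only a proxy, here is a minimal word-overlap scorer in the spirit of the description above. It is not calculate_answer_groundedness itself: the tokenizer, the `min_len` cutoff for "short words", and the function name `lexical_groundedness` are all assumptions.

```python
import re
from typing import List


def _tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; punctuation and hyphens split words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_groundedness(answer: str, sources: List[str], min_len: int = 4) -> float:
    """Share of answer words that are short or appear in any source chunk."""
    source_words = set()
    for source in sources:
        source_words.update(_tokens(source))
    answer_words = _tokens(answer)
    if not answer_words:
        return 0.0
    supported = sum(
        1 for word in answer_words
        if len(word) < min_len or word in source_words  # short words always count
    )
    return supported / len(answer_words)
```

Under these assumptions an answer made entirely of short words scores 1.0 even with no sources at all, which illustrates the kind of blind spot an LLM-as-judge scorer is meant to address.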

Roadmap

See docs/roadmap.md and docs/EXECUTION_PLAN.md. Highlights: LLM-as-judge groundedness, hybrid (BM25 + vector) retrieval, MRR/NDCG metrics, per-provider cost dashboards, and export of eval runs to the broader portfolio.

License

MIT — see LICENSE.
