FishRaposo/rag-evaluation-lab

RAG Evaluation Lab

Python 3.10+ · FastAPI · PostgreSQL · pgvector · Redis · Celery · License: MIT

A measurement harness for Retrieval-Augmented Generation: it scores retrieval hit rate, answer groundedness, and citation coverage over a golden-question set, compares chunking/retrieval strategies head-to-head, and fails CI when quality regresses — all runnable offline with no database, no API keys, and no network.

Why This Exists

Every RAG tutorial shows how to stuff documents into a vector store and get answers out. Almost none answer the harder questions: Is the retriever returning the right chunks? Is the generated answer grounded in those chunks or hallucinated? Do the citations actually trace to real sources? Did my last change make any of that worse? Without numbers, you're shipping on vibes.

RAG Evaluation Lab closes that gap. It runs a real retrieval pipeline (embed → store → cosine search), scores each golden question with composable judges, persists every run, diffs strategies, and exposes a CI regression gate that exits non-zero when a metric drops below a saved baseline. It is offline-first: the default embedding provider is a deterministic hash fallback (no sentence-transformers, no torch), the default store is in-memory, and the default generator is a deterministic extractive answerer — so the whole thing is reproducible in tests and CI. When API keys / a database are present, the same code path uses real OpenAI/Anthropic generation and pgvector persistence.
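The embed → store → cosine search path described above can be illustrated with a small standalone sketch. It is not the code in shared_core.embeddings or shared_core.vectorstore: `hash_embed`, `cosine`, `top_k`, and the 64-dimension default are hypothetical names and values chosen for illustration, assuming only that the offline provider hashes text deterministically into a vector.

```python
import hashlib
import math


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Deterministic bag-of-words hash embedding (illustrative stand-in only)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0       # signed feature hashing
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Unit-normalize so a plain dot product equals cosine similarity.
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity for vectors that are already unit-normalized."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the k best matches."""
    q = hash_embed(query)
    scored = [(cosine(q, hash_embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Because the vectors depend only on the input text, two runs over the same corpus rank chunks identically, which is the property that makes an offline run reproducible in CI. Similarity here comes from shared tokens only, so it carries no semantic signal.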

What It Demonstrates

  • Real retrieval, not a mock — embeds the corpus with shared_core.embeddings and stores vectors in shared_core.vectorstore (InMemoryVectorStore offline, PgVectorStore when a DB is configured); retrieval is genuine cosine-similarity search.
  • An ingestion pipeline — corpus → chunk_text (fixed / semantic / structural) → SHA-256 dedup → embed → store, with provenance back to source documents.
  • Evaluation as composable judges: retrieval_hit_rate, answer_groundedness, and citation_coverage are shared_core.evaljudge.Judge subclasses registered in the shared registry, reusing CitationJudge/SemanticMatchJudge. The original scoring functions are preserved verbatim as golden-output baselines so scores can't silently drift.
  • Strategy comparison — run the same goldens under two retrieval configs (e.g. semantic vs fixed chunking) and diff the per-metric scores to see which retrieves better.
  • A CI regression gate — compares a run against a saved baseline and exits non-zero on a regression (rag-eval-gate), so quality is enforced like a test.
  • Cost & latency reporting — generation telemetry flows through shared_core.llmmetrics (tokens, USD cost, p50/p95 latency).
  • Persistence with graceful fallback — eval runs persist to PostgreSQL by default via a 2-second DB-availability probe, falling back to an in-memory store so tests and the demo need no database.
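As a rough illustration of the ingestion bullet (corpus → chunk → SHA-256 dedup → provenance), here is a minimal sketch using only the standard library. The `Chunk` dataclass, `fixed_chunks`, `ingest`, and the whitespace normalization before hashing are assumptions made for this example; the project's own chunk_text and dedup logic may differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str  # provenance: the source document this chunk came from
    index: int   # position of the chunk inside that document
    text: str
    sha256: str  # content hash used for deduplication


def fixed_chunks(text: str, size: int = 200, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows with optional overlap."""
    step = max(1, size - overlap)
    return [text[i:i + size] for i in range(0, len(text), step)]


def ingest(corpus: dict[str, str], size: int = 200) -> list[Chunk]:
    """Chunk every document and drop chunks whose content was already seen."""
    seen: set[str] = set()
    chunks: list[Chunk] = []
    for doc_id, text in corpus.items():
        for index, piece in enumerate(fixed_chunks(text, size)):
            normalized = " ".join(piece.split())  # assumption: collapse whitespace
            if not normalized:
                continue
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # duplicate content, keep the first occurrence only
            seen.add(digest)
            chunks.append(Chunk(doc_id, index, normalized, digest))
    return chunks
```

Keeping `doc_id` and `index` on each chunk is what lets a later citation be traced back to the document it came from.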

Architecture

graph TB
    subgraph Ingestion
        DOCS["Documents / corpus"] --> CHUNK["chunk_text<br/>(fixed / semantic / structural)"]
        CHUNK --> DEDUP["SHA-256 dedup"]
        DEDUP --> EMBED["get_embedding_provider<br/>(hash fallback offline · OpenAI when keyed)"]
        EMBED --> STORE["VectorStore<br/>(InMemory offline · pgvector when DB set)"]
    end

    subgraph "Query / Generation"
        Q["Query"] --> RET["VectorSearchEngine<br/>cosine top-k"]
        STORE --> RET
        RET --> GEN["RAGAnswerGenerator<br/>(extractive offline · LLM when keyed)"]
        GEN --> ANS["Answer + citations + telemetry"]
    end

    subgraph "Evaluation"
        GOLD["Golden questions (.jsonl)"] --> RUN["GoldenRunner"]
        RET --> RUN
        ANS --> RUN
        RUN --> J1["RetrievalHitRateJudge"]
        RUN --> J2["AnswerGroundednessJudge"]
        RUN --> J3["CitationCoverageJudge"]
        J1 & J2 & J3 --> AGG["EvalRun metrics + llmmetrics cost/latency"]
        AGG --> CMP["compare_strategies (A/B diff)"]
        AGG --> GATE["CI regression gate"]
    end

    subgraph "Persistence & API"
        AGG --> EVALSTORE["EvalStore<br/>(InMemory offline · SQLAlchemy when DB set)"]
        API["FastAPI:<br/>/ingest /query /eval/run<br/>/eval/results /eval/compare"] --> RUN
        API --> EVALSTORE
        WORKER["Celery worker<br/>run_evaluation_task"] --> RUN
    end

Tech Stack

| Component | Choice | Rationale |
|---|---|---|
| API Framework | FastAPI + Uvicorn | Async-ready, auto OpenAPI, Pydantic validation |
| Retrieval | shared_core.vectorstore + shared_core.embeddings | One interface for in-memory (offline) and pgvector (prod) |
| Chunking | shared_core.docparse.chunk_text | Fixed / semantic / structural strategies, shared across the workspace |
| Judges | shared_core.evaljudge | Uniform JudgeResult; reuses CitationJudge/SemanticMatchJudge |
| Cost / latency | shared_core.llmmetrics | Token, USD, and p50/p95/p99 latency aggregation |
| Vector DB | PostgreSQL 16 + pgvector | Embeddings beside relational metadata (optional) |
| Broker / Queue | Redis 7 + Celery 5.3 | Async evaluation runs |
| Persistence | SQLAlchemy 2.0 + Alembic | eval_runs table; migrations in alembic/ |
| Config | pydantic-settings | BaseAppConfig → AppConfig |
| Lint / Test | ruff + pytest | E,W,F,I,C,B; 88-char; FastAPI TestClient |

Local Setup

cd rag-evaluation-lab

# Offline-first: no DB, no keys, no network needed.
python -m venv .venv && source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install -e ../shared-core[docparse] numpy
pip install -e ".[dev]"

# (optional) real persistence + providers
docker compose up -d          # PostgreSQL (pgvector) + Redis
pip install -e ".[db,llm]"
cp .env.example .env          # set DATABASE_URL / OPENAI_API_KEY when you want them
alembic upgrade head          # provision the eval_runs table

# run the API
make dev                      # uvicorn on 0.0.0.0:8000
curl http://localhost:8000/health

Demo

make demo            # python examples/run_demo.py — offline, deterministic, exits 0

The demo walks the entire flow: ingest the sample corpus, run an ad-hoc RAG query, evaluate all 10 golden questions (hit rate / groundedness / citation coverage), compare semantic vs fixed chunking, and run the CI regression gate against the shipped baseline. Sample output:

--- 3. Golden-question evaluation (semantic chunking) ---
# RAG Evaluation Run: semantic
- Avg hit rate: 0.400
- Avg groundedness: 0.807
- Avg citation coverage: 1.000

--- 4. Strategy comparison: semantic vs fixed ---
winner: semantic

--- 5. CI regression gate (vs saved baseline) ---
gate passed: True (exit_code=0)
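The evaluation and comparison steps in the demo can be sketched roughly as follows. This is an assumption-laden illustration, not the project's GoldenRunner or compare_strategies: it assumes hit rate is the fraction of expected sources found among the retrieved chunks, and it picks a winner by the sum of per-metric deltas, which may not match the real tie-breaking rule. All function names here are hypothetical.

```python
from statistics import mean


def hit_rate(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Fraction of expected sources that appear among the retrieved chunks."""
    if not expected_ids:
        return 0.0
    hits = sum(1 for expected in expected_ids if expected in retrieved_ids)
    return hits / len(expected_ids)


def average_metrics(rows: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over all golden questions in a run."""
    return {key: mean(row[key] for row in rows) for key in rows[0]}


def compare(
    a: dict[str, float],
    b: dict[str, float],
    names: tuple[str, str] = ("A", "B"),
) -> dict:
    """Diff two runs metric by metric; positive deltas favour the first run."""
    deltas = {key: a[key] - b[key] for key in a if key in b}
    total = sum(deltas.values())
    # Assumption: the winner is whichever run has the larger summed score.
    winner = names[0] if total > 0 else names[1] if total < 0 else "tie"
    return {"deltas": deltas, "winner": winner}
```

Reporting the per-metric deltas alongside the winner matters because one strategy can improve retrieval while lowering groundedness, and a single label would hide that trade-off.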

Tests

make test            # pytest — 117 tests, all offline (no DB / Redis / keys / network)

Coverage spans every core module: chunking and ingestion (test_ingestion.py), embedding-backed retrieval over the vector store (test_retrieval.py), each judge with golden-output regression cases (test_evals.py), the golden runner and strategy comparison (test_runner.py), the CI gate (test_gate.py), in-memory and SQLite-backed run persistence (test_store.py), the Celery worker task with no broker (test_worker.py), generation, including the mocked-LLM telemetry path (test_generation.py), reporting (test_reports.py), success and error paths for every API endpoint (test_api.py), the CLI gate (test_cli.py), and an end-to-end integration flow (test_core.py).

API Reference

| Method | Path | Description |
|---|---|---|
| POST | /ingest | Chunk, embed, and index documents into the retrieval engine |
| POST | /query | Retrieve top-k chunks and generate a grounded answer (use_llm / mocked_response optional) |
| POST | /eval/run | Run the golden questions through the full pipeline; persists by default |
| POST | /eval/run/async | Dispatch an evaluation run to the Celery worker (requires a broker) |
| GET | /eval/results | List persisted run summaries (newest first) |
| GET | /eval/results/{id} | Full payload of one persisted run (404 if unknown) |
| GET | /eval/compare | Compare the bundled goldens under two built-in chunk strategies |
| POST | /eval/compare | Compare two caller-supplied configs over supplied goldens/corpus |
| GET | /health | Service health; reports DB/Redis status (degraded + DB offline in offline mode) |

Everything a dashboard needs (run history, per-question judge breakdowns, deltas, cost/latency) is returned as JSON — no UI is bundled by design.

CLI

rag-eval-gate                    # run goldens over the sample corpus, gate vs baseline (exit 1 on regression)
rag-eval-gate --strategy fixed   # evaluate a different chunking strategy
rag-eval-gate --update-baseline  # overwrite datasets/baseline_metrics.json with this run
rag-eval-gate --json             # machine-readable output for CI
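The idea behind the gate (compare a run to a saved baseline, exit non-zero on a regression) can be sketched as below. This is not the rag-eval-gate implementation: `load_baseline`, `find_regressions`, `exit_code`, and the `tolerance` parameter are hypothetical, and the sketch assumes the baseline file is a flat JSON object of metric name to score.

```python
import json
from pathlib import Path


def load_baseline(path: str) -> dict[str, float]:
    """Read a baseline file, assumed to be a flat JSON object of metric scores."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, dict]:
    """Return every baseline metric that is missing or has dropped too far."""
    regressions = {}
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None or value < base_value - tolerance:
            regressions[name] = {"baseline": base_value, "current": value}
    return regressions


def exit_code(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.0,
) -> int:
    """Map the comparison onto a process exit code: 0 passes, 1 fails CI."""
    return 1 if find_regressions(current, baseline, tolerance) else 0
```

A CI step would typically pass the result to `sys.exit(...)` so the job fails the same way a failing test does. Treating a missing metric as a regression is a deliberate choice in this sketch: silently dropping a metric should not let a run pass.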

Configuration

| Variable | Default | Description |
|---|---|---|
| APP_NAME | rag-evaluation-lab | Service identifier in logs / health |
| LOG_LEVEL | INFO | Loguru level |
| DATABASE_URL | postgresql+psycopg://... | Postgres (pgvector); probed with a 2s timeout, falls back to in-memory |
| REDIS_URL | redis://localhost:6379/0 | Celery broker + result backend |
| OPENAI_API_KEY | | Enables real OpenAI embeddings + generation when set |
| ANTHROPIC_API_KEY | | Enables real Anthropic generation when set |

Known Limitations

  1. Offline embeddings are deterministic, not semantic. HashFallbackProvider produces stable but near-orthogonal vectors, so offline retrieval ranks on lexical signal rather than meaning. With OPENAI_API_KEY set, the pipeline uses real embeddings and hit rate rises accordingly. The offline hit rate over the bundled corpus (~0.40) is the honest, reproducible baseline, not a ceiling.
  2. Groundedness is a lexical heuristic. calculate_answer_groundedness counts answer/source word overlap (short words always count). It is a cheap proxy, kept as a stable golden baseline; an LLM-as-judge groundedness scorer is on the roadmap.
  3. Async runs need a broker. /eval/run/async requires Redis; the synchronous /eval/run and the worker's importable task helper work with no broker.
  4. No frontend. By design — the API exposes all data a dashboard would render.
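To make the second limitation concrete, here is a hedged sketch of a word-overlap groundedness score. It is not the project's calculate_answer_groundedness: the tokenizer, the `short_len` cutoff of 3 characters, and the function names are assumptions used to show why short words inflate the score.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_groundedness(
    answer: str, sources: list[str], short_len: int = 3
) -> float:
    """Share of answer words that are short or appear in any source chunk."""
    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    source_vocab = {word for source in sources for word in tokenize(source)}
    grounded = sum(
        1
        for word in answer_words
        # Assumption: words of short_len characters or fewer always count.
        if len(word) <= short_len or word in source_vocab
    )
    return grounded / len(answer_words)
```

Under these assumptions an answer made only of short words, such as "it is on", scores 1.0 even with no sources at all, which shows why this kind of heuristic is a cheap proxy and not a faithfulness guarantee.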

Roadmap

See docs/roadmap.md and docs/EXECUTION_PLAN.md. Highlights: LLM-as-judge groundedness, hybrid (BM25 + vector) retrieval, MRR/NDCG metrics, per-provider cost dashboards, and export of eval runs to the broader portfolio.

License

MIT — see LICENSE.
