CSRD-Lake

End-to-end CSRD/ESRS data pipeline reference implementation — Snowflake (validated) + DuckDB local + dbt star schema + Airflow orchestration + Claude/Mistral GenAI extraction with page-level audit lineage.

Same architectural pattern Capgemini's Sustainability Data Hub and PwC's ESG Reporting Manager deploy at French banks today — built solo as an open-source reference implementation.

Live dashboard: csrd-lake.vercel.app · Source: github.com/soneeee22000/csrd-lake · Project context: docs/PROJECT_CONTEXT.md

Live dashboard

Four pages, all server-rendered, all pre-built at deploy time:

Page	Live URL	Screenshot
Landing — CSRD context, live warehouse stats, six-stage architecture, honest "real vs stub" callout	csrd-lake.vercel.app	home.png
Companies — 3 of 10 manifest companies with extracted profiles + pending-ingestion footnote	/companies	—
Per-company ESG profile — metrics grouped by ESRS topic, confidence routing badges, source citations	/company/MC.PA · /company/SU.PA · /company/TTE.PA	company-lvmh.png
Portfolio rollup — synthetic 10-corporate loan-book over real Scope 1 emissions where extracted	/portfolio	portfolio.png

What this is

CSRD-Lake is a working reference implementation of the data architecture French banks (BNP Paribas, Société Générale, Crédit Agricole, BPCE) and the Big-4 practices supporting them ship for CSRD wave-1 reporting in 2026.

It ingests corporate sustainability PDFs from CAC 40 issuers, extracts ESRS metrics from a 19-entry catalog covering 5 topics (E1 climate / E2 pollution / E3 water / S1 workforce / G1 governance) using Claude Sonnet (with Mistral Large as fallback), validates each extraction against Pydantic schemas with confidence scoring, and lands the results in a Snowflake star schema (validated end-to-end) modelled with dbt — a DuckDB local target compiles the same models for laptop development with no cloud signup.

Every extracted metric carries:

A confidence score in [0.0, 1.0] — values below 0.80 route automatically to a human-review queue (mart_disclosure_review_queue) instead of the published mart.
A source citation — page number + verbatim snippet from the source PDF, enforced as a hard requirement of the extraction schema.
A language tag (fr or en in v1) verified against the actual PDF content.
A model tag (claude-sonnet-4-6 or mistral-large-latest) for per-model accuracy reporting.

What this is NOT

❌ Not a product replacing MSCI / Sustainalytics / Bloomberg ESG / Briink / any other ESG data vendor. Those vendors sell curated ratings, multi-thousand-issuer coverage, and analyst time. CSRD-Lake is a portfolio piece demonstrating the architectural pattern, not a competing product.
❌ Not a real bank loan book. The /portfolio rollup demo uses a clearly-labelled synthetic 10-corporate exposure map composed over real Scope 1 emissions where they have been extracted.
❌ Not a complete CSRD compliance solution. Single framework only (CSRD/ESRS); no TCFD/ISSB/Pillar 3 cross-walks in v1.
❌ Not production-grade authentication, RBAC, or SOC 2 controls. Out of scope for a portfolio piece.

Why this exists

In 2024 the EU passed the Corporate Sustainability Reporting Directive (CSRD). From wave 1 (FY2024 reports) onward, every large EU public company must publish detailed sustainability disclosures using the European Sustainability Reporting Standards (ESRS) — over 1,100 datapoints across climate, pollution, water, workforce, and governance. The reports come out as 300–700 page PDFs. Investors, banks, regulators, and corporate compliance teams need this data structured and queryable, not as PDFs.

This is the problem Big-4 ESG practices and integrators (Capgemini, Deloitte, PwC, KPMG, EY) are solving for French G-SIBs (BNP Paribas, Société Générale, Crédit Agricole, BPCE) right now under names like "Sustainability Data Hub" (Capgemini), "ESG Reporting Manager" (PwC), and "CSRD 360 Navigator". The pattern is consistent: ingest sustainability PDFs → extract structured ESRS metrics → land in a cloud warehouse → expose via dashboard / API for downstream reporting and analysis.

CSRD-Lake is a working reference implementation of that pattern, end-to-end, on real CAC 40 disclosures. For the full project context — the regulatory background, layer-by-layer architecture walkthrough, honest "real vs stub" status table, and the skills the codebase demonstrates — see docs/PROJECT_CONTEXT.md.

Architecture

flowchart TD
    A["CAC 40 / DAX 40<br/>IR pages"] -->|"httpx + tenacity<br/>idempotent + atomic"| B["Raw PDFs on disk"]
    B -->|"Airflow ingest_pdfs DAG<br/>mapped per company"| C["Pydantic ESRSMetric<br/>schemas"]
    C -->|"Claude Sonnet 4.6<br/>tool-use API"| D{"Extraction<br/>valid?"}
    D -->|"no"| F["Mistral Large<br/>fallback"]
    D -->|"yes"| E["Per-metric confidence<br/>+ source citation"]
    F --> E
    E -->|"warehouse loader<br/>parameterized executemany"| G[("Snowflake<br/>RAW.DISCLOSURE_EXTRACTED")]
    G -->|"dbt staging view<br/>dedupe on natural key"| H[("staging.stg_disclosure")]
    H -->|"dbt model<br/>joins to dimensions"| I[("marts.fact_disclosure")]
    I -->|"confidence &ge; 0.80"| K[("mart_disclosure_published<br/>dashboard backend")]
    I -->|"confidence &lt; 0.80"| J[("mart_disclosure_review_queue<br/>human review surface")]
    K --> L["Next.js 16 dashboard<br/>csrd-lake.vercel.app"]

    classDef published fill:#d1fae5,stroke:#065f46,color:#000
    classDef review fill:#fef3c7,stroke:#92400e,color:#000
    classDef warehouse fill:#dbeafe,stroke:#1e40af,color:#000
    class K published
    class J review
    class G,H,I warehouse

See docs/PRD.md for full requirements, edge-case handling, and quality gates.

Hero metrics

Every numeric claim below has explicit test conditions — verify with the linked commands.

Claim	Test condition	How to verify
168 tests passing at ~91% coverage	pytest with `--cov-fail-under=70`, branch coverage enabled, full suite	`make test`
7 layered modules ingestion / extraction / warehouse / orchestration / dbt staging / dbt marts / dbt reporting marts	One Python package per layer; one Airflow DAG composes them	`tree src/csrd_lake airflow dbt_project/models`
10 CAC 40 companies in manifest, 3 with extracted profiles, 19 ESRS metrics, 5 ESRS topics (E1, E2, E3, S1, G1)	Manifest + catalog mirrored between Python (`prompts.py`) and dbt seeds (`esrs_metrics_seed.csv`); structural test enforces parity	`pytest tests/test_dbt_project_structure.py -k seed`
32 ESRS metrics extracted, 14 published / 18 review-queued	Real Claude + Mistral extractions against LVMH URD 2023, TotalEnergies S&C 2024, Schneider Electric URD 2024 (first explicitly CSRD-compliant) — every value carries a page citation	`python scripts/export_dashboard_data.py --duckdb data/warehouse/csrd_lake.duckdb`
Snowflake currently sources the live dashboard snapshot	Bootstrap script runs DDL + loads metrics via key-pair auth (bypasses MFA), dbt builds `RAW_MARTS.*` against `ULWVOTP-MJ06399`. The committed `dashboard/lib/data/disclosures.json` was exported from Snowflake — the JSON's `warehouse` field records the engine that answered the query. DuckDB target produces a byte-identical export.	`python scripts/bootstrap_snowflake.py --truncate && python scripts/dbt_run.py deps seed run test --target dev && python scripts/export_dashboard_data.py --source snowflake`
3 custom dbt data-quality tests	metric-value-in-source / confidence-in-[0,1] / published-and-review-queue-disjoint	`dbt test --select test_type:data`
No "vendor displacement" claims anywhere in the repo	`/moat-check` killed the "replaces €40-80k MSCI" framing on 2026-04-30 because Briink ships PDF→ESRS extraction at €195/month and MSCI's value is curated ratings, not parsing. Anti-regression test enforces this.	`pytest tests/ -k killed_claims`
Conventional commits, zero AI co-author tags	All commits follow `feat(scope): subject` format; zero "Co-Authored-By: Claude" footers	`git log --format="%H %s"`
96% extraction accuracy on a hand-verified gold set	Pending Weekend 2 hand-verification of 800 datapoints (50 companies × 16 metrics) against published structured ESRS tables. Number will be updated with the actual measured accuracy + per-language + per-topic breakdown.	TBD — verification harness in v2

Tech stack

Layer	Choice	Why
Orchestration	Apache Airflow 2.10 (Docker Compose)	Modern data stack canonical; portable to MWAA / Astronomer / ADF
Extraction LLM	Claude Sonnet 4.6 primary + Mistral Large fallback	Claude's tool-use API for structured outputs; Mistral as cost / FR-language fallback
Schema validation	Pydantic v2	Type-safe LLM output validation at the boundary
Warehouse	Snowflake (live snapshot source, key-pair auth) and DuckDB (local-dev target)	Same dbt models, byte-identical exports — proves warehouse portability under one codebase
Transformations	dbt-core 1.11 with `dbt-snowflake` + `dbt-duckdb` + `dbt_utils`	Lineage + tests + docs; auto-extending `dim_metric` accepts real-world sub-disclosures
Dashboard	Next.js 16 + Tailwind v4 (live at csrd-lake.vercel.app)	Server Components, pre-rendered, design-token discipline
PDF parsing	`pdfplumber` + `pypdf`	Text extraction from sustainability PDFs
HTTP	`httpx` + `tenacity`	Retry-aware async-ready downloader
Tests	`pytest` + dbt tests (`not_null`, `unique`, `accepted_values`, `relationships`, custom data tests)	Pyramid + data-quality
CI	GitHub Actions (lint + mypy + pytest + dbt parse)	Quality gates enforced pre-merge

mindmap
  root((CSRD-Lake))
    Orchestration
      Apache Airflow 2.10
      Docker Compose
      TaskFlow API
      Dynamic task mapping
    Extraction
      Claude Sonnet 4.6
      Mistral Large
      Pydantic v2
      Structured outputs
    Ingestion
      httpx + tenacity
      pdfplumber + pypdf
      TOML manifest
      Magic-byte validation
    Warehouse
      Snowflake validated
      DuckDB local
      Star schema
      Parameterized executemany
      RSA key-pair auth
    Transformation
      dbt-core 1.11
      dbt-snowflake
      dbt-duckdb
      dbt_utils
      Auto-extending dim_metric
      Custom data tests
    Dashboard
      Next.js 16
      React 19
      Tailwind v4
      Server Components
      Static prerender
    Quality
      pytest 168 cases
      ruff + mypy strict
      91% coverage
      Anti-regression tests
    CI/CD
      GitHub Actions
      Vercel
      Conventional commits

Quickstart

Prerequisites: Python 3.12, uv, Anthropic + Mistral API keys. Snowflake credentials are optional — the local DuckDB target is the default for development.

# 1. Clone + install
git clone https://github.com/soneeee22000/csrd-lake.git
cd csrd-lake
make setup       # uv sync --all-extras

# 2. Configure secrets
cp .env.example .env
# edit .env — fill in ANTHROPIC_API_KEY and MISTRAL_API_KEY (Snowflake vars optional)

# 3. Sanity check (no external services needed)
make smoke       # pytest -m smoke -v
make ci          # full lint + mypy + test suite

# 4. Real-LLM extraction smoke test (~$0.10 per run)
#    A pre-downloaded sample is shipped at data/samples/.
make verify-llm PDF=data/samples/lvmh-2024.pdf TICKER=MC.PA TOPIC=E1
# → prints every extracted ESRSMetric with confidence, source citation, routing
# → on Windows without `make` / `uv` on PATH:
#    .venv/Scripts/python.exe -m csrd_lake.extraction.cli --pdf data/samples/lvmh-2024.pdf --ticker MC.PA --topic E1

# 5. Real-data batch — 3 PDFs × 5 ESRS topics → DuckDB warehouse (~$1.50, ~7 min)
#    PDFs are already on disk under data/raw/; the batch CLI runs the LLM
#    fallback chain, streams metrics into DuckDB, archives JSON per company.
make extract-batch    # uv run python -m csrd_lake.extraction.batch ...

# 6. Build the marts (DuckDB local target)
make dbt-build        # cd dbt_project && DBT_TARGET=duckdb uv run dbt deps && seed && run

# 7. Refresh the dashboard JSON snapshot from the warehouse
make export-dashboard # writes dashboard/lib/data/disclosures.json

# Or chain 5-7 in one shot:
make real-data-pipeline

# 8. Validate the same models against a real Snowflake account
#    (requires SNOWFLAKE_PRIVATE_KEY_PATH in .env; see scripts/generate_snowflake_keypair.py)
uv run python scripts/bootstrap_snowflake.py --truncate
uv run python scripts/dbt_run.py deps seed run test --target dev

# 9. Re-export the dashboard snapshot from Snowflake instead of DuckDB.
#    Output is byte-identical (same content, same ORDER BY); only the JSON's
#    `warehouse` field changes — Vercel still serves the static snapshot.
uv run python scripts/export_dashboard_data.py --source snowflake

Project layout

csrd-lake/
├── src/csrd_lake/
│   ├── extraction/          schemas.py · confidence.py · prompts.py · llm.py
│   ├── ingestion/           manifest.py · downloader.py · data/cac40.toml
│   └── warehouse/           ddl.sql · loader.py
├── airflow/dags/
│   └── csrd_lake.py         TaskFlow DAG with ingest / extract / load groups
├── dbt_project/
│   ├── models/staging/      stg_disclosure (view, dedupes on natural key)
│   ├── models/marts/        dim_company, dim_metric, dim_period, fact_disclosure,
│   │                        mart_disclosure_published, mart_disclosure_review_queue
│   ├── tests/               metric-value-in-source, confidence-in-[0,1], disjoint
│   └── seeds/               companies, ESRS metrics, periods
├── tests/                   168 pytest cases (mirrors src/, AST-tests for DAG + dbt + dashboard)
├── scripts/
│   ├── bootstrap_snowflake.py     Run DDL + load extracted metrics into Snowflake (key-pair auth)
│   ├── generate_snowflake_keypair.py  PKCS#8 keypair generator + ALTER USER SQL printer
│   ├── dbt_run.py                 dotenv-loading dbt wrapper (forwards args to dbt)
│   └── export_dashboard_data.py   Query DuckDB → write dashboard/lib/data/disclosures.json
├── dashboard/
│   ├── app/                 Next.js 16 routes — / (landing) · /companies · /company/[ticker] · /portfolio
│   ├── components/          Card / Badge / MetricTable / ConfidenceBadge primitives
│   └── lib/data/disclosures.json  Build-time JSON snapshot the dashboard imports
├── docs/
│   ├── PRD.md               Source of truth for requirements
│   ├── PROJECT_CONTEXT.md   Plain-language guide: why · what · how · honest scope · CV pointers
│   └── PORTABILITY.md       Snowflake↔Synapse, Airflow↔ADF, Claude↔Azure OpenAI
├── docker-compose.yml       Airflow 2.10 + Postgres metadata
├── pyproject.toml           Pinned deps (uv-managed)
└── Makefile                 setup · lint · test · smoke · ci · verify-llm · extract-batch · dbt-build · export-dashboard · real-data-pipeline · demo

Methodology + test conditions

The /moat-check gate (2026-04-30) killed the original "replaces €40-80k MSCI subscription" hero metric on three grounds:

MSCI's value isn't PDF parsing — they sell curated ratings + 11K-issuer coverage + 4,000 datapoints/issuer with analyst curation. CSRD-Lake replicates the cheapest part of MSCI's stack only.
Pricing was off by 5-10× — MSCI's SEC-filed Form ADV 2A pricing band is $5K-$2M/year. Banks at portfolio scale pay $200K-$500K, not €40-80K.
Briink (Berlin) already sells the exact PDF→ESRS extraction at €195/month with fine-tuned LLMs they claim are >30% more accurate than ChatGPT, plus SOC 2 audit trails.

The README, code comments, dbt SQL, and DAG file are anti-regression-tested for the killed phrasing. See tests/test_dag_structure.py::test_does_not_reintroduce_killed_claims and tests/test_dbt_project_structure.py::test_no_killed_claims_in_sql.

The current pitch language is "reference implementation of the Capgemini Sustainability Data Hub / PwC ESG Reporting Manager pattern" — verifiable, defensible, and aligned with the actual French G-SIB consulting market in 2026.

Portability

This pipeline is built on Snowflake + Airflow + Claude. The architectural pattern is portable:

Source choice	Maps to
Snowflake	Synapse Dedicated SQL Pools / Microsoft Fabric Warehouse / BigQuery / Databricks
Airflow	Azure Data Factory / Prefect / AWS Step Functions
Claude API	Azure OpenAI (private) / Bedrock Claude / Vertex AI Claude
dbt-snowflake	dbt-synapse / dbt-fabric / dbt-bigquery / dbt-databricks

See docs/PORTABILITY.md for the detailed mapping with bank-stack-specific notes.

Demo

The animated walkthrough at the top of this README shows home → per-company ESG profile → portfolio rollup. Static screenshots of each page are in dashboard/public/screenshots/.

A 90-second narrated Loom walkthrough showing the live Snowflake schema, dbt lineage graph, and a sample LLM extraction is planned for v1.1 (after the gold-set hand-verification adds defensible per-metric accuracy numbers).

License

MIT — see LICENSE.

Author

Built by Pyae Sone Kyaw (Seon) — Cloud Data Engineer, Data Science, AI. Available for freelance subcontract missions in Paris (auto-entrepreneur, SIRET registered).

LinkedIn: pyae-sone-kyaw
GitHub: @soneeee22000
Portfolio: pseonkyaw.dev

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CSRD-Lake

Live dashboard

What this is

What this is NOT

Why this exists

Architecture

Hero metrics

Tech stack

Quickstart

Project layout

Methodology + test conditions

Portability

Demo

License

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github/workflows		.github/workflows
airflow		airflow
dashboard		dashboard
dbt_project		dbt_project
docs		docs
scripts		scripts
src/csrd_lake		src/csrd_lake
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

CSRD-Lake

Live dashboard

What this is

What this is NOT

Why this exists

Architecture

Hero metrics

Tech stack

Quickstart

Project layout

Methodology + test conditions

Portability

Demo

License

Author

About

Topics

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages