Clinical ETL Pipeline for Longitudinal Neurology Cohort

A multi-source clinical ETL pipeline for a longitudinal neurology cohort. Patients are modeled as Python objects with typed clinical sub-objects, validated through Pydantic schemas, and persisted through the SQLAlchemy ORM into a structured SQLite database.

Background

Managing clinical research data across a neurology cohort means dealing with multiple heterogeneous sources: hospital EMR exports, manual clinical data entry files, legacy archives, pharmaceutical infusion registries, and biospecimen banks, none of which share a common patient identifier or data format.

Clinical research data requirements often extend beyond what the hospital EMR captures. The pipeline therefore ingests from two parallel tracks: automated EMR exports for structured clinical records, and manual data entry files where research staff record fields outside the scope of the EMR.

This pipeline was built to consolidate those sources into a single, clean, queryable patient database. Consent status is validated before any clinical data is loaded, identity resolution across systems is handled via fuzzy matching, and every record carries full provenance back to its source file.

Pipeline

Centralized consent registry (encrypted)
    └── Step 1: Initialize cohort — consent filtering, Patient object creation
    └── Step 2: Load clinical data entry files (per-visit Excel from clinical research staff)
    └── Step 3: Load hospital EMR exports (proprietary XML schema)
    └── Step 4: Load legacy archive (pre-cutoff CSV data dump)
    └── Step 5: Load biospecimen registry
    └── Step 6: Apply DMT corrections (nurse log + pharmaceutical infusion registry)
            └── QC: Harmonize diagnoses, treatment names, date formats
                    └── Export → CSV files or SQLite database
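The QC stage harmonizes diagnoses, treatment names, and date formats. A minimal sketch of what that step could look like is shown here. The alias table `TREATMENT_ALIASES`, the format list `DATE_FORMATS`, and both function names are assumptions made for illustration and are not taken from the repository.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical alias table: lowercase variant -> canonical treatment name.
TREATMENT_ALIASES = {
    "rituximab": "rituximab",
    "rituxan": "rituximab",
    "rtx": "rituximab",
    "ocrelizumab": "ocrelizumab",
    "ocrevus": "ocrelizumab",
}

# Hypothetical date layouts seen across sources, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def harmonize_treatment(raw: str) -> Optional[str]:
    """Return the canonical treatment name, or None if the value is unknown."""
    return TREATMENT_ALIASES.get(raw.strip().lower())


def harmonize_date(raw: str) -> Optional[date]:
    """Parse a date string in any accepted layout; None if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None for unknown values, rather than raising, lets the caller decide whether to log, skip, or queue the record for manual review.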

Data Model

Each patient is represented as an aggregate root, a Patient object that owns typed collections of clinical sub-objects:

Patient
  ├── visits[]       — functional scores, diagnosis, clinical assessments
  ├── dmts[]         — disease-modifying therapy history
  ├── relapses[]     — relapse events and steroid use
  ├── mris[]         — MRI scan records
  ├── biomarkers[]   — lab results (MOG, AQP4, Oligoclonal bands)
  ├── biospecimens[] — collected biological samples
  ├── pmh[]          — past medical history
  └── fh[]           — family history

This structure was chosen over flat tables because longitudinal clinical data is inherently hierarchical: a patient has many visits, each visit may have several associated treatments, relapses, and functional assessments, and reasoning about a patient's full history is the primary analytical task.
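The aggregate-root idea can be sketched with plain dataclasses. This is an illustration only: the field names (`visit_date`, `functional_score`, `onset_date`, `steroids_given`) and the `latest_visit` helper are hypothetical, and the repository's actual classes may look different.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Visit:
    visit_date: date
    diagnosis: Optional[str] = None
    functional_score: Optional[float] = None


@dataclass
class Relapse:
    onset_date: date
    steroids_given: bool = False


@dataclass
class Patient:
    """Aggregate root: owns every clinical sub-object for one patient."""
    patient_id: str
    # default_factory gives each Patient its own list instead of a shared one.
    visits: List[Visit] = field(default_factory=list)
    relapses: List[Relapse] = field(default_factory=list)
    # dmts, mris, biomarkers, biospecimens, pmh, and fh would follow the same pattern.

    def latest_visit(self) -> Optional[Visit]:
        """Most recent visit by date, or None if the patient has no visits."""
        return max(self.visits, key=lambda v: v.visit_date, default=None)
```

Because every collection hangs off the patient, questions about one patient's history become method calls on a single object rather than joins across flat tables.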

Key Design Decisions

Consent-first filtering: Enrollment and consent status are maintained in a centralized encrypted registry. Patients are excluded from the pipeline at initialization if they lack a signed informed consent form or have withdrawn. No clinical data is loaded for non-consenting patients.
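A minimal sketch of the consent gate, assuming the registry has already been decrypted and parsed into rows. `ConsentRecord`, its fields, and `eligible_patient_ids` are hypothetical names; the real registry's columns are not shown in this README.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ConsentRecord:
    # Hypothetical registry row.
    patient_id: str
    icf_signed: bool   # informed consent form on file
    withdrawn: bool    # patient has withdrawn from the study


def eligible_patient_ids(registry: Iterable[ConsentRecord]) -> List[str]:
    """Keep only patients with a signed consent form who have not withdrawn."""
    return [r.patient_id for r in registry if r.icf_signed and not r.withdrawn]
```

Running this filter before any loader means later steps only ever see identifiers for consenting patients.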

Fuzzy identity resolution: Patients appear under different identifiers across source systems. A fuzzy matching layer (rapidfuzz) combined with secondary ID verification resolves identities across systems that share no common key.
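The matching logic can be sketched as follows. To keep the example dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz; in the pipeline a rapidfuzz scorer such as `fuzz.ratio` would play the role of `similarity`. The registry shape, the use of date of birth as the secondary check, and the threshold of 85 are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple

# Hypothetical registry shape: cohort ID -> (name, date of birth).
Registry = Dict[str, Tuple[str, str]]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resolve_identity(name: str, dob: str, registry: Registry,
                     threshold: float = 85.0) -> Optional[str]:
    """Return the cohort ID of the best fuzzy name match whose date of birth
    also matches exactly, or None if no candidate reaches the threshold."""
    best_id, best_score = None, threshold
    for cohort_id, (reg_name, reg_dob) in registry.items():
        if reg_dob != dob:  # secondary verification must pass first
            continue
        score = similarity(name, reg_name)
        if score >= best_score:
            best_id, best_score = cohort_id, score
    return best_id
```

Requiring the secondary field to match exactly keeps a high name score alone from linking two different people.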

Pydantic validation before DB insert: Each clinical sub-object is validated and type-coerced through a Pydantic schema before being written to the database. Invalid records are logged and skipped rather than crashing the pipeline.
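A sketch of the validate-then-skip pattern described above. `VisitSchema`, its fields, the 0 to 10 score range, and `validate_visits` are hypothetical; the repository's actual schemas may differ.

```python
import logging
from datetime import date
from typing import Any, Dict, Iterable, List, Optional

from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger(__name__)


class VisitSchema(BaseModel):
    # Hypothetical fields, for illustration only.
    patient_id: str
    visit_date: date  # ISO date strings are coerced to date objects
    functional_score: Optional[float] = Field(default=None, ge=0, le=10)
    source: str


def validate_visits(rows: Iterable[Dict[str, Any]]) -> List[VisitSchema]:
    """Return the rows that pass validation; log and skip the rest."""
    valid = []
    for i, row in enumerate(rows):
        try:
            valid.append(VisitSchema(**row))
        except ValidationError as exc:
            logger.warning("Skipping visit row %d: %s", i, exc)
    return valid
```

Catching `ValidationError` per row, rather than around the whole batch, is what lets one malformed record be dropped without losing the rest of the file.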

Provenance on every record: Every Visit, DMT, MRI, Relapse, and other sub-object carries a source field recording which file it came from. This makes it possible to trace any record back to its origin for auditing or correction.
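As an illustration of how the source field supports auditing, the sketch below groups records by originating file. `Record` is a hypothetical, simplified stand-in for the real sub-objects, and `records_by_source` is not a function from the repository.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Record:
    # Hypothetical minimal sub-object; `source` is the provenance field.
    patient_id: str
    kind: str    # e.g. "visit", "dmt", "mri"
    source: str  # name of the file the record was loaded from


def records_by_source(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Group records by originating file, e.g. to audit one faulty export."""
    grouped: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        grouped[rec.source].append(rec)
    return dict(grouped)
```

If a source file later turns out to be faulty, every record it produced can be found and corrected from this grouping.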

XML over FHIR: The hospital EMR exported data in a proprietary XML schema; FHIR had not yet been adopted when this pipeline was developed. utils/xml_fieldmap.py maps the EMR's legacy field names to the internal names used across all ETL modules.
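A field-map normalizer of this kind might look roughly like the sketch below. The tag names (`Encounter`, `PatMRN`, `EncDt`, `DxText`), the `FIELD_MAP` contents, and `parse_encounters` are invented for illustration; the proprietary schema and the actual contents of utils/xml_fieldmap.py are not shown in this README.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical mapping: EMR tag name -> internal field name.
FIELD_MAP = {
    "PatMRN": "patient_id",
    "EncDt": "visit_date",
    "DxText": "diagnosis",
}


def parse_encounters(xml_text: str) -> List[Dict[str, str]]:
    """Turn each <Encounter> element into a dict keyed by internal field names.
    Tags that are not in FIELD_MAP are ignored."""
    root = ET.fromstring(xml_text)
    rows = []
    for enc in root.iter("Encounter"):
        row = {}
        for child in enc:
            internal = FIELD_MAP.get(child.tag)
            if internal is not None:
                row[internal] = (child.text or "").strip()
        rows.append(row)
    return rows
```

Keeping the mapping in one dictionary means a renamed EMR field is a one-line change rather than an edit in every ETL module.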

Tech Stack

Purpose	Library
Data manipulation	pandas
Encrypted file ingestion	msoffcrypto
Fuzzy patient matching	rapidfuzz
Schema validation	pydantic
Database ORM + schema	sqlalchemy
Database	SQLite

Analytical Queries

Five pre-built queries in db/queries.py:

Functional score progression: disability scores over time per patient
Diagnosis breakdown: patient count by neurological subtype
Treatment switching: therapy changes with time-to-switch
Relapse rate by treatment: relapses during each active therapy
Biospecimen coverage: patients with both clinical and biospecimen data
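To give a sense of what such a query involves, here is a sketch of counting relapses during each active therapy, written against a hypothetical schema with the standard library's sqlite3 module. The table and column names are assumptions, and the queries in db/queries.py may be built differently (for example through the SQLAlchemy ORM). It returns raw counts; a true rate would also divide by time on treatment.

```python
import sqlite3

# Hypothetical schema: dmts(patient_id, treatment, start_date, end_date) and
# relapses(patient_id, onset_date), with dates stored as ISO-8601 strings so
# that string comparison matches chronological order.
RELAPSES_BY_TREATMENT = """
SELECT d.treatment,
       COUNT(DISTINCT d.patient_id) AS n_patients,
       COUNT(r.onset_date)          AS n_relapses
FROM dmts AS d
LEFT JOIN relapses AS r
       ON r.patient_id = d.patient_id
      AND r.onset_date >= d.start_date
      AND (d.end_date IS NULL OR r.onset_date < d.end_date)
GROUP BY d.treatment
ORDER BY d.treatment
"""


def relapses_by_treatment(conn: sqlite3.Connection):
    """Return (treatment, n_patients, n_relapses) rows, one per treatment."""
    return conn.execute(RELAPSES_BY_TREATMENT).fetchall()
```

The LEFT JOIN keeps treatments with no relapses in the result, and a NULL end date is treated as an ongoing therapy.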

Setup

pip install -r requirements.txt
cp env.example .env   # fill in your paths and credentials
python main.py        # run the full pipeline and export to CSV

To load into SQLite instead of CSV:

from db.load_to_db import load_to_db
from main import main

p_list = main()
load_to_db(p_list, db_path="db/cohort.db")

Privacy

This repository contains the pipeline scaffold only. No patient data, clinical records, or credentials are included. All paths and file names are configured via environment variables (see env.example). PII fields are read transiently during identity matching and are never stored on the patient model or written to any output.
