yuan557/DEMySTIFI_cohort

Clinical ETL Pipeline for Longitudinal Neurology Cohort

A multi-source clinical ETL pipeline for a longitudinal neurology cohort. Patients are modeled as Python objects with typed clinical sub-objects, validated through Pydantic schemas, and persisted via a SQLAlchemy ORM into a structured SQLite database.


Background

Managing clinical research data across a neurology cohort means dealing with multiple heterogeneous sources: hospital EMR exports, manual clinical data entry files, legacy archives, pharmaceutical infusion registries, and biospecimen banks, none of which share a common patient identifier or data format.

Clinical research data requirements often extend beyond what the hospital EMR captures. The pipeline therefore ingests from two parallel tracks: automated EMR exports for structured clinical records, and manual data entry files where research staff record fields outside the scope of the EMR.

This pipeline was built to consolidate those sources into a single, clean, queryable patient database. Consent status is validated before any data is touched, identity resolution across systems is handled via fuzzy matching, and every record carries full provenance back to its source file.


Pipeline

Centralized consent registry (encrypted)
    └── Step 1: Initialize cohort — consent filtering, Patient object creation
    └── Step 2: Load clinical data entry files (per-visit Excel from clinical research staff)
    └── Step 3: Load hospital EMR exports (proprietary XML schema)
    └── Step 4: Load legacy archive (pre-cutoff CSV data dump)
    └── Step 5: Load biospecimen registry
    └── Step 6: Apply DMT corrections (nurse log + pharmaceutical infusion registry)
            └── QC: Harmonize diagnoses, treatment names, date formats
                    └── Export → CSV files or SQLite database
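
The QC stage harmonizes diagnoses, treatment names, and date formats before export. The following is a minimal sketch of what such a step might look like, assuming a hand-maintained alias table and a short list of accepted date layouts. `TREATMENT_ALIASES`, `DATE_FORMATS`, `harmonize_treatment`, and `harmonize_date` are hypothetical names for illustration and are not taken from the repository.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical alias table: lowercase raw spelling -> canonical treatment name.
TREATMENT_ALIASES = {
    "rituximab": "rituximab",
    "rituxan": "rituximab",
    "rtx": "rituximab",
    "ocrelizumab": "ocrelizumab",
    "ocrevus": "ocrelizumab",
}

# Hypothetical date layouts seen across sources, tried in this order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def harmonize_treatment(raw: str) -> Optional[str]:
    """Return the canonical treatment name, or None if the spelling is unknown."""
    return TREATMENT_ALIASES.get(raw.strip().lower())


def harmonize_date(raw: str) -> Optional[date]:
    """Parse a date string in any accepted layout; return None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None for unknown values, rather than guessing, lets the caller log the record for manual review. The order of `DATE_FORMATS` matters when a string such as 03/04/2021 could be read more than one way, so a real pipeline would likely fix the layout per source file.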

Data Model

Each patient is represented as an aggregate root, a Patient object that owns typed collections of clinical sub-objects:

Patient
  ├── visits[]       — functional scores, diagnosis, clinical assessments
  ├── dmts[]         — disease-modifying therapy history
  ├── relapses[]     — relapse events and steroid use
  ├── mris[]         — MRI scan records
  ├── biomarkers[]   — lab results (MOG, AQP4, Oligoclonal bands)
  ├── biospecimens[] — collected biological samples
  ├── pmh[]          — past medical history
  └── fh[]           — family history

This structure was chosen over flat tables because longitudinal clinical data is inherently hierarchical: a patient has many visits, each visit may have many associated treatments, relapses, and functional assessments, and reasoning about a patient's full history is the primary analytical task.
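
As a rough illustration of the aggregate-root idea, the sketch below models a Patient that owns two of its typed collections using plain dataclasses. The field names (`visit_date`, `functional_score`, `onset_date`, `steroid_use`) and the `timeline` helper are assumptions made for this example; the repository's actual classes may differ.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Visit:
    visit_date: date
    functional_score: Optional[float] = None
    diagnosis: Optional[str] = None
    source: str = ""  # provenance: the file this record came from


@dataclass
class Relapse:
    onset_date: date
    steroid_use: bool = False
    source: str = ""  # provenance: the file this record came from


@dataclass
class Patient:
    patient_id: str
    # Each patient owns its sub-objects; nothing is shared between patients.
    visits: List[Visit] = field(default_factory=list)
    relapses: List[Relapse] = field(default_factory=list)

    def timeline(self) -> List[Visit]:
        """Return this patient's visits in chronological order."""
        return sorted(self.visits, key=lambda v: v.visit_date)
```

With this shape, a question about one patient's history is answered by walking that patient's own lists, with no joins across flat tables.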


Key Design Decisions

Consent-first filtering: Enrollment and consent status are maintained in a centralized encrypted registry. Patients are excluded from the pipeline at initialization if they lack a signed informed consent form or have withdrawn. No clinical data is loaded for non-consenting patients.
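
The consent gate can be pictured as a filter applied to the registry before any clinical loader runs. This is a minimal sketch under assumed names: `ConsentRecord`, its fields, and `eligible_ids` are hypothetical and do not come from the repository.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ConsentRecord:
    patient_id: str
    icf_signed: bool  # a signed informed consent form is on file
    withdrawn: bool   # the patient later withdrew consent


def eligible_ids(registry: Iterable[ConsentRecord]) -> List[str]:
    """Keep only patients with a signed form who have not withdrawn."""
    return [r.patient_id for r in registry if r.icf_signed and not r.withdrawn]
```

Later steps would then create Patient objects only for the returned IDs, so a non-consenting patient never has a record to attach data to.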

Fuzzy identity resolution: Patients appear under different identifiers across source systems. A fuzzy matching layer (rapidfuzz) combined with secondary ID verification resolves identities across systems that share no common key.
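
The sketch below shows the general shape of fuzzy matching gated by a secondary check. To stay dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer rather than rapidfuzz, and it assumes the secondary identifier is a birth date. The registry layout, the 0.85 threshold, and the `resolve` function are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple

# Hypothetical registry shape: study ID -> (name, birth date as ISO string).
Registry = Dict[str, Tuple[str, str]]


def _normalize(name: str) -> str:
    """Lowercase, drop commas, and sort tokens so 'Doe, Jane' equals 'jane doe'."""
    return " ".join(sorted(name.lower().replace(",", " ").split()))


def resolve(name: str, birth_date: str, registry: Registry,
            threshold: float = 0.85) -> Optional[str]:
    """Return the study ID of the best name match whose secondary key agrees."""
    best_id, best_score = None, threshold
    query = _normalize(name)
    for study_id, (reg_name, reg_dob) in registry.items():
        if reg_dob != birth_date:  # the secondary check must pass first
            continue
        score = SequenceMatcher(None, query, _normalize(reg_name)).ratio()
        if score >= best_score:
            best_id, best_score = study_id, score
    return best_id
```

Requiring the secondary key to agree before scoring keeps a high name similarity alone from linking two different people. The name and birth date are used only inside `resolve`; only the study ID is returned.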

Pydantic validation before DB insert: Each clinical sub-object is validated and type-coerced through a Pydantic schema before being written to the database. Invalid records are logged and skipped rather than crashing the pipeline.
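
A minimal sketch of the validate, log, and skip pattern follows. `VisitSchema`, its fields, and `validate_visits` are hypothetical names chosen for this example; only `BaseModel` and `ValidationError` are Pydantic's own.

```python
import logging
from datetime import date
from typing import Any, Dict, Iterable, List, Optional

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)


class VisitSchema(BaseModel):
    # Hypothetical fields for illustration only.
    patient_id: str
    visit_date: date                          # ISO date strings are coerced to date
    functional_score: Optional[float] = None  # numeric strings are coerced to float
    source: str


def validate_visits(rows: Iterable[Dict[str, Any]]) -> List[VisitSchema]:
    """Validate raw rows; log and skip the ones that fail."""
    valid = []
    for i, row in enumerate(rows):
        try:
            valid.append(VisitSchema(**row))
        except ValidationError as exc:
            logger.warning("Skipping row %d: %s", i, exc)
    return valid
```

Catching the error per row means one malformed spreadsheet cell costs a single record, and the log line preserves enough context to correct it at the source.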

Provenance on every record: Every Visit, DMT, MRI, Relapse, and other sub-object carries a source field recording which file it came from. This makes it possible to trace any record back to its origin for auditing or correction.

XML over FHIR: The hospital EMR exported data in a proprietary XML schema; FHIR had not yet been adopted when this pipeline was developed. utils/xml_fieldmap.py normalizes the EMR's legacy field names to the internal names used across all ETL modules.
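
Field-name normalization of this kind usually comes down to a lookup table applied while parsing. The sketch below assumes one flat XML element per record; the tags in `FIELD_MAP` and the `normalize_record` function are invented for illustration and are not the contents of utils/xml_fieldmap.py.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical mapping: legacy EMR tag -> internal field name.
FIELD_MAP = {
    "PatID": "patient_id",
    "EncDt": "visit_date",
    "DxText": "diagnosis",
}


def normalize_record(xml_text: str, source: str) -> Dict[str, str]:
    """Rename mapped child elements of one record and tag it with its source."""
    root = ET.fromstring(xml_text)
    record = {
        FIELD_MAP[child.tag]: (child.text or "").strip()
        for child in root
        if child.tag in FIELD_MAP  # unmapped tags are ignored
    }
    record["source"] = source  # provenance travels with the record
    return record
```

Keeping the mapping in one module means a renamed EMR field is a one-line change rather than an edit in every loader.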


Tech Stack

Purpose                   Library
Data manipulation         pandas
Encrypted file ingestion  msoffcrypto
Fuzzy patient matching    rapidfuzz
Schema validation         pydantic
Database ORM + schema     sqlalchemy
Database                  SQLite

Analytical Queries

Five pre-built queries in db/queries.py:

  • Functional score progression: disability scores over time per patient
  • Diagnosis breakdown: patient count by neurological subtype
  • Treatment switching: therapy changes with time-to-switch
  • Relapse rate by treatment: relapses during each active therapy
  • Biospecimen coverage: patients with both clinical and biospecimen data
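
To give a sense of what one of these queries might look like, the sketch below counts relapses that fall inside each therapy's active window. The table and column names (`dmts`, `relapses`, `start_date`, `stop_date`, `onset_date`) are assumptions, dates are assumed to be stored as ISO strings, and the query is not copied from db/queries.py.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema: dmts(patient_id, treatment, start_date, stop_date)
# and relapses(id, patient_id, onset_date), with ISO-formatted date strings.
RELAPSES_BY_TREATMENT = """
SELECT d.treatment, COUNT(r.id) AS relapse_count
FROM dmts AS d
LEFT JOIN relapses AS r
       ON r.patient_id = d.patient_id
      AND r.onset_date >= d.start_date
      AND (d.stop_date IS NULL OR r.onset_date < d.stop_date)
GROUP BY d.treatment
ORDER BY d.treatment
"""


def relapses_by_treatment(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Return (treatment, relapse_count) rows, one per treatment name."""
    return conn.execute(RELAPSES_BY_TREATMENT).fetchall()
```

The LEFT JOIN keeps therapies with zero relapses in the output, and a NULL `stop_date` is treated as an ongoing therapy. Turning these counts into a rate would additionally require dividing by time on treatment.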

Setup

pip install -r requirements.txt
cp env.example .env   # fill in your paths and credentials
python main.py        # run the full pipeline and export to CSV

To load into SQLite instead of CSV:

from db.load_to_db import load_to_db
from main import main

# Run the pipeline and keep the returned patient list
p_list = main()
# Write the patients to a SQLite file
load_to_db(p_list, db_path="db/cohort.db")

Privacy

This repository contains the pipeline scaffold only. No patient data, clinical records, or credentials are included. All paths and file names are configured via environment variables (see env.example). PII fields are read transiently during identity matching and are never stored on the patient model or written to any output.
