Data Quality Check Framework for Clinical and Analytical Data
pyCoreGage is a configuration-driven Python package for running domain-level data quality checks and consolidating findings into structured Excel reports with role-based feedback routing. It is the Python port of the R package rCoreGage.
- Why pyCoreGage
- Architecture: Two Layers
- Installation
- Quick Start
- Project Structure
- rule_registry.xlsx: Check Definitions
- How It Works
- Writing Check Scripts
- API Reference
- Console Output Reference
- Publishing to PyPI
Clinical data quality checking typically involves:
- Running the same checks across dozens of domains (AE, LB, CM, VS, …)
- Separating trial-specific rules from study-wide rules
- Routing findings to different roles (DM, MW, SDTM, ADaM) and tracking responses
- Carrying reviewer notes forward across repeated runs without losing history
| Problem | pyCoreGage solution |
|---|---|
| Engine scattered across trials | Engine installed once via pip install pyCoreGage |
| Hard-coded paths | Single project_config.py per project |
| No role separation in reports | Four separate report channels (DM / MW / SDTM / ADAM) |
| Feedback lost between runs | Structured feedback folders merged on every re-run |
| R-only tooling | Pure Python; works anywhere Python runs |
LAYER 1: pyCoreGage PACKAGE (pip install once)

  pyCoreGage/
    setup.py       setup_coregage()     reads rule_registry
    runner.py      run_checks()         loops and executes scripts
    reporter.py    build_reports()      merges feedback + xlsx
    collector.py   collect_findings()   appends findings to state
    counter.py     count_valid()        counts observations
    project.py     create_project()     scaffolds new project
    utils.py       load_inputs()        reads domain data files
    state.py       CoreGageState        mutable run state
    _cli.py        pycoregage CLI       command-line entry point

  pyCoreGage/data/rule_registry.xlsx    blank registry template
  pyCoreGage/templates/                 project file templates

        │
        │  create_project("TRIAL_ABC")
        ▼

LAYER 2: USER PROJECT (one folder per trial/study)

  TRIAL_ABC/
    run_coregage.py          driver → python run_coregage.py
    rules/
      config/
        rule_registry.xlsx   check definitions (user fills)
        project_config.py    all paths in one place
      trial/                 trial-level check scripts
        AE.py LB.py CM.py    check_AE(state, cfg) …
      study/                 study-level check scripts
        AE_study.py …        check_AE_study(state, cfg) …
    inputs/                  drop domain CSV / SAS7BDAT here
    outputs/
      reports/               Excel reports written here
      feedback/              reviewer feedback placed here
        DM/ MW/ SDTM/ ADAM/
The key separation: the engine (Layer 1) never changes between trials.
Only check scripts, rule_registry.xlsx, and inputs/ change per trial.
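Because every per-trial piece lives under one project folder, all paths can be derived from a single root. The following is a minimal sketch of that idea, not the contents of the generated project_config.py: the ProjectPaths class and its from_root helper are hypothetical names, and only the folder layout is taken from the diagram above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ProjectPaths:
    """Hypothetical stand-in for the paths held in project_config.py."""
    project_name: str
    rule_registry: Path
    trial_checks: Path
    study_checks: Path
    inputs: Path
    reports: Path
    feedback: Path

    @classmethod
    def from_root(cls, root):
        """Derive every path from one root, following the Layer 2 layout."""
        root = Path(root)
        return cls(
            project_name=root.name,
            rule_registry=root / "rules" / "config" / "rule_registry.xlsx",
            trial_checks=root / "rules" / "trial",
            study_checks=root / "rules" / "study",
            inputs=root / "inputs",
            reports=root / "outputs" / "reports",
            feedback=root / "outputs" / "feedback",
        )


paths = ProjectPaths.from_root("/my/projects/TRIAL_ABC")
print(paths.project_name)
print(paths.rule_registry)
```

With this shape, moving a project means changing one root value rather than seven separate paths.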
# Standard install
pip install pyCoreGage

# With optional .sas7bdat support
pip install "pyCoreGage[sas]"

# Development install from source
git clone https://github.com/ganeshbabunn/pyCoreGage
cd pyCoreGage
pip install -e ".[dev]"

| Package | Version | Purpose |
|---|---|---|
| pandas | >= 1.5.0 | Data manipulation in check scripts |
| openpyxl | >= 3.1.0 | Read/write Excel reports |
Optional:
| Package | Purpose |
|---|---|
| pyreadstat | Read .sas7bdat files from inputs/ |
- Python >= 3.9
- Works on Windows, macOS, Linux
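To see how an environment compares with these requirements before a first run, a small standard-library check can help. This is only a sketch: the helper names are hypothetical, and it reports what is installed rather than enforcing the minimum versions listed above.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)
REQUIRED = ["pandas", "openpyxl"]   # listed minimums: 1.5.0 and 3.1.0
OPTIONAL = ["pyreadstat"]           # only needed for .sas7bdat inputs


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report():
    """Collect the Python version check and the package versions found."""
    report = {"python_ok": sys.version_info >= MIN_PYTHON}
    for name in REQUIRED + OPTIONAL:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```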
# Step 1: Install (once)
#   pip install pyCoreGage

# Step 2: Create a new project (once per trial)
from pyCoreGage import create_project
create_project(
    name = "TRIAL_ABC",
    path = "/my/projects",
)

# Step 3: Fill in rules/config/rule_registry.xlsx
#   (Trial sheet + Study sheet; see Section 6)

# Step 4: Write check scripts in rules/trial/ and rules/study/
#   (copy check_template.py and implement your logic)

# Step 5: Drop domain data files into inputs/
#   AE.csv, LB.csv, CM.csv … (CSV or .sas7bdat)

# Step 6: Run
#   python run_coregage.py
#   or
#   pycoregage run

Or via CLI:

pycoregage create TRIAL_ABC --path /my/projects
cd /my/projects/TRIAL_ABC
# … fill registry, write checks, drop data …
pycoregage run

Expected console output:
=== pyCoreGage : Starting Run ===
Project : TRIAL_ABC
>> [setup] Starting CoreGage initialisation ...
Sheet 'Trial' rows : 5
Sheet 'Study' rows : 3
Active: 8 ON / 0 OFF
>> [setup] Initialisation complete.
AE.csv -> domains['ae'] (81 rows)
LB.csv -> domains['lb'] (1003 rows)
>>>>>>>>>>>>>>>>>>>> Executing: AE <<<<<<<<<<<<<<<<<<<<
>> [collector] Appending 5 finding(s) for: AECHK001
>> [collector] Appending 2 finding(s) for: AECHK002
>>>>>>>>>>>>>>>>>>>> Executing: LB <<<<<<<<<<<<<<<<<<<<
>> [collector] Appending 12 finding(s) for: LBCHK001
>> [runner] All checks executed. Total findings: 19
>> [reporter] Starting consolidation ...
-------------------------------------------------------
Feedback summary:
Notes : analyst notes: 0 | reviewer notes: 0
Status : open: 19 | queried: 0 | closed: 0
-------------------------------------------------------
Writing: DM_issues.xlsx
Writing: all_open.xlsx
=== pyCoreGage : Run Complete ===
>> Reports written to: /my/projects/TRIAL_ABC/outputs/reports
After create_project(), your folder contains:
TRIAL_ABC/
├── run_coregage.py              ← run this
├── .gitignore
├── rules/
│   ├── config/
│   │   ├── rule_registry.xlsx   ← fill in check definitions
│   │   └── project_config.py    ← edit paths if you move the folder
│   ├── trial/
│   │   └── check_template.py    ← copy and rename for each domain
│   └── study/
├── inputs/                      ← drop AE.csv, LB.csv … here
└── outputs/
    ├── reports/                 ← Excel reports appear here
    └── feedback/
        ├── DM/                  ← DM reviewer places updated files here
        ├── MW/
        ├── SDTM/
        └── ADAM/
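The folder skeleton in the tree above can be reproduced or verified with a few lines of pathlib. This sketch is not the package's create_project(): scaffold_dirs and missing_dirs are hypothetical helpers, and they cover only the directories, not the generated files.

```python
import tempfile
from pathlib import Path

# Folders taken from the tree above; files are left out of this sketch.
PROJECT_DIRS = [
    "rules/config",
    "rules/trial",
    "rules/study",
    "inputs",
    "outputs/reports",
    "outputs/feedback/DM",
    "outputs/feedback/MW",
    "outputs/feedback/SDTM",
    "outputs/feedback/ADAM",
]


def scaffold_dirs(name, path):
    """Create the folder skeleton under path/name and return the project root."""
    root = Path(path) / name
    for rel in PROJECT_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    return root


def missing_dirs(root):
    """List expected folders that are absent, e.g. after moving a project."""
    return [rel for rel in PROJECT_DIRS if not (Path(root) / rel).is_dir()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = scaffold_dirs("TRIAL_ABC", tmp)
        print(missing_dirs(root))
```

A check like missing_dirs can be useful after copying a project to another machine, since an absent feedback folder would otherwise only show up at report time.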
The registry has two sheets: Trial and Study.
| Column | Type | Description |
|---|---|---|
| Category | str | Domain category (e.g. "Adverse Events") |
| Subcategory | str | Sub-category (e.g. "Date Checks") |
| ID | str | Unique check identifier (e.g. AECHK001) |
| Active | Yes/No | Set to No to disable without deleting |
| DM_Report | Yes/No | Include in DM report |
| MW_Report | Yes/No | Include in MW report |
| SDTM_Report | Yes/No | Include in SDTM report |
| ADAM_Report | Yes/No | Include in ADaM report |
| Rule_Set | str | Script name without .py (e.g. AE) |
| Description | str | Human-readable check description |
| Notes | str | Implementation notes |
Rule_Set determines which script is loaded. If Rule_Set = "AE",
the engine loads rules/trial/AE.py and calls check_AE(state, cfg).
For rows on the Study sheet, it loads the script from rules/study/ instead.
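The mapping from registry rows to scripts and switches can be sketched with plain dictionaries. The rows below are made-up examples, the helper names are hypothetical, and treating the Active cell case-insensitively is an assumption rather than documented engine behaviour.

```python
from pathlib import Path

# Hypothetical registry rows, using column names from the table above.
TRIAL_ROWS = [
    {"ID": "AECHK001", "Active": "Yes", "Rule_Set": "AE"},
    {"ID": "AECHK002", "Active": "No", "Rule_Set": "AE"},
    {"ID": "LBCHK001", "Active": "Yes", "Rule_Set": "LB"},
]


def is_yes(value):
    """Interpret a Yes/No cell (assumed case-insensitive)."""
    return str(value).strip().lower() == "yes"


def build_active_rules(rows):
    """Map each check ID to True/False from its Active cell."""
    return {row["ID"]: is_yes(row["Active"]) for row in rows}


def active_rule_sets(rows):
    """Return Rule_Set names with at least one active check, in sheet order."""
    seen = []
    for row in rows:
        if is_yes(row["Active"]) and row["Rule_Set"] not in seen:
            seen.append(row["Rule_Set"])
    return seen


def resolve_script(rule_set, checks_dir):
    """Return the script path and function name implied by a Rule_Set value."""
    return Path(checks_dir) / f"{rule_set}.py", f"check_{rule_set}"


print(build_active_rules(TRIAL_ROWS))
print(active_rule_sets(TRIAL_ROWS))
print(resolve_script("AE", "rules/trial"))
```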
setup_coregage(cfg)
    │
    ▼  reads rule_registry.xlsx → CoreGageState
load_inputs(cfg)
    │
    ▼  reads inputs/*.csv → state.domains {"ae": df, "lb": df, …}
run_checks(cfg, state)
    │
    ├─ for each active Rule_Set in Trial sheet:
    │      import rules/trial/{Rule_Set}.py
    │      call check_{Rule_Set}(state, cfg)
    │          └─ collect_findings(state, df, id="AECHK001")
    │
    └─ for each active Rule_Set in Study sheet:
           import rules/study/{Rule_Set}.py
           call check_{Rule_Set}(state, cfg)
build_reports(cfg, state)
    │
    ├─ import previously saved all_open.xlsx + all_closed.xlsx
    ├─ read feedback from feedback/DM/, MW/, SDTM/, ADAM/
    ├─ merge: status tracking, auto-close, re-open
    └─ write: DM_issues, MW_issues, SDTM_issues, ADAM_issues,
              all_open, all_closed
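The import-and-call step in run_checks() can be illustrated with the standard importlib machinery. This is a sketch of the general technique, assuming a script is loaded by file path; it is not taken from the package's runner.py, and load_check_function and run_rule_set are hypothetical names.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_check_function(script_path):
    """Import a check script by file path and return its check_{stem} function."""
    script_path = Path(script_path)
    spec = importlib.util.spec_from_file_location(script_path.stem, script_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)          # runs the script's top-level code
    return getattr(module, f"check_{script_path.stem}")


def run_rule_set(script_path, state, cfg=None):
    """Call the script's check function and hand back the state it returns."""
    return load_check_function(script_path)(state, cfg)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "AE.py"
        script.write_text(
            "def check_AE(state, cfg):\n"
            "    state['executed'].append('AE')\n"
            "    return state\n"
        )
        print(run_rule_set(script, {"executed": []}))
```

Loading by path keeps check scripts out of the import system's package hierarchy, so a project folder does not need to be installed or added to sys.path.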
- New findings → status open
- Findings that disappear from data → auto-closed with tag [auto-closed - finding no longer present]
- Findings closed by reviewer → permanently closed
- Findings re-appearing after analyst closure → re-opened with tag [Was closed but re-appeared]
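The four status rules above can be written as a small merge over the previous run's statuses and the current run's findings. This is a sketch under stated assumptions: the closed_by field is a hypothetical way to distinguish reviewer, analyst and automatic closures, auto-closed findings are assumed to re-open like analyst closures, and the real reporter works on Excel files rather than dictionaries.

```python
AUTO_CLOSED_TAG = "[auto-closed - finding no longer present]"
REOPENED_TAG = "[Was closed but re-appeared]"


def merge_status(previous, current_keys):
    """Carry statuses forward from the previous run to the current findings.

    previous maps a finding key to {"status", "closed_by", "note"}.
    """
    merged = {}
    for key in current_keys:
        old = previous.get(key)
        if old is None:
            # Never seen before: a new finding starts as open.
            merged[key] = {"status": "open", "closed_by": None, "note": ""}
        elif old["status"] == "closed" and old["closed_by"] == "reviewer":
            # Reviewer closures are permanent, even if the finding persists.
            merged[key] = dict(old)
        elif old["status"] == "closed":
            # Analyst closure (and, by assumption, auto-closure) is undone.
            merged[key] = {"status": "open", "closed_by": None, "note": REOPENED_TAG}
        else:
            merged[key] = dict(old)
    for key, old in previous.items():
        if key in merged:
            continue
        if old["status"] == "closed":
            merged[key] = dict(old)
        else:
            # Present last run, absent now: close it automatically.
            merged[key] = {"status": "closed", "closed_by": "auto", "note": AUTO_CLOSED_TAG}
    return merged


previous_run = {
    "A": {"status": "open", "closed_by": None, "note": ""},
    "B": {"status": "closed", "closed_by": "analyst", "note": ""},
    "C": {"status": "closed", "closed_by": "reviewer", "note": ""},
}
for key, value in sorted(merge_status(previous_run, ["B", "C", "D"]).items()):
    print(key, value["status"], value["note"])
```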
# rules/trial/AE.py
import pandas as pd
from pyCoreGage import collect_findings

def check_AE(state, cfg):
    ae = state.domains.get("ae")
    if ae is None or ae.empty:
        return state  # nothing to check without AE data
    active_rules = state.active_rules

    # AECHK001: AE end date earlier than AE start date
    if active_rules.get("AECHK001"):
        sub = ae.copy()
        # Unparseable dates become NaT and are excluded below
        sub["st"] = pd.to_datetime(sub["AESTDTC"], errors="coerce")
        sub["en"] = pd.to_datetime(sub["AEENDTC"], errors="coerce")
        result = sub[sub["en"].notna() & sub["st"].notna() & (sub["en"] < sub["st"])].copy()
        result["subj_id"] = result["USUBJID"]
        result["vis_id"] = float("nan")
        result["description"] = (
            "End (" + result["en"].dt.strftime("%d%b%Y") +
            ") before start (" + result["st"].dt.strftime("%d%b%Y") +
            ") for: " + result["AETERM"]
        )
        state = collect_findings(
            state,
            result[["subj_id", "vis_id", "description"]],
            id="AECHK001",
        )
    return state

# rules/study/DM_study.py
import pandas as pd
from pyCoreGage import collect_findings

def check_DM_study(state, cfg):
    ae = state.domains.get("ae")
    dm = state.domains.get("dm")
    active_rules = state.active_rules

    # DMPRJ001: subjects with AE records but no DM record
    if active_rules.get("DMPRJ001") and ae is not None and dm is not None:
        ae_subjects = set(ae["USUBJID"].dropna())
        dm_subjects = set(dm["USUBJID"].dropna())
        missing = ae_subjects - dm_subjects
        if missing:
            result = pd.DataFrame({
                "subj_id": list(missing),
                "vis_id": [float("nan")] * len(missing),
                "description": [f"Subject {s} has AE but no DM record" for s in missing],
            })
            state = collect_findings(state, result, id="DMPRJ001")
    return state

- The file must be named {Rule_Set}.py, e.g. AE.py for Rule_Set = "AE"
- The function must be named check_{Rule_Set}(state, cfg), e.g. check_AE
- Always return state at the end
- Call collect_findings() once per check ID
- The findings DataFrame must have columns: subj_id, vis_id, description
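Several of these conventions can be checked statically before a run. The sketch below uses the standard ast module to confirm that a script defines check_{Rule_Set}(state, cfg) and ends with return state. It is a hypothetical linting helper, not something shipped with pyCoreGage, and it does not inspect the collect_findings() calls or the DataFrame columns.

```python
import ast
from pathlib import Path


def check_script_problems(script_path, source=None):
    """Return a list of convention problems found in one check script."""
    script_path = Path(script_path)
    if source is None:
        source = script_path.read_text()
    expected = f"check_{script_path.stem}"
    tree = ast.parse(source)
    funcs = [
        node for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name == expected
    ]
    if not funcs:
        return [f"{script_path.name}: no function named {expected}"]
    func = funcs[0]
    problems = []
    arg_names = [arg.arg for arg in func.args.args]
    if arg_names != ["state", "cfg"]:
        problems.append(f"{expected}: expected arguments (state, cfg), found {arg_names}")
    last = func.body[-1]
    returns_state = (
        isinstance(last, ast.Return)
        and isinstance(last.value, ast.Name)
        and last.value.id == "state"
    )
    if not returns_state:
        problems.append(f"{expected}: last statement should be 'return state'")
    return problems


good = "def check_AE(state, cfg):\n    return state\n"
print(check_script_problems("rules/trial/AE.py", good))
```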
setup_coregage(cfg)
Reads rule_registry.xlsx, builds the active-rules switch dict, and returns
a fresh CoreGageState.
load_inputs(cfg)
Reads all .csv (and optionally .sas7bdat) files from cfg.inputs.
Returns a dict keyed by lowercase filename stem: {"ae": df, "lb": df}.
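The keying rule (lowercase filename stem) is easy to see in a stripped-down loader. This sketch reads CSV files only and uses the standard csv module, so each domain is a list of dicts rather than a DataFrame; load_csv_inputs is a hypothetical name and the real load_inputs() may differ in detail.

```python
import csv
import tempfile
from pathlib import Path


def load_csv_inputs(inputs_dir):
    """Read every .csv in inputs_dir into a dict keyed by lowercase file stem."""
    domains = {}
    for path in sorted(Path(inputs_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            rows = list(csv.DictReader(handle))
        key = path.stem.lower()
        domains[key] = rows
        print(f"{path.name} -> domains['{key}'] ({len(rows)} rows)")
    return domains


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "AE.csv").write_text("USUBJID,AETERM\nS1,Headache\n")
        load_csv_inputs(tmp)
```

Check scripts then look up a domain with state.domains.get("ae"), which is why the file name, not the file contents, decides the key.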
run_checks(cfg, state)
Iterates over the active rule sets, dynamically imports each check script,
and calls check_{Rule_Set}(state, cfg).
collect_findings(state, df, id, desc_col="description", sobs=True, unblind_codes=None) → CoreGageState
Validates and appends a findings DataFrame to state.issues.
| Parameter | Type | Description |
|---|---|---|
| state | CoreGageState | Current run state |
| df | DataFrame | Findings with subj_id, vis_id, description |
| id | str | Check ID matching registry |
| desc_col | str | Alternate description column name |
| sobs | bool | Flag for subject-observation limiting |
| unblind_codes | list[str] | Topic codes for unblinding protection |
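The validate-then-append behaviour can be sketched without pandas. Here RunState is a hypothetical stand-in for CoreGageState, findings are plain dicts instead of DataFrame rows, and the sobs and unblind_codes parameters are left out because their exact behaviour is not described here.

```python
from dataclasses import dataclass, field

REQUIRED_COLUMNS = ("subj_id", "vis_id", "description")


@dataclass
class RunState:
    """Hypothetical stand-in for CoreGageState: just a list of issues."""
    issues: list = field(default_factory=list)


def collect(state, rows, id, desc_col="description"):
    """Validate finding rows, tag them with the check ID, append to state."""
    tagged = []
    for row in rows:
        row = dict(row)
        if desc_col != "description" and desc_col in row:
            row["description"] = row.pop(desc_col)   # normalise the column name
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"{id}: findings are missing columns {missing}")
        tagged.append({"id": id, **{col: row[col] for col in REQUIRED_COLUMNS}})
    print(f">> [collector] Appending {len(tagged)} finding(s) for: {id}")
    state.issues.extend(tagged)
    return state


state = collect(
    RunState(),
    [{"subj_id": "S1", "vis_id": None, "text": "End before start"}],
    id="AECHK001",
    desc_col="text",
)
print(state.issues)
```

Validating every row before extending state.issues means a malformed findings table raises without leaving a partial set of findings behind.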
count_valid()
Returns the row count, optionally excluding unblinding-risk rows.
build_reports(cfg, state)
Merges saved issues with reviewer feedback and writes six Excel reports to cfg.reports.
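Four of the six reports are role channels, selected by the Yes/No report columns in the registry. A minimal sketch of that routing, assuming issues carry their check ID and that flag cells are compared case-insensitively (route_by_role is a hypothetical helper, not the reporter's actual code):

```python
ROLE_FLAGS = {
    "DM": "DM_Report",
    "MW": "MW_Report",
    "SDTM": "SDTM_Report",
    "ADAM": "ADAM_Report",
}


def route_by_role(issues, registry_rows):
    """Group issues per role, using the report flags of each check ID."""
    flags_by_id = {row["ID"]: row for row in registry_rows}
    routed = {role: [] for role in ROLE_FLAGS}
    for issue in issues:
        row = flags_by_id.get(issue["id"], {})   # unknown IDs go to no channel
        for role, column in ROLE_FLAGS.items():
            if str(row.get(column, "No")).strip().lower() == "yes":
                routed[role].append(issue)
    return routed


registry = [
    {"ID": "AECHK001", "DM_Report": "Yes", "MW_Report": "No",
     "SDTM_Report": "Yes", "ADAM_Report": "No"},
]
issues = [{"id": "AECHK001", "subj_id": "S1"}, {"id": "UNKNOWN", "subj_id": "S2"}]
for role, items in route_by_role(issues, registry).items():
    print(role, [item["subj_id"] for item in items])
```

One finding can land in several channels, so each role's workbook is a filtered view of the same issue list rather than a partition of it.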
create_project(name, path)
Scaffolds a complete project folder. Returns the project root path.
from pyCoreGage import CoreGageConfig
cfg = CoreGageConfig(
    project_name  = "TRIAL_ABC",
    rule_registry = "/path/to/rules/config/rule_registry.xlsx",
    trial_checks  = "/path/to/rules/trial",
    study_checks  = "/path/to/rules/study",
    inputs        = "/path/to/inputs",
    reports       = "/path/to/outputs/reports",
    feedback      = "/path/to/outputs/feedback",
)

| Message | Meaning |
|---|---|
| Active: 8 ON / 0 OFF | 8 checks enabled, 0 disabled |
| AE.csv -> domains['ae'] (81 rows) | AE domain loaded with 81 rows |
| >> [collector] Appending 5 finding(s) for: AECHK001 | 5 findings collected |
| WARNING: Check script not found: AE.py -- skipping. | Script missing; check Rule_Set in registry |
| ERROR in rule set AE: … | Exception in check script; other checks continue |
| [auto-closed - finding no longer present] | Finding disappeared from data on re-run |
| [Was closed but re-appeared] | Previously closed finding is back in data |
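The WARNING and ERROR rows describe a runner that skips missing scripts and keeps going after a failing one. That pattern can be sketched as a loop with a try/except around each rule set. The run_all function is hypothetical; only the message wording is borrowed from the table above.

```python
def run_all(rule_sets, state):
    """Run each (name, function) pair; a missing or failing one is skipped.

    A function of None stands for a script that was not found on disk.
    """
    messages = []
    for name, func in rule_sets:
        if func is None:
            messages.append(f"WARNING: Check script not found: {name}.py -- skipping.")
            continue
        try:
            state = func(state, None)
        except Exception as exc:   # keep going after any check failure
            messages.append(f"ERROR in rule set {name}: {exc}")
    return state, messages


def check_AE(state, cfg):
    return state + ["AE"]


def check_LB(state, cfg):
    raise ValueError("bad units")


def check_CM(state, cfg):
    return state + ["CM"]


final_state, log = run_all(
    [("AE", check_AE), ("LB", check_LB), ("VS", None), ("CM", check_CM)],
    [],
)
print(final_state)
for line in log:
    print(line)
```

Isolating each rule set this way means one broken script costs only its own findings, at the price of needing to read the console output to notice the failure.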
GPL-3.0-or-later © Ganesh Babu G
pyCoreGage: Data Quality Check Framework for Clinical and Analytical Data.
https://github.com/ganeshbabunn/pyCoreGage