pyCoreGage


Data Quality Check Framework for Clinical and Analytical Data

pyCoreGage is a configuration-driven Python package for running domain-level data quality checks and consolidating findings into structured Excel reports with role-based feedback routing. It is the Python port of the R package rCoreGage.


Table of Contents

  1. Why pyCoreGage
  2. Architecture: Two Layers
  3. Installation
  4. Quick Start
  5. Project Structure
  6. rule_registry.xlsx: Check Definitions
  7. How It Works
  8. Writing Check Scripts
  9. API Reference
  10. Console Output Reference
  11. Publishing to PyPI

1. Why pyCoreGage

Clinical data quality checking typically involves:

  • Running the same checks across dozens of domains (AE, LB, CM, VS …)
  • Separating trial-specific rules from study-wide rules
  • Routing findings to different roles (DM, MW, SDTM, ADaM) and tracking responses
  • Carrying reviewer notes forward across repeated runs without losing history

| Problem | pyCoreGage solution |
|---|---|
| Engine scattered across trials | Engine installed once via pip install pyCoreGage |
| Hard-coded paths | Single project_config.py per project |
| No role separation in reports | Four separate report channels (DM / MW / SDTM / ADAM) |
| Feedback lost between runs | Structured feedback folders merged on every re-run |
| R-only tooling | Pure Python; works anywhere Python runs |

2. Architecture: Two Layers

┌──────────────────────────────────────────────────────────────────┐
│  LAYER 1 - pyCoreGage PACKAGE  (pip install once)                │
│                                                                  │
│  pyCoreGage/                                                     │
│    setup.py      setup_coregage()   reads rule_registry          │
│    runner.py     run_checks()       loops and executes scripts   │
│    reporter.py   build_reports()    merges feedback + xlsx       │
│    collector.py  collect_findings() appends findings to state    │
│    counter.py    count_valid()      counts observations          │
│    project.py    create_project()   scaffolds new project        │
│    utils.py      load_inputs()      reads domain data files      │
│    state.py      CoreGageState      mutable run state            │
│    _cli.py       pycoregage CLI     command-line entry point     │
│                                                                  │
│  pyCoreGage/data/rule_registry.xlsx   blank registry template    │
│  pyCoreGage/templates/                project file templates     │
└──────────────────────────────────────────────────────────────────┘
                        │
          create_project("TRIAL_ABC")
                        │
                        ▼
┌──────────────────────────────────────────────────────────────────┐
│  LAYER 2 - USER PROJECT  (one folder per trial/study)            │
│                                                                  │
│  TRIAL_ABC/                                                      │
│    run_coregage.py          driver: python run_coregage.py       │
│    rules/                                                        │
│      config/                                                     │
│        rule_registry.xlsx   check definitions (user fills)       │
│        project_config.py    all paths in one place               │
│      trial/                 trial-level check scripts            │
│        AE.py  LB.py  CM.py  check_AE(state, cfg) …               │
│      study/                 study-level check scripts            │
│        AE_study.py …        check_AE_study(state, cfg) …         │
│    inputs/                  drop domain CSV / SAS7BDAT here      │
│    outputs/                                                      │
│      reports/               Excel reports written here           │
│      feedback/              reviewer feedback placed here        │
│        DM/  MW/  SDTM/  ADAM/                                    │
└──────────────────────────────────────────────────────────────────┘

The key separation: the engine (Layer 1) never changes between trials. Only check scripts, rule_registry.xlsx, and inputs/ change per trial.


3. Installation

From PyPI (stable)

pip install pyCoreGage

With SAS file support

pip install "pyCoreGage[sas]"

Development install

git clone https://github.com/ganeshbabunn/pyCoreGage
cd pyCoreGage
pip install -e ".[dev]"

Dependencies

| Package | Version | Purpose |
|---|---|---|
| pandas | >= 1.5.0 | Data manipulation in check scripts |
| openpyxl | >= 3.1.0 | Read/write Excel reports |

Optional:

| Package | Purpose |
|---|---|
| pyreadstat | Read .sas7bdat files from inputs/ |

Requirements

  • Python >= 3.9
  • Works on Windows, macOS, Linux

4. Quick Start

# Step 1: Install (once)
# pip install pyCoreGage

# Step 2: Create a new project (once per trial)
from pyCoreGage import create_project

create_project(
    name="TRIAL_ABC",
    path="/my/projects",
)

# Step 3: Fill in rules/config/rule_registry.xlsx
#         (Trial sheet + Study sheet; see Section 6)

# Step 4: Write check scripts in rules/trial/ and rules/study/
#         (copy check_template.py and implement your logic)

# Step 5: Drop domain data files into inputs/
#         AE.csv, LB.csv, CM.csv …  (CSV or .sas7bdat)

# Step 6: Run
# python run_coregage.py
# or
# pycoregage run

Or via CLI:

pycoregage create TRIAL_ABC --path /my/projects
cd /my/projects/TRIAL_ABC
# … fill registry, write checks, drop data …
pycoregage run

Expected console output:

=== pyCoreGage : Starting Run ===
  Project : TRIAL_ABC
>> [setup] Starting CoreGage initialisation ...
  Sheet 'Trial' rows    : 5
  Sheet 'Study' rows    : 3
  Active: 8 ON  /  0 OFF
>> [setup] Initialisation complete.
   AE.csv -> domains['ae']  (81 rows)
   LB.csv -> domains['lb']  (1003 rows)
>>>>>>>>>>>>>>>>>>>> Executing: AE <<<<<<<<<<<<<<<<<<<<
  >> [collector] Appending 5 finding(s) for: AECHK001
  >> [collector] Appending 2 finding(s) for: AECHK002
>>>>>>>>>>>>>>>>>>>> Executing: LB <<<<<<<<<<<<<<<<<<<<
  >> [collector] Appending 12 finding(s) for: LBCHK001
>> [runner] All checks executed. Total findings: 19
>> [reporter] Starting consolidation ...
  -------------------------------------------------------
  Feedback summary:
    Notes  : analyst notes: 0  |  reviewer notes: 0
    Status : open: 19  |  queried: 0  |  closed: 0
  -------------------------------------------------------
  Writing: DM_issues.xlsx
  Writing: all_open.xlsx
=== pyCoreGage : Run Complete ===
>> Reports written to: /my/projects/TRIAL_ABC/outputs/reports

5. Project Structure

After create_project(), your folder contains:

TRIAL_ABC/
├── run_coregage.py              ← run this
├── .gitignore
├── rules/
│   ├── config/
│   │   ├── rule_registry.xlsx   ← fill in check definitions
│   │   └── project_config.py    ← edit paths if you move the folder
│   ├── trial/
│   │   └── check_template.py    ← copy and rename for each domain
│   └── study/
├── inputs/                      ← drop AE.csv, LB.csv … here
└── outputs/
    ├── reports/                 ← Excel reports appear here
    └── feedback/
        ├── DM/                  ← DM reviewer places updated files here
        ├── MW/
        ├── SDTM/
        └── ADAM/

6. rule_registry.xlsx: Check Definitions

The registry has two sheets: Trial and Study.

| Column | Type | Description |
|---|---|---|
| Category | str | Domain category (e.g. "Adverse Events") |
| Subcategory | str | Sub-category (e.g. "Date Checks") |
| ID | str | Unique check identifier (e.g. AECHK001) |
| Active | Yes/No | Set to No to disable without deleting |
| DM_Report | Yes/No | Include in DM report |
| MW_Report | Yes/No | Include in MW report |
| SDTM_Report | Yes/No | Include in SDTM report |
| ADAM_Report | Yes/No | Include in ADaM report |
| Rule_Set | str | Script name without .py (e.g. AE) |
| Description | str | Human-readable check description |
| Notes | str | Implementation notes |

Rule_Set determines which script is loaded. If Rule_Set = "AE", the engine imports rules/trial/AE.py and calls check_AE(state, cfg). For rows on the Study sheet, it imports the script from rules/study/ instead.
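The name-based lookup described above can be sketched with the standard library. This is an illustration of the convention only, not pyCoreGage's actual runner code; load_check_function is a hypothetical helper, and the demo writes a throwaway AE.py into a temporary folder.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_check_function(checks_dir, rule_set):
    """Import {rule_set}.py from checks_dir and return its check_{rule_set} function.

    Hypothetical helper: returns None when the script or the function is missing.
    """
    script = Path(checks_dir) / f"{rule_set}.py"
    if not script.is_file():
        return None  # the caller decides how to report a missing script
    spec = importlib.util.spec_from_file_location(f"checks_{rule_set}", script)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, f"check_{rule_set}", None)


# Demo: create a minimal AE.py and resolve it from the Rule_Set value "AE".
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "AE.py").write_text(
        "def check_AE(state, cfg):\n"
        "    state['visited'] = True\n"
        "    return state\n"
    )
    check = load_check_function(tmp, "AE")
    result = check({}, cfg=None)             # a plain dict stands in for the state
    missing = load_check_function(tmp, "LB")  # no LB.py exists, so this is None
```

Because both the file name and the function name are derived from the same Rule_Set value, a typo in the registry shows up as a missing script rather than as a silently skipped check.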


7. How It Works

setup_coregage(cfg)
       │
       ▼  reads rule_registry.xlsx → CoreGageState
load_inputs(cfg)
       │
       ▼  reads inputs/*.csv → state.domains{"ae": df, "lb": df, …}
run_checks(cfg, state)
       │
       ├─ for each active Rule_Set in Trial sheet:
       │       import rules/trial/{Rule_Set}.py
       │       call check_{Rule_Set}(state, cfg)
       │               └─ collect_findings(state, df, id="AECHK001")
       │
       └─ for each active Rule_Set in Study sheet:
               import rules/study/{Rule_Set}.py
               call check_{Rule_Set}(state, cfg)
build_reports(cfg, state)
       │
       ├─ import previously saved all_open.xlsx + all_closed.xlsx
       ├─ read feedback from feedback/DM/, MW/, SDTM/, ADAM/
       ├─ merge: status tracking, auto-close, re-open
       └─ write: DM_issues, MW_issues, SDTM_issues, ADAM_issues,
                 all_open, all_closed

Smart status management

  • New findings → status open
  • Findings that disappear from data → auto-closed with tag [auto-closed — finding no longer present]
  • Findings closed by reviewer → permanently closed
  • Findings re-appearing after analyst closure → re-opened with tag [Was closed but re-appeared]
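One way to read these four rules is as a merge between the previously saved issues and the findings of the current run. The sketch below is an interpretation, not the reporter's real implementation: the record layout, the closed_by field, and merge_status are all hypothetical, and it assumes that only closures not made by a reviewer can be re-opened.

```python
# Tag texts copied from the list above; \u2014 is the dash inside the auto-close tag.
AUTO_CLOSED_TAG = "[auto-closed \u2014 finding no longer present]"
REOPENED_TAG = "[Was closed but re-appeared]"


def merge_status(previous, current_keys):
    """Merge saved issue records with the keys of the findings seen in this run.

    previous: {key: {"status": ..., "closed_by": ..., "note": ...}} (hypothetical layout)
    """
    merged = {}
    for key in current_keys:
        old = previous.get(key)
        if old is None:
            # Rule 1: a finding never seen before starts as open.
            merged[key] = {"status": "open", "closed_by": None, "note": ""}
        elif old["status"] == "closed" and old["closed_by"] != "reviewer":
            # Rule 4: closed earlier (not by a reviewer) but present again.
            merged[key] = {"status": "open", "closed_by": None, "note": REOPENED_TAG}
        else:
            # Rule 3 and unchanged issues: keep the saved record as it is.
            merged[key] = dict(old)
    for key, old in previous.items():
        if key in merged:
            continue
        if old["status"] == "closed":
            merged[key] = dict(old)
        else:
            # Rule 2: the finding is gone from the data, so close it automatically.
            merged[key] = {"status": "closed", "closed_by": "auto", "note": AUTO_CLOSED_TAG}
    return merged


previous = {
    "A": {"status": "open", "closed_by": None, "note": ""},
    "B": {"status": "closed", "closed_by": "reviewer", "note": "accepted"},
    "C": {"status": "closed", "closed_by": "auto", "note": AUTO_CLOSED_TAG},
}
merged = merge_status(previous, ["B", "C", "D"])
```

In this run, A is auto-closed, B stays closed, C is re-opened, and D is new.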

8. Writing Check Scripts

Minimal example: date check (trial level)

# rules/trial/AE.py

import pandas as pd
from pyCoreGage import collect_findings


def check_AE(state, cfg):
    ae = state.domains.get("ae")
    if ae is None or ae.empty:
        return state

    active_rules = state.active_rules

    if active_rules.get("AECHK001"):
        sub = ae.copy()
        sub["st"] = pd.to_datetime(sub["AESTDTC"], errors="coerce")
        sub["en"] = pd.to_datetime(sub["AEENDTC"], errors="coerce")
        result = sub[sub["en"].notna() & sub["st"].notna() & (sub["en"] < sub["st"])].copy()

        result["subj_id"]     = result["USUBJID"]
        result["vis_id"]      = float("nan")
        result["description"] = (
            "End (" + result["en"].dt.strftime("%d%b%Y") +
            ") before start (" + result["st"].dt.strftime("%d%b%Y") +
            ") for: " + result["AETERM"]
        )
        state = collect_findings(
            state,
            result[["subj_id", "vis_id", "description"]],
            id="AECHK001",
        )

    return state

Cross-domain study-level check

# rules/study/DM_study.py

import pandas as pd
from pyCoreGage import collect_findings


def check_DM_study(state, cfg):
    ae = state.domains.get("ae")
    dm = state.domains.get("dm")
    active_rules = state.active_rules

    if active_rules.get("DMPRJ001") and ae is not None and dm is not None:
        ae_subjects = set(ae["USUBJID"].dropna())
        dm_subjects = set(dm["USUBJID"].dropna())
        missing = ae_subjects - dm_subjects

        if missing:
            result = pd.DataFrame({
                "subj_id":     list(missing),
                "vis_id":      [float("nan")] * len(missing),
                "description": [f"Subject {s} has AE but no DM record" for s in missing],
            })
            state = collect_findings(state, result, id="DMPRJ001")

    return state

Rules for check scripts

  1. The file must be named {Rule_Set}.py, e.g. AE.py for Rule_Set = "AE"
  2. The function must be named check_{Rule_Set}(state, cfg), e.g. check_AE
  3. Always return state at the end
  4. Call collect_findings() once per check ID
  5. The findings DataFrame must have columns: subj_id, vis_id, description
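Rules 1 to 3 can be checked statically before a run. The following is a rough sketch using the standard ast module; lint_check_script is a hypothetical helper that is not part of pyCoreGage, and it only inspects the last statement of the function, so early returns are not examined.

```python
import ast
from pathlib import Path


def lint_check_script(path, source=None):
    """Return a list of convention problems; an empty list means none were found."""
    path = Path(path)
    expected = f"check_{path.stem}"  # rules 1 and 2: names derive from the file stem
    tree = ast.parse(source if source is not None else path.read_text())
    funcs = [
        node for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name == expected
    ]
    if not funcs:
        return [f"{path.name}: no top-level function named {expected}"]
    func = funcs[0]
    problems = []
    params = [arg.arg for arg in func.args.args]
    if params != ["state", "cfg"]:
        problems.append(f"{expected}: parameters are {params}, expected ['state', 'cfg']")
    last = func.body[-1]
    returns_state = (
        isinstance(last, ast.Return)
        and isinstance(last.value, ast.Name)
        and last.value.id == "state"
    )
    if not returns_state:  # rule 3: always return state at the end
        problems.append(f"{expected}: last statement is not 'return state'")
    return problems


GOOD = "def check_AE(state, cfg):\n    return state\n"
BAD = "def check_AE(state):\n    pass\n"

ok = lint_check_script("AE.py", GOOD)          # conventions met
broken = lint_check_script("AE.py", BAD)       # wrong parameters, no return state
misnamed = lint_check_script("LB.py", GOOD)    # file stem and function name disagree
```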

9. API Reference

setup_coregage(cfg) → CoreGageState

Reads rule_registry.xlsx, builds the active-rules switch dict, returns a fresh CoreGageState.
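To illustrate what an active-rules switch dict might look like, here is a small sketch that works on registry rows already read into dictionaries. build_active_rules is a hypothetical stand-in; the real function reads the Excel file and may treat edge cases (blank rows, casing) differently.

```python
def build_active_rules(rows):
    """Map each check ID to True/False based on the registry's Active column."""
    active = {}
    for row in rows:
        check_id = str(row.get("ID", "")).strip()
        if not check_id:
            continue  # skip blank spreadsheet rows
        # Assumption: anything other than "Yes" (case-insensitive) counts as off.
        active[check_id] = str(row.get("Active", "")).strip().lower() == "yes"
    return active


rows = [
    {"ID": "AECHK001", "Active": "Yes", "Rule_Set": "AE"},
    {"ID": "AECHK002", "Active": "No", "Rule_Set": "AE"},
    {"ID": "LBCHK001", "Active": "yes ", "Rule_Set": "LB"},
]
active_rules = build_active_rules(rows)
on_count = sum(active_rules.values())
off_count = len(active_rules) - on_count
```

A dict of this shape is what makes the active_rules.get("AECHK001") guard in the check scripts work: a disabled or unknown ID evaluates as falsy and the check body is skipped.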

load_inputs(cfg) → dict[str, DataFrame]

Reads all .csv (and optionally .sas7bdat) files from cfg.inputs. Returns a dict keyed by lowercase filename stem: {"ae": df, "lb": df}.
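The keying rule (lowercase filename stem) can be shown with a standard-library sketch. It uses csv.DictReader where the package returns pandas DataFrames, so each value here is a list of row dicts; load_csv_inputs is a hypothetical name, and .sas7bdat handling is left out.

```python
import csv
import tempfile
from pathlib import Path


def load_csv_inputs(inputs_dir):
    """Read every .csv in inputs_dir into a dict keyed by lowercase file stem."""
    domains = {}
    for path in sorted(Path(inputs_dir).iterdir()):
        if path.suffix.lower() != ".csv":
            continue  # ignore anything that is not a CSV file
        with path.open(newline="") as handle:
            domains[path.stem.lower()] = list(csv.DictReader(handle))
    return domains


# Demo with a throwaway inputs folder containing one domain file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "AE.csv").write_text("USUBJID,AETERM\nS001,Headache\nS002,Nausea\n")
    (Path(tmp) / "notes.txt").write_text("ignored")
    domains = load_csv_inputs(tmp)
```

AE.csv ends up under the key "ae", which is why check scripts look up state.domains.get("ae") rather than "AE".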

run_checks(cfg, state) → CoreGageState

Iterates active rule sets, dynamically imports each check script, calls check_{Rule_Set}(state, cfg).

collect_findings(state, df, id, desc_col="description", sobs=True, unblind_codes=None) → CoreGageState

Validates and appends a findings DataFrame to state.issues.

| Parameter | Type | Description |
|---|---|---|
| state | CoreGageState | Current run state |
| df | DataFrame | Findings with subj_id, vis_id, description |
| id | str | Check ID matching registry |
| desc_col | str | Alternate description column name |
| sobs | bool | Flag for subject-observation limiting |
| unblind_codes | list[str] | Topic codes for unblinding protection |

count_valid(df, unblind_codes=None) → int

Returns row count, optionally excluding unblinding-risk rows.

build_reports(cfg, state) → None

Merges saved issues + feedback, writes six Excel reports to cfg.reports.
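The role-based routing can be pictured as filtering findings by the registry's DM_Report, MW_Report, SDTM_Report, and ADAM_Report flags. The sketch below assumes findings and registry rows are plain dictionaries; route_findings and that data layout are hypothetical and only illustrate the idea.

```python
ROLES = ("DM", "MW", "SDTM", "ADAM")


def route_findings(findings, registry):
    """Group findings per role using the registry's <ROLE>_Report Yes/No flags."""
    routed = {role: [] for role in ROLES}
    for finding in findings:
        flags = registry.get(finding["id"], {})
        for role in ROLES:
            if str(flags.get(f"{role}_Report", "")).strip().lower() == "yes":
                routed[role].append(finding)
    return routed


registry = {
    "AECHK001": {"DM_Report": "Yes", "MW_Report": "No", "SDTM_Report": "Yes", "ADAM_Report": "No"},
    "LBCHK001": {"DM_Report": "No", "MW_Report": "Yes", "SDTM_Report": "No", "ADAM_Report": "No"},
}
findings = [
    {"id": "AECHK001", "subj_id": "S001", "description": "End before start"},
    {"id": "LBCHK001", "subj_id": "S002", "description": "Missing unit"},
]
routed = route_findings(findings, registry)
```

A single finding can land in several role reports, since each flag is evaluated independently.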

create_project(name, path, overwrite=False) → str

Scaffolds a complete project folder. Returns the project root path.

CoreGageConfig

from pyCoreGage import CoreGageConfig

cfg = CoreGageConfig(
    project_name  = "TRIAL_ABC",
    rule_registry = "/path/to/rules/config/rule_registry.xlsx",
    trial_checks  = "/path/to/rules/trial",
    study_checks  = "/path/to/rules/study",
    inputs        = "/path/to/inputs",
    reports       = "/path/to/outputs/reports",
    feedback      = "/path/to/outputs/feedback",
)
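Since every path sits under one project root in the scaffolded layout, the keyword arguments can be derived from that root instead of being typed out. This is a sketch of that idea: config_paths is a hypothetical helper, and it assumes the default folder layout from Section 5.

```python
from pathlib import Path


def config_paths(project_root, project_name=None):
    """Build the CoreGageConfig keyword arguments from a single project root."""
    root = Path(project_root)
    return {
        "project_name": project_name or root.name,
        "rule_registry": str(root / "rules" / "config" / "rule_registry.xlsx"),
        "trial_checks": str(root / "rules" / "trial"),
        "study_checks": str(root / "rules" / "study"),
        "inputs": str(root / "inputs"),
        "reports": str(root / "outputs" / "reports"),
        "feedback": str(root / "outputs" / "feedback"),
    }


kwargs = config_paths("/my/projects/TRIAL_ABC")
# With pyCoreGage installed, these could be passed on as:
#   cfg = CoreGageConfig(**kwargs)
```

Moving the project folder then only requires changing the root passed to the helper.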

10. Console Output Reference

| Message | Meaning |
|---|---|
| Active: 8 ON / 0 OFF | 8 checks enabled, 0 disabled |
| AE.csv -> domains['ae'] (81 rows) | AE domain loaded with 81 rows |
| >> [collector] Appending 5 finding(s) for: AECHK001 | 5 findings collected |
| WARNING: Check script not found: AE.py -- skipping. | Script missing; check Rule_Set in registry |
| ERROR in rule set AE: … | Exception in check script; other checks continue |
| [auto-closed — finding no longer present] | Finding disappeared from data on re-run |
| [Was closed but re-appeared] | Previously closed finding is back in data |


License

GPL-3.0-or-later © Ganesh Babu G

Citation

pyCoreGage: Data Quality Check Framework for Clinical and Analytical Data.
https://github.com/ganeshbabunn/pyCoreGage

About

pyCoreGage provides a framework for executing data checks across clinical domains. By automating issue tracking for DM, SDTM, ADaM, or any other data role, it reduces manual oversight and data cleaning time. It is designed for large-scale trial data while adhering to data compliance requirements.
