CNV Pipeline — MMRF CoMMpass (GDC)

A reproducible Jupyter notebook pipeline for processing and analyzing Copy Number Variation (CNV) segments from the MMRF CoMMpass cohort, as distributed by the NCI Genomic Data Commons (GDC).

This repository is designed for transparent, auditable, and publication-oriented CNV analysis. The workflow covers data acquisition, metadata harmonization, QC, CNV classification, recurrence analyses, clinical integration, survival modeling, exploratory clustering, and optional predictive modeling.

What the notebook does

The main notebook implements the following stages:

Environment setup and dependency loading
Run configuration and provenance capture
GDC file discovery and metadata harmonization
Download with integrity verification
CNV loading, normalization, and QC
Classification of CNV states from Segment_Mean
High-confidence somatic CNV filtering
Recurrence analysis
- exact SegmentID (chr:start-end)
- cytoband overlap
- fixed 1 Mb genomic bins
Breakpoint and hotspot feature engineering
Clinical and follow-up integration
Association and survival analyses
- Mann–Whitney tests
- Kaplan–Meier curves
- Cox proportional hazards models
Optional clustering
Optional predictive modeling
Export of publication-ready tables, figures, logs, inventories, and backup archives
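
The classification stage maps each segment's Segment_Mean to a discrete CNV state. A minimal sketch of threshold-based classification, assuming Segment_Mean is a log2 copy ratio and using symmetric cutoffs of ±0.2 (gain/loss) and ±1.0 (amplification/deep deletion). The cutoffs, the label names, and the function name are illustrative assumptions, not the values used in the notebook:

```python
import math

# Hypothetical cutoffs on the log2 copy ratio; the notebook may use different values.
GAIN_THRESHOLD = 0.2
LOSS_THRESHOLD = -0.2
AMP_THRESHOLD = 1.0
DEEP_DEL_THRESHOLD = -1.0


def classify_cnv(segment_mean):
    """Map a Segment_Mean value to a discrete CNV state label."""
    if segment_mean is None or math.isnan(segment_mean):
        return "NA"  # keep missing values explicit instead of calling them neutral
    if segment_mean >= AMP_THRESHOLD:
        return "amplification"
    if segment_mean >= GAIN_THRESHOLD:
        return "gain"
    if segment_mean <= DEEP_DEL_THRESHOLD:
        return "deep_deletion"
    if segment_mean <= LOSS_THRESHOLD:
        return "loss"
    return "neutral"


states = [classify_cnv(v) for v in (1.3, 0.35, 0.05, -0.4, -1.6, float("nan"))]
print(states)
```

Checking the extreme thresholds first keeps the branches mutually exclusive, so each segment receives exactly one label.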

Repository contents

.
├── CNV_MMRF_COMMPASS_V17_Rodado.ipynb
├── README.md
├── requirements.txt
├── LICENSE
├── CITATION.cff
└── docs/
    ├── PIPELINE_OVERVIEW.md
    └── NOTEBOOK_MAP.md

Inputs

Typical inputs include:

open-access CNV segment files from the GDC MMRF-COMMPASS project
clinical.tsv
follow_up.tsv
optional tables such as:
- family_history.tsv
- exposure.tsv
- pathology_detail.tsv
hg38 cytoband annotation table (cytoBand_hg38.tsv) for cytoband recurrence summaries

Some MMRF CoMMpass resources may be controlled-access depending on source and release. Ensure your usage complies with the terms of the originating data source.
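
File discovery can be restricted to open-access files at query time. The sketch below only builds a filter payload for the GDC /files endpoint and makes no network call. The project ID comes from this README; the data_type string, the requested fields, and the helper name are assumptions and may differ from what the notebook uses:

```python
import json

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_type, size=1000):
    """Build a JSON-serializable payload for a POST to the GDC /files endpoint."""

    def is_in(field, values):
        # GDC filters are nested {"op": ..., "content": ...} objects.
        return {"op": "in", "content": {"field": field, "value": values}}

    return {
        "filters": {
            "op": "and",
            "content": [
                is_in("cases.project.project_id", [project_id]),
                is_in("data_type", [data_type]),  # assumed data_type value
                is_in("access", ["open"]),        # skip controlled-access files
            ],
        },
        # Assumed field selection: enough to link files to cases and verify downloads.
        "fields": ",".join(["file_id", "file_name", "md5sum", "cases.submitter_id"]),
        "format": "JSON",
        "size": size,
    }


payload = build_files_query("MMRF-COMMPASS", "Copy Number Segment")
print(json.dumps(payload["filters"], indent=2))
```

The payload can then be sent with any HTTP client, for example `requests.post(GDC_FILES_ENDPOINT, json=payload)`.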

Outputs

The notebook writes all run-specific artifacts under:

outputs/run_<RUN_ID>/
├── raw/
├── processed/
├── results/
└── logs/
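
Files written to raw/ are downloaded with integrity verification. A minimal sketch of such a check, assuming the expected checksum is the md5sum value reported in the GDC file metadata; the function names are hypothetical:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file on disk matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file that fails the check would normally be re-downloaded or excluded, and logged, so that it never enters the combined CNV table silently.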

Common outputs generated by the workflow include:

Processed data

processed/file_metadata.tsv
processed/combined_cnvs.txt
processed/cnv_segments.parquet
processed/combined_cnvs_classified_adjusted.tsv
processed/combined_cnvs_filtrado_somatica.txt

Recurrence summaries

results/cnv_recurrence_by_segmentid_exact__by_patient.tsv
results/cnv_recurrence_by_segmentid_exact__by_rows.tsv
results/cnv_recurrence_by_cytoband.tsv
results/cnv_recurrence_by_bins_1Mb.tsv
results/cnv_patient_bin_overlaps_1Mb.tsv
results/cnv_patient_cytoband_overlaps.tsv

Feature engineering

results/breakpoint_metrics_per_patient.tsv
results/hotspot_metrics_per_patient.tsv
results/breakpoints_by_chromosome.tsv
results/breakpoints_by_cytoband.tsv
results/breakpoints_top_bins_1Mb_top100.tsv
results/hot_bins_1Mb_top200.tsv

Clinical and survival

results/os_df_patient_level.tsv
results/survival_features_merged.tsv
results/cox_results_breakpoints_hotspots.tsv
results/top{TOP_K}_recurrent_regions_cox_table.tsv
results/top{TOP_K}_recurrent_segments_cox_table.tsv
results/top{TOP_K}_km_summary.tsv
results/top{TOP_K}_km_summary_segment_exact.tsv
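
The Kaplan–Meier summaries rest on the product-limit estimator: at each event time with d events among n patients at risk, the survival estimate is multiplied by (1 - d/n). A plain-Python sketch for illustration only; the README does not say which survival library the notebook uses, and the toy data below are made up:

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical follow-up times; event = 1 means death, 0 means censored.
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
for time, prob in kaplan_meier(durations, events):
    print(time, round(prob, 4))
```

Censored patients contribute to the risk set up to their last follow-up time but never trigger a drop in the curve.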

Supplementary / exploratory

results/supp_table_sex_descriptives_mean_sd_median_iqr.tsv
results/supp_table_sex_mannwhitney.tsv
results/supp_table_sex_combined.tsv
results/cluster_assignments.tsv
results/cluster_survival_logrank.txt

Run provenance and archival

logs/run_params.json
final text inventory of generated files
workspace backup zip
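
A minimal sketch of how the run folder, logs/run_params.json, and a file inventory could be produced. The timestamp-based RUN_ID, the inventory filename, and the example parameters are assumptions, not the notebook's actual conventions:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def init_run(base_dir, params):
    """Create outputs/run_<RUN_ID>/ with its subfolders and save the run parameters."""
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")  # assumed format
    run_dir = Path(base_dir) / "outputs" / f"run_{run_id}"
    for sub in ("raw", "processed", "results", "logs"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    record = {"run_id": run_id, **params}
    params_path = run_dir / "logs" / "run_params.json"
    params_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return run_dir


def write_inventory(run_dir, name="file_inventory.txt"):
    """Write a tab-separated list of files under run_dir with their sizes in bytes."""
    run_dir = Path(run_dir)
    files = sorted(p for p in run_dir.rglob("*") if p.is_file())
    lines = [f"{p.relative_to(run_dir).as_posix()}\t{p.stat().st_size}" for p in files]
    inventory = run_dir / "logs" / name  # hypothetical inventory filename
    inventory.write_text("\n".join(lines) + "\n")
    return inventory


# Example parameters are illustrative only.
run_dir = init_run(tempfile.mkdtemp(), {"top_k": 20, "bin_size_bp": 1_000_000})
print(write_inventory(run_dir).read_text())
```

Writing the parameters before any analysis runs means a failed run still leaves a record of how it was configured.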

Quickstart (local)

python -m venv .venv

# Linux / macOS
source .venv/bin/activate

# Windows PowerShell
# .venv\Scripts\Activate.ps1

pip install -r requirements.txt
pip install jupyterlab
jupyter lab

Then open the notebook and run it from top to bottom.

Quickstart (Google Colab)

Upload the notebook to Colab.
Run the first dependency cell.
If your clinical tables are stored in Google Drive, mount Drive and point the notebook to the corresponding .tsv files.
Execute the notebook sequentially from top to bottom.

Analysis principles

The pipeline follows a few explicit design choices:

Participant linkage through metadata joins, not filename guessing.
One canonical CNV dataframe for downstream analysis.
No synthetic fallback data inserted into the workflow.
No forced NA → 0 conversion in the survival block.
Multiple recurrence representations to balance strict comparability and biological stability.
Export of intermediate tables to support auditing and manuscript preparation.
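
The first principle, linking files to participants through metadata rather than filenames, can be sketched as follows. The column names (file_id, case_submitter_id), the function name, and the IDs in the example are hypothetical:

```python
def link_files_to_participants(segment_files, file_metadata):
    """Attach a participant ID to each file via its file_id; never parse filenames."""
    by_file_id = {row["file_id"]: row["case_submitter_id"] for row in file_metadata}
    linked, unlinked = [], []
    for file_id, file_name in segment_files:
        participant = by_file_id.get(file_id)
        if participant is None:
            unlinked.append(file_id)  # report it; do not guess from file_name
        else:
            linked.append(
                {"file_id": file_id, "file_name": file_name, "participant": participant}
            )
    return linked, unlinked


# Hypothetical metadata rows and downloaded files.
file_metadata = [
    {"file_id": "f-001", "case_submitter_id": "CASE_A"},
    {"file_id": "f-002", "case_submitter_id": "CASE_B"},
]
segment_files = [
    ("f-001", "a.seg.txt"),
    ("f-002", "b.seg.txt"),
    ("f-003", "CASE_C.seg.txt"),
]
linked, unlinked = link_files_to_participants(segment_files, file_metadata)
print(len(linked), unlinked)
```

The third file is left unlinked even though its name looks like a participant ID, which keeps every linkage decision traceable to the metadata table.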

Notes and limitations

Exact-breakpoint recurrence (SegmentID = chr:start-end) is useful for strict comparison, but it may underestimate biological recurrence because breakpoints rarely coincide exactly across patients.
Cytoband and fixed-bin recurrence are generally more stable for cohort-level interpretation.
Purity/ploidy adjustment may not be available when working from already-called public CNV segments.
Some downstream blocks are optional and should be interpreted as exploratory unless independently validated.
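
The difference between exact-breakpoint and fixed-bin recurrence can be seen on a toy example. This sketch assumes 1-based inclusive coordinates and counts each patient once per region; the patients, coordinates, and function names are made up for illustration:

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # fixed 1 Mb bins


def exact_recurrence(segments):
    """Count distinct patients per exact SegmentID (chr:start-end)."""
    patients = defaultdict(set)
    for patient, chrom, start, end in segments:
        patients[f"{chrom}:{start}-{end}"].add(patient)
    return {key: len(ids) for key, ids in patients.items()}


def bin_recurrence(segments, bin_size=BIN_SIZE):
    """Count distinct patients per fixed-size bin overlapped by any of their segments."""
    patients = defaultdict(set)
    for patient, chrom, start, end in segments:
        # Coordinates are treated as 1-based inclusive (an assumption about the input).
        for index in range((start - 1) // bin_size, (end - 1) // bin_size + 1):
            patients[(chrom, index)].add(patient)
    return {key: len(ids) for key, ids in patients.items()}


# Two hypothetical patients with nearly identical gains on chr1.
segments = [
    ("P1", "chr1", 1_200_001, 2_900_000),
    ("P2", "chr1", 1_250_001, 2_950_000),
]
print(exact_recurrence(segments))
print(bin_recurrence(segments))
```

Here each exact SegmentID is seen in only one patient, while both overlapped 1 Mb bins are counted in two patients.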

Citation

If you use this repository in academic work, cite the software record in CITATION.cff and describe the notebook version used in your Methods section.

License

MIT License.
