A reproducible Jupyter notebook pipeline for processing and downstream analysis of Copy Number Variation (CNV) segments from the MMRF CoMMpass cohort via the NCI Genomic Data Commons (GDC).
This repository is designed for transparent, auditable, and publication-oriented CNV analysis. The workflow covers data acquisition, metadata harmonization, QC, CNV classification, recurrence analyses, clinical integration, survival modeling, exploratory clustering, and optional predictive modeling.
The main notebook implements the following stages:
- Environment setup and dependency loading
- Run configuration and provenance capture
- GDC file discovery and metadata harmonization
- Download with integrity verification
- CNV loading, normalization, and QC
- Classification of CNV states from `Segment_Mean`
- High-confidence somatic CNV filtering
- Recurrence analysis
  - exact `SegmentID` (`chr:start-end`)
  - cytoband overlap
  - fixed 1 Mb genomic bins
- Breakpoint and hotspot feature engineering
- Clinical and follow-up integration
- Association and survival analyses
  - Mann–Whitney tests
  - Kaplan–Meier curves
  - Cox proportional hazards models
- Optional clustering
- Optional predictive modeling
- Export of publication-ready tables, figures, logs, inventories, and backup archives
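As an illustration of the classification stage listed above, the following is a minimal sketch, not the notebook's implementation. It assumes `Segment_Mean` is a log2 copy ratio, and the ±0.2 cutoffs and the `classify_segment` helper are hypothetical choices made only for this example.

```python
from typing import Optional

# Hypothetical cutoffs on the log2 copy ratio; the notebook's thresholds may differ.
GAIN_THRESHOLD = 0.2
LOSS_THRESHOLD = -0.2


def classify_segment(segment_mean: Optional[float]) -> str:
    """Map a Segment_Mean value to a coarse CNV state."""
    # Missing values (None or NaN) are reported as unknown, never coerced to 0.
    if segment_mean is None or segment_mean != segment_mean:
        return "unknown"
    if segment_mean >= GAIN_THRESHOLD:
        return "gain"
    if segment_mean <= LOSS_THRESHOLD:
        return "loss"
    return "neutral"


# Toy values for illustration only (not cohort data).
states = [classify_segment(v) for v in (0.58, -1.0, 0.05, None)]
print(states)
```

Keeping the cutoffs as named constants makes them easy to record alongside the other run parameters.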
.
├── CNV_MMRF_COMMPASS_V17_Rodado.ipynb
├── README.md
├── requirements.txt
├── LICENSE
├── CITATION.cff
└── docs/
    ├── PIPELINE_OVERVIEW.md
    └── NOTEBOOK_MAP.md
Typical inputs include:
- open-access CNV segment files from the GDC `MMRF-COMMPASS` project
- `clinical.tsv`
- `follow_up.tsv`
- optional tables such as:
  - `family_history.tsv`
  - `exposure.tsv`
  - `pathology_detail.tsv`
- hg38 cytoband annotation table (`cytoBand_hg38.tsv`) for cytoband recurrence summaries
Some MMRF CoMMpass resources may be controlled-access depending on source and release. Ensure your usage complies with the terms of the originating data source.
The notebook writes all run-specific artifacts under:
outputs/run_<RUN_ID>/
├── raw/
├── processed/
├── results/
└── logs/
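The layout above could be created along these lines. This is a sketch under stated assumptions: the `create_run_dirs` helper, the UTC timestamp format used for `RUN_ID`, and the keys written to `run_params.json` are hypothetical and may not match the notebook. A temporary directory stands in for the repository root so the example does not touch a real workspace.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def create_run_dirs(base: Path, run_id: str) -> Path:
    """Create outputs/run_<RUN_ID>/ with its four subdirectories."""
    run_root = base / "outputs" / f"run_{run_id}"
    for sub in ("raw", "processed", "results", "logs"):
        (run_root / sub).mkdir(parents=True, exist_ok=True)
    return run_root


# Assumption: RUN_ID is a UTC timestamp; the notebook may use another scheme.
run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
base = Path(tempfile.mkdtemp())  # stand-in for the repository root
run_root = create_run_dirs(base, run_id)

# Hypothetical parameter set recorded for provenance.
params = {"run_id": run_id, "project": "MMRF-COMMPASS"}
params_path = run_root / "logs" / "run_params.json"
params_path.write_text(json.dumps(params, indent=2))
print(params_path)
```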
Common outputs generated by the workflow include:
- `processed/file_metadata.tsv`
- `processed/combined_cnvs.txt`
- `processed/cnv_segments.parquet`
- `processed/combined_cnvs_classified_adjusted.tsv`
- `processed/combined_cnvs_filtrado_somatica.txt`
- `results/cnv_recurrence_by_segmentid_exact__by_patient.tsv`
- `results/cnv_recurrence_by_segmentid_exact__by_rows.tsv`
- `results/cnv_recurrence_by_cytoband.tsv`
- `results/cnv_recurrence_by_bins_1Mb.tsv`
- `results/cnv_patient_bin_overlaps_1Mb.tsv`
- `results/cnv_patient_cytoband_overlaps.tsv`
- `results/breakpoint_metrics_per_patient.tsv`
- `results/hotspot_metrics_per_patient.tsv`
- `results/breakpoints_by_chromosome.tsv`
- `results/breakpoints_by_cytoband.tsv`
- `results/breakpoints_top_bins_1Mb_top100.tsv`
- `results/hot_bins_1Mb_top200.tsv`
- `results/os_df_patient_level.tsv`
- `results/survival_features_merged.tsv`
- `results/cox_results_breakpoints_hotspots.tsv`
- `results/top{TOP_K}_recurrent_regions_cox_table.tsv`
- `results/top{TOP_K}_recurrent_segments_cox_table.tsv`
- `results/top{TOP_K}_km_summary.tsv`
- `results/top{TOP_K}_km_summary_segment_exact.tsv`
- `results/supp_table_sex_descriptives_mean_sd_median_iqr.tsv`
- `results/supp_table_sex_mannwhitney.tsv`
- `results/supp_table_sex_combined.tsv`
- `results/cluster_assignments.tsv`
- `results/cluster_survival_logrank.txt`
- `logs/run_params.json`
- final text inventory of generated files
- workspace backup zip
python -m venv .venv
# Linux / macOS
source .venv/bin/activate
# Windows PowerShell
# .venv\Scripts\Activate.ps1
pip install -r requirements.txt
pip install jupyterlab
jupyter lab

Then open the notebook and run it from top to bottom.
- Upload the notebook to Colab.
- Run the first dependency cell.
- If your clinical tables are stored in Google Drive, mount Drive and point the notebook to the corresponding `.tsv` files.
- Execute the notebook sequentially from top to bottom.
This pipeline was structured with a few explicit choices:
- Participant linkage through metadata joins, not filename guessing.
- One canonical CNV dataframe for downstream analysis.
- No synthetic fallback data inserted into the workflow.
- No forced NA → 0 conversion in the survival block.
- Multiple recurrence representations to balance strict comparability and biological stability.
- Export of intermediate tables to support auditing and manuscript preparation.
- Exact-breakpoint recurrence (`SegmentID = chr:start-end`) is useful for strict comparison, but it may underestimate biological recurrence because patient breakpoints rarely match exactly.
- Cytoband and fixed-bin recurrence are generally more stable for cohort-level interpretation.
- Purity/ploidy adjustment may not be available when working from already-called public CNV segments.
- Some downstream blocks are optional and should be interpreted as exploratory unless independently validated.
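The difference between exact-breakpoint and fixed-bin recurrence can be seen in a small sketch. The toy segments, the half-open coordinate convention, and the `bins_overlapped` helper are assumptions made for illustration; they are not taken from the notebook or from cohort data.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # fixed 1 Mb genomic bins

# Hypothetical toy segments: (patient, chromosome, start, end).
segments = [
    ("P1", "chr1", 1_200_000, 2_900_000),
    ("P2", "chr1", 1_250_000, 2_850_000),
    ("P3", "chr1", 5_000_000, 5_400_000),
]


def bins_overlapped(start: int, end: int, bin_size: int = BIN_SIZE) -> range:
    """Indices of fixed bins touched by the half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


exact = defaultdict(set)   # SegmentID -> patients carrying that exact segment
binned = defaultdict(set)  # (chromosome, bin index) -> patients overlapping the bin
for patient, chrom, start, end in segments:
    exact[f"{chrom}:{start}-{end}"].add(patient)
    for b in bins_overlapped(start, end):
        binned[(chrom, b)].add(patient)

# Recurrence is counted per patient, so repeated rows do not inflate it.
exact_counts = {seg_id: len(patients) for seg_id, patients in exact.items()}
bin_counts = {key: len(patients) for key, patients in binned.items()}
print(exact_counts)
print(bin_counts)
```

Here P1 and P2 carry nearly identical events, yet no exact `SegmentID` is shared by more than one patient, while the 1 Mb bins covering the event are counted in both.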
If you use this repository in academic work, cite the software record in CITATION.cff and describe the notebook version used in your Methods section.
MIT License.