tiagochavo87/CNV-MMRF-COMMPASS-PIPELINE

CNV Pipeline — MMRF CoMMpass (GDC)

A reproducible Jupyter notebook pipeline for processing and downstream analysis of Copy Number Variation (CNV) segments from the MMRF CoMMpass cohort via the NCI Genomic Data Commons (GDC).

This repository is designed for transparent, auditable, and publication-oriented CNV analysis. The workflow covers data acquisition, metadata harmonization, QC, CNV classification, recurrence analyses, clinical integration, survival modeling, exploratory clustering, and optional predictive modeling.

What the notebook does

The main notebook implements the following stages:

  1. Environment setup and dependency loading
  2. Run configuration and provenance capture
  3. GDC file discovery and metadata harmonization
  4. Download with integrity verification
  5. CNV loading, normalization, and QC
  6. Classification of CNV states from Segment_Mean
  7. High-confidence somatic CNV filtering
  8. Recurrence analysis
    • exact SegmentID (chr:start-end)
    • cytoband overlap
    • fixed 1 Mb genomic bins
  9. Breakpoint and hotspot feature engineering
  10. Clinical and follow-up integration
  11. Association and survival analyses
    • Mann–Whitney tests
    • Kaplan–Meier curves
    • Cox proportional hazards models
  12. Optional clustering
  13. Optional predictive modeling
  14. Export of publication-ready tables, figures, logs, inventories, and backup archives
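Stage 6 can be illustrated with a minimal sketch. It assumes Segment_Mean is on a log2-ratio scale and uses hypothetical cutoffs of ±0.2; the notebook's actual thresholds are not stated in this README. The column names Chromosome, Start, and End, and both helper functions, are also assumptions made for illustration.

```python
import math

# Hypothetical cutoffs on an assumed log2-ratio scale; not taken from the notebook.
GAIN_THRESHOLD = 0.2
LOSS_THRESHOLD = -0.2


def classify_segment(segment_mean, gain=GAIN_THRESHOLD, loss=LOSS_THRESHOLD):
    """Map a Segment_Mean value to 'gain', 'loss', or 'neutral'.

    Missing values (None or NaN) are returned as None rather than
    being forced into a category.
    """
    if segment_mean is None or math.isnan(segment_mean):
        return None
    if segment_mean >= gain:
        return "gain"
    if segment_mean <= loss:
        return "loss"
    return "neutral"


def segment_id(chrom, start, end):
    """Build the exact-breakpoint identifier in chr:start-end form."""
    return f"{chrom}:{start}-{end}"


# Toy segments with assumed column names.
segments = [
    {"Chromosome": "chr1", "Start": 1000000, "End": 2500000, "Segment_Mean": 0.58},
    {"Chromosome": "chr13", "Start": 20000000, "End": 40000000, "Segment_Mean": -0.71},
    {"Chromosome": "chr2", "Start": 500000, "End": 900000, "Segment_Mean": -0.05},
]

for seg in segments:
    seg["SegmentID"] = segment_id(seg["Chromosome"], seg["Start"], seg["End"])
    seg["CNV_State"] = classify_segment(seg["Segment_Mean"])
    print(seg["SegmentID"], seg["CNV_State"])
```

Keeping the thresholds as named parameters makes it straightforward to record them in the run configuration and to rerun the classification with different cutoffs.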

Repository contents

.
├── CNV_MMRF_COMMPASS_V17_Rodado.ipynb
├── README.md
├── requirements.txt
├── LICENSE
├── CITATION.cff
└── docs/
    ├── PIPELINE_OVERVIEW.md
    └── NOTEBOOK_MAP.md

Inputs

Typical inputs include:

  • open-access CNV segment files from the GDC MMRF-COMMPASS project
  • clinical.tsv
  • follow_up.tsv
  • optional tables such as:
    • family_history.tsv
    • exposure.tsv
    • pathology_detail.tsv
  • hg38 cytoband annotation table (cytoBand_hg38.tsv) for cytoband recurrence summaries
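As a rough illustration of how a cytoband table can be used for overlap summaries, the sketch below assumes the file follows the headerless UCSC cytoBand layout (chrom, start, end, band name, stain) with half-open coordinates. The two rows are illustrative stand-ins for the real annotation, and load_cytobands and overlapping_bands are hypothetical helpers, not functions from the notebook.

```python
import csv
import io

# Illustrative rows in the assumed UCSC cytoBand layout (tab-separated, no header).
CYTOBAND_TEXT = (
    "chr1\t0\t2300000\tp36.33\tgneg\n"
    "chr1\t2300000\t5300000\tp36.32\tgpos25\n"
)


def load_cytobands(handle):
    """Read (chrom, start, end, name) tuples from a tab-separated handle."""
    bands = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        bands.append((row[0], int(row[1]), int(row[2]), row[3]))
    return bands


def overlapping_bands(bands, chrom, start, end):
    """Return names of bands overlapping the half-open interval [start, end)."""
    return [
        name
        for band_chrom, band_start, band_end, name in bands
        if band_chrom == chrom and band_start < end and start < band_end
    ]


bands = load_cytobands(io.StringIO(CYTOBAND_TEXT))
print(overlapping_bands(bands, "chr1", 2000000, 3000000))
```

For a real file, the io.StringIO object would be replaced by an open file handle. A linear scan is adequate for a few hundred bands, while large segment tables may benefit from an interval index.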

Some MMRF CoMMpass resources may be controlled-access depending on source and release. Ensure your usage complies with the terms of the originating data source.

Outputs

The notebook writes all run-specific artifacts under:

outputs/run_<RUN_ID>/
├── raw/
├── processed/
├── results/
└── logs/

Common outputs generated by the workflow include:

Processed data

  • processed/file_metadata.tsv
  • processed/combined_cnvs.txt
  • processed/cnv_segments.parquet
  • processed/combined_cnvs_classified_adjusted.tsv
  • processed/combined_cnvs_filtrado_somatica.txt

Recurrence summaries

  • results/cnv_recurrence_by_segmentid_exact__by_patient.tsv
  • results/cnv_recurrence_by_segmentid_exact__by_rows.tsv
  • results/cnv_recurrence_by_cytoband.tsv
  • results/cnv_recurrence_by_bins_1Mb.tsv
  • results/cnv_patient_bin_overlaps_1Mb.tsv
  • results/cnv_patient_cytoband_overlaps.tsv

Feature engineering

  • results/breakpoint_metrics_per_patient.tsv
  • results/hotspot_metrics_per_patient.tsv
  • results/breakpoints_by_chromosome.tsv
  • results/breakpoints_by_cytoband.tsv
  • results/breakpoints_top_bins_1Mb_top100.tsv
  • results/hot_bins_1Mb_top200.tsv

Clinical and survival

  • results/os_df_patient_level.tsv
  • results/survival_features_merged.tsv
  • results/cox_results_breakpoints_hotspots.tsv
  • results/top{TOP_K}_recurrent_regions_cox_table.tsv
  • results/top{TOP_K}_recurrent_segments_cox_table.tsv
  • results/top{TOP_K}_km_summary.tsv
  • results/top{TOP_K}_km_summary_segment_exact.tsv

Supplementary / exploratory

  • results/supp_table_sex_descriptives_mean_sd_median_iqr.tsv
  • results/supp_table_sex_mannwhitney.tsv
  • results/supp_table_sex_combined.tsv
  • results/cluster_assignments.tsv
  • results/cluster_survival_logrank.txt

Run provenance and archival

  • logs/run_params.json
  • final text inventory of generated files
  • workspace backup zip
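The run layout and provenance files could be produced along the following lines. This is a sketch under assumptions: init_run and write_inventory are hypothetical helpers, the JSON fields and the inventory filename file_inventory.txt are invented for illustration, and the TOP_K value is a placeholder.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def init_run(base_dir, run_id, params):
    """Create outputs/run_<RUN_ID>/ with its subfolders and write run_params.json."""
    run_dir = Path(base_dir) / f"run_{run_id}"
    for sub in ("raw", "processed", "results", "logs"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    record = {
        "run_id": run_id,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "params": params,
    }
    params_path = run_dir / "logs" / "run_params.json"
    params_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return run_dir


def write_inventory(run_dir):
    """List every file under the run directory and save the list as text."""
    files = sorted(
        p.relative_to(run_dir).as_posix() for p in run_dir.rglob("*") if p.is_file()
    )
    # Hypothetical filename; the list is collected before this file is written.
    (run_dir / "logs" / "file_inventory.txt").write_text("\n".join(files) + "\n")
    return files


# Demo in a temporary directory so nothing is written into the repository.
base = Path(tempfile.mkdtemp()) / "outputs"
run_dir = init_run(base, "demo", {"TOP_K": 20})
inventory = write_inventory(run_dir)
print(inventory)
```

Writing the parameters before any analysis step means a failed run still leaves a record of how it was configured.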

Quickstart (local)

python -m venv .venv

# Linux / macOS
source .venv/bin/activate

# Windows PowerShell
# .venv\Scripts\Activate.ps1

pip install -r requirements.txt
pip install jupyterlab
jupyter lab

Then open the notebook and run it from top to bottom.

Quickstart (Google Colab)

  1. Upload the notebook to Colab.
  2. Run the first dependency cell.
  3. If your clinical tables are stored in Google Drive, mount Drive and point the notebook to the corresponding .tsv files.
  4. Execute the notebook sequentially from top to bottom.

Analysis principles

The pipeline follows a few explicit design choices:

  • Participant linkage through metadata joins, not filename guessing.
  • One canonical CNV dataframe for downstream analysis.
  • No synthetic fallback data inserted into the workflow.
  • No forced NA → 0 conversion in the survival block.
  • Multiple recurrence representations to balance strict comparability and biological stability.
  • Export of intermediate tables to support auditing and manuscript preparation.
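Two of these choices, linkage through metadata joins and leaving missing survival values as missing, can be sketched with pandas. All identifiers and column names below (file_id, case_submitter_id, os_time_days, os_event) are assumptions for illustration and may differ from the notebook's tables.

```python
import pandas as pd

# Toy tables with hypothetical identifiers and column names.
file_metadata = pd.DataFrame({
    "file_id": ["f1", "f2", "f3"],
    "case_submitter_id": ["MMRF_0001", "MMRF_0002", "MMRF_0003"],
})
segments = pd.DataFrame({
    "file_id": ["f1", "f1", "f2", "f3"],
    "Segment_Mean": [0.6, -0.4, 0.1, -0.9],
})
clinical = pd.DataFrame({
    "case_submitter_id": ["MMRF_0001", "MMRF_0002", "MMRF_0003"],
    "os_time_days": [900.0, None, 400.0],
    "os_event": [1, 0, 1],
})

# Link segments to participants through the metadata table, not the filename.
# validate= raises if a file_id maps to more than one participant.
linked = segments.merge(file_metadata, on="file_id", how="left", validate="many_to_one")
if linked["case_submitter_id"].isna().any():
    raise ValueError("Some segments could not be linked to a participant")

# One row per participant, then attach clinical follow-up.
per_patient = linked.groupby("case_submitter_id", as_index=False).agg(
    n_segments=("Segment_Mean", "size")
)
survival = per_patient.merge(
    clinical, on="case_submitter_id", how="left", validate="one_to_one"
)

# Missing survival times stay missing; incomplete rows are dropped
# for modeling instead of being converted to 0.
complete = survival.dropna(subset=["os_time_days", "os_event"])
print(complete)
```

Dropping incomplete rows is only one option for the modeling step. The point of the sketch is that a missing follow-up time is never silently turned into a zero-day survival.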

Notes and limitations

  • Exact-breakpoint recurrence (SegmentID = chr:start-end) is useful for strict comparison, but it may underestimate biological recurrence because breakpoints from different patients rarely coincide to the base pair.
  • Cytoband and fixed-bin recurrence are generally more stable for cohort-level interpretation.
  • Purity/ploidy adjustment may not be available when working from already-called public CNV segments.
  • Some downstream blocks are optional and should be interpreted as exploratory unless independently validated.
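The difference between exact-breakpoint and fixed-bin recurrence can be seen in a small worked example. It assumes half-open coordinates and toy patients P1 to P3; bins_for_segment is a hypothetical helper and the segment values are invented.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # fixed 1 Mb bins


def bins_for_segment(start, end, bin_size=BIN_SIZE):
    """Indices of fixed bins overlapped by the half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


# Toy segments: (patient, chromosome, start, end). P1 and P2 carry
# nearly identical events whose breakpoints differ by 50 kb.
segments = [
    ("P1", "chr1", 1_200_000, 2_800_000),
    ("P2", "chr1", 1_250_000, 2_750_000),
    ("P3", "chr1", 5_000_000, 5_400_000),
]

exact = defaultdict(set)   # SegmentID -> patients
binned = defaultdict(set)  # (chromosome, bin index) -> patients
for patient, chrom, start, end in segments:
    exact[f"{chrom}:{start}-{end}"].add(patient)
    for bin_index in bins_for_segment(start, end):
        binned[(chrom, bin_index)].add(patient)

# Count distinct patients, so several segments from one patient in a bin count once.
exact_counts = {key: len(patients) for key, patients in exact.items()}
bin_counts = {key: len(patients) for key, patients in binned.items()}

print(max(exact_counts.values()))  # no SegmentID is shared between patients
print(bin_counts[("chr1", 1)])     # both P1 and P2 overlap this bin
```

In this toy data every exact SegmentID is seen in a single patient, while the 1 Mb bins covering 1 to 3 Mb on chr1 are each hit by two patients.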

Citation

If you use this repository in academic work, cite the software record in CITATION.cff and describe the notebook version used in your Methods section.

License

MIT License.

About

A reproducible Jupyter-based pipeline for CNV processing and downstream analytics in the MMRF CoMMpass multiple myeloma cohort, including QC, feature engineering, visualizations, and survival/staging evaluation.
