Integrative Multi-Omics and Machine Learning Analysis Reveals Shared Immunometabolic Signatures Between Type 2 Diabetes and Chronic Lymphocytic Leukemia

Authors

Md. Yeakub Ali¹, Md Shahjada Sajid², Md. Yousuf¹, AKM Azad³

Affiliations

¹ Department of Biomedical Engineering, Islamic University, Kushtia, Bangladesh
² Department of Information and Communication Technology, Islamic University, Kushtia, Bangladesh
³ Department of Mathematics and Statistics, Faculty of Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia

Abstract

Background: Type 2 diabetes (T2D) and chronic lymphocytic leukemia (CLL) are distinct disorders but share emerging links through immunometabolic dysregulation. Although inflammation and metabolic stress in T2D have been associated with cancer progression, the shared molecular basis remains unclear. A systematic cross-disease computational framework integrating multi-omics data and machine learning is still lacking.

Methods: We developed an integrative multi-omics and machine learning framework to identify shared molecular signatures between T2D and CLL using transcriptomic datasets (GSE159984 and GSE70830). Differential expression, functional enrichment, and protein–protein interaction network analyses were combined with ensemble feature selection (LASSO, SVM-RFE, Random Forest, and stability resampling) to identify robust candidate biomarkers. Predictive models were optimized using Bayesian optimization and evaluated through cross-validation and independent external validation (GSE92724). Multi-level in silico validation, including immune infiltration, proteomics profiling, drug sensitivity analysis, and molecular docking, was performed to assess biological relevance.

Results: A total of 6,700 differentially expressed genes were identified, including 161 shared genes enriched in immune and inflammatory pathways. Network analysis identified 14 hub genes, from which five candidate biomarkers (IL1B, CD3D, CXCL10, CXCR6, and IL10) were selected with stability scores of 60–100%. The ExtraTrees model achieved the best performance (accuracy = 0.818, ROC–AUC = 0.875), with consistent results in the external dataset (GSE92724). In silico analyses supported immune involvement, tissue-relevant expression, and potential therapeutic interactions.

Conclusions: This study presents a computational framework identifying shared immunometabolic features between T2D and CLL. The findings highlight candidate biomarkers supported by external and multi-level in silico validation, providing a basis for further experimental investigation.

Overview

This repository contains the computational workflow accompanying the research study on a shared immunometabolic axis between Type 2 Diabetes (T2D) and Chronic Lymphocytic Leukemia (CLL). The framework combines transcriptomic data harmonization, cross-cohort integration, feature selection, machine learning model development, and biomarker prioritization.

The primary objective is to identify robust, biologically interpretable gene signatures that are shared across T2D and CLL and to evaluate their predictive utility using reproducible machine learning pipelines.

Key Features

Multi-cohort preprocessing and harmonization of T2D and CLL gene expression datasets.
AI-driven feature selection for identification of stable cross-disease biomarkers.
Comparative machine learning modeling with cross-validation and external testing.
Structured results export for downstream statistical analysis and figure generation.
Notebook-based workflow designed for transparency and reproducibility.

Installation

1. Clone the repository

git clone https://github.com/sazidshahjada/T2D-CLL-Biomarker-ML.git
cd T2D-CLL-Biomarker-ML

2. Create and activate a Python environment

conda create -n mlenv python=3.10 -y
conda activate mlenv

3. Install dependencies

Install core scientific and machine learning libraries used by the notebooks.

pip install -r requirements.txt

If additional packages are required in your local environment, install them as needed based on notebook imports.

Dataset

The analysis uses publicly available transcriptomic datasets for T2D and CLL (GSE159984 and GSE70830, with GSE92724 used for external validation).

The T2D cohorts are stored under Raw_Data/T2D/.
The CLL cohort is stored under Raw_Data/CLL/.
Harmonized outputs are stored under Clean_Data/ and ML_Data/.

Data preprocessing expectations

Ensure raw count/expression files and metadata files remain in their expected directories.
Confirm sample identifiers match between expression matrices and metadata tables.
Run the preprocessing notebook (01_Data_Preprocessing_and_Harmonization.ipynb) first to generate the cleaned and harmonized intermediate files.
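
The identifier check described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not code from the notebooks; the function name check_sample_alignment and the GSM-style identifiers are hypothetical.

```python
from collections import Counter


def check_sample_alignment(expression_columns, metadata_ids):
    """Compare sample IDs from an expression matrix header with a metadata table."""
    expr = set(expression_columns)
    meta = set(metadata_ids)
    # Duplicated column names usually indicate a merge or export problem.
    duplicates = sorted(s for s, n in Counter(expression_columns).items() if n > 1)
    return {
        "missing_in_metadata": sorted(expr - meta),
        "missing_in_expression": sorted(meta - expr),
        "duplicated_in_expression": duplicates,
        "aligned": expr == meta and not duplicates,
    }


# Hypothetical sample identifiers, for illustration only.
report = check_sample_alignment(
    ["GSM001", "GSM002", "GSM003"],
    ["GSM001", "GSM002", "GSM004"],
)
print(report)
```

Running a check like this before preprocessing makes a mismatch visible as an explicit list of offending identifiers instead of a silent row drop later on.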

Usage

Run the workflow in the following order.

1. Launch Jupyter

jupyter notebook

2. Execute notebooks sequentially

01_Data_Preprocessing_and_Harmonization.ipynb
02_Feature_Selection_and_Importance_Analysis.ipynb
03_Model_Comparison_and_Performance_Metrics.ipynb

3. Review generated outputs

Processed datasets: Clean_Data/, ML_Data/
Model artifacts: Saved_Models/
Benchmark and evaluation metrics: Results/
Visual outputs: Figures/

Pipeline Description

1. Data preprocessing

Ingestion of raw transcriptomic matrices and sample metadata.
Quality checks, normalization handling, and cohort harmonization.
Construction of analysis-ready matrices for downstream modeling.

2. Multi-omics integration

Integration of disease-relevant molecular signals into a unified analysis matrix.
Standardization and alignment of features across cohorts/studies.
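
As a rough illustration of feature alignment and standardization, the sketch below keeps only genes present in both cohorts and z-scores each gene within its own cohort before concatenating samples. Per-cohort z-scoring is an assumption made for this example; the notebooks may use a different harmonization strategy. The function name harmonize and all expression values are hypothetical.

```python
from statistics import mean, pstdev


def harmonize(cohort_a, cohort_b):
    """Keep genes shared by both cohorts and z-score each gene within its cohort.

    Each cohort is a dict mapping gene symbol -> list of expression values.
    """
    shared = sorted(set(cohort_a) & set(cohort_b))

    def zscore(values):
        mu, sd = mean(values), pstdev(values)
        # A constant gene carries no signal; map it to zeros instead of dividing by 0.
        return [0.0 if sd == 0 else (v - mu) / sd for v in values]

    # Concatenate samples: cohort A first, then cohort B.
    return {g: zscore(cohort_a[g]) + zscore(cohort_b[g]) for g in shared}


# Toy values, not taken from the study datasets.
t2d = {"IL1B": [2.0, 4.0, 6.0], "CD3D": [1.0, 1.0, 1.0], "GENE_X": [5.0, 6.0, 7.0]}
cll = {"IL1B": [10.0, 20.0, 30.0], "CD3D": [3.0, 5.0, 7.0]}
combined = harmonize(t2d, cll)
print(sorted(combined))
```

GENE_X is dropped because it is measured in only one cohort, which is the behavior one would typically want before building a joint matrix.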

3. Feature selection

Identification of discriminative and stable features via statistical/ML-based methods.
Ranking of candidate biomarkers by importance and stability criteria.
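
One way to read a stability score is as the fraction of resampling rounds in which a feature is selected. The sketch below illustrates that idea with a stratified bootstrap. It deliberately swaps in a very simple scorer (absolute difference in class means) in place of the LASSO, SVM-RFE, and Random Forest selectors named in the abstract, so it shows the resampling logic only. All names and data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def top_k_features(X, y, k):
    """Rank features by absolute difference in class means (a simple stand-in scorer)."""
    scored = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scored.append((abs(mean(pos) - mean(neg)), j))
    return [j for _, j in sorted(scored, reverse=True)[:k]]


def stability_scores(X, y, k=2, n_rounds=50, seed=0):
    """Return, per feature index, the fraction of bootstrap rounds in which it was selected."""
    rng = random.Random(seed)
    idx_pos = [i for i, label in enumerate(y) if label == 1]
    idx_neg = [i for i, label in enumerate(y) if label == 0]
    counts = Counter()
    for _ in range(n_rounds):
        # Stratified bootstrap so every resample contains both classes.
        sample = [rng.choice(idx_pos) for _ in idx_pos] + [rng.choice(idx_neg) for _ in idx_neg]
        counts.update(top_k_features([X[i] for i in sample], [y[i] for i in sample], k))
    return {j: counts[j] / n_rounds for j in range(len(X[0]))}


# Toy matrix: feature 0 separates the classes, features 1 and 2 are mostly noise.
X = [
    [10.0, 0.5, 1.0],
    [11.0, 0.4, 0.0],
    [9.0, 0.6, 1.0],
    [0.0, 0.5, 0.0],
    [1.0, 0.6, 1.0],
    [0.5, 0.4, 0.0],
]
y = [1, 1, 1, 0, 0, 0]
scores = stability_scores(X, y, k=2)
print(scores)
```

In this toy example the strongly discriminative feature is selected in every round, while the two noisy features split the remaining slot between them.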

4. Machine learning modeling

Training and comparison of multiple classification models.
Cross-validation and hyperparameter optimization.
External dataset validation when available.
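
For readers unfamiliar with stratified cross-validation, the following sketch shows how folds can be built so that each fold roughly preserves the class balance. It is a standard-library illustration of the splitting logic only and makes no claim about the splitter, fold count, or seed used in the notebooks; the function name and the toy labels are hypothetical.

```python
import random


def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs with class proportions roughly preserved."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(members)
        # Deal the shuffled members of each class round-robin into the folds.
        for pos, i in enumerate(members):
            folds[pos % n_splits].append(i)
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Toy labels: six samples of class 0 and four of class 1.
labels = [0] * 6 + [1] * 4
splits = list(stratified_kfold(labels, n_splits=2))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each model under comparison would then be fitted on the training indices and scored on the held-out indices of every fold, with hyperparameter tuning kept inside the training portion to avoid leakage.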

5. Biomarker identification

Extraction of shared high-priority genes associated with both T2D and CLL.
Compilation of candidate biomarkers in structured results files.

6. Evaluation

Performance metrics across CV, test, and external test settings.
Model-level comparison and robustness assessment.
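
The two headline metrics quoted in the abstract, accuracy and ROC-AUC, can be computed from first principles as shown below. This is an illustrative sketch with made-up labels and scores, not the evaluation code from the notebooks. ROC-AUC is computed here as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, y_score):
    """Pairwise (Mann-Whitney) formulation of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.1]
y_pred = [int(s >= 0.5) for s in y_score]  # 0.5 threshold is an arbitrary choice here

print("accuracy:", accuracy(y_true, y_pred))  # 0.5
print("ROC-AUC:", roc_auc(y_true, y_score))   # 0.75
```

Accuracy depends on the chosen threshold while ROC-AUC does not, which is why the two numbers can diverge on the same predictions.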

Reproducibility

Use the same notebook execution order defined in the Usage section.
Keep file paths and folder names unchanged.
Run analyses in a clean environment and record package versions.
Fix random seeds where available in the notebooks to reduce run-to-run variability.
Store generated outputs under existing project directories for traceability.
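
A small helper along the following lines could be run at the top of each notebook to fix a seed and record package versions. It is a sketch under assumptions: the seed value is arbitrary, and the package names are guesses, since the contents of requirements.txt are not listed here. Libraries such as NumPy keep their own random state and would need to be seeded separately.

```python
import importlib.metadata as md
import json
import platform
import random

SEED = 42  # arbitrary illustrative value, not necessarily the one used in the notebooks


def snapshot_environment(packages):
    """Record the Python version, seed, and installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {"python": platform.python_version(), "seed": SEED, "packages": versions}


random.seed(SEED)  # seeds only Python's built-in generator
# Assumed package names; adjust to match requirements.txt.
snapshot = snapshot_environment(["numpy", "pandas", "scikit-learn"])
print(json.dumps(snapshot, indent=2))
```

Saving the printed JSON alongside the files in Results/ gives each run a record of the environment that produced it.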

Results

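Headline findings reported in the abstract: 6,700 differentially expressed genes, including 161 shared between T2D and CLL; 14 hub genes; five candidate biomarkers (IL1B, CD3D, CXCL10, CXCR6, and IL10) with stability scores of 60–100%; and a best-performing ExtraTrees model (accuracy = 0.818, ROC–AUC = 0.875) with consistent results on the external dataset (GSE92724). The key figures and tables from the paper are still to be summarized in this section.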

Suggested additions:

Main model performance table (CV/test/external).
Top-ranked shared biomarkers table.
Biological interpretation figures (pathway/network-level).
Sensitivity/specificity and calibration plots.

Configuration

Configuration is currently notebook-driven.

Update dataset paths only if your local directory layout differs.
Adjust model parameters inside notebook cells dedicated to tuning.
Keep outputs directed to Results/, Figures/, and Saved_Models/ for consistency.

For long-term maintainability, consider adding a centralized config.yaml file in future revisions.
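
Until such a file exists, one lightweight option is to gather the paths in a single Python object that every notebook imports. The sketch below is hypothetical: the class name, field names, and seed value are assumptions, and only the directory names come from this README.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ProjectConfig:
    """Central place for the directory names used throughout this README."""

    raw_t2d: Path = Path("Raw_Data/T2D")
    raw_cll: Path = Path("Raw_Data/CLL")
    clean_data: Path = Path("Clean_Data")
    ml_data: Path = Path("ML_Data")
    results: Path = Path("Results")
    figures: Path = Path("Figures")
    saved_models: Path = Path("Saved_Models")
    random_seed: int = 42  # hypothetical default

    def output_dirs(self):
        """Directories that notebooks write to."""
        return [self.results, self.figures, self.saved_models]


config = ProjectConfig()
for directory in config.output_dirs():
    print(directory)
```

Because the dataclass is frozen, a notebook cannot silently overwrite a path mid-run; a different layout is expressed by constructing a new ProjectConfig with explicit arguments.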

Contributing

Contributions are welcome for method improvements, code optimization, documentation refinement, and extended validation analyses.

Fork the repository.
Create a feature branch.
Commit clear, well-documented changes.
Submit a pull request describing the motivation and validation.

License

This project is distributed under the terms specified in the LICENSE file.

Acknowledgements

We acknowledge the public repositories and research consortia that generated and shared the transcriptomic datasets used in this study, as well as the open-source scientific Python ecosystem that enabled this analysis.
