Integrative Multi-Omics and Machine Learning Analysis Reveals Shared Immunometabolic Signatures Between Type 2 Diabetes and Chronic Lymphocytic Leukemia
Md. Yeakub Ali1, Md Shahjada Sajid2, Md. Yousuf1, AKM Azad3
- 1 Department of Biomedical Engineering, Islamic University, Kushtia, Bangladesh
- 2 Department of Information and Communication Technology, Islamic University, Kushtia, Bangladesh
- 3 Department of Mathematics and Statistics, Faculty of Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia
Background: Type 2 diabetes (T2D) and chronic lymphocytic leukemia (CLL) are distinct disorders but share emerging links through immunometabolic dysregulation. Although inflammation and metabolic stress in T2D have been associated with cancer progression, the shared molecular basis remains unclear. A systematic cross-disease computational framework integrating multi-omics data and machine learning is still lacking.
Methods: We developed an integrative multi-omics and machine learning framework to identify shared molecular signatures between T2D and CLL using transcriptomic datasets (GSE159984 and GSE70830). Differential expression, functional enrichment, and protein–protein interaction network analyses were combined with ensemble feature selection (LASSO, SVM-RFE, Random Forest, and stability resampling) to identify robust candidate biomarkers. Predictive models were optimized using Bayesian optimization and evaluated through cross-validation and independent external validation (GSE92724). Multi-level in silico validation, including immune infiltration, proteomics profiling, drug sensitivity analysis, and molecular docking, was performed to assess biological relevance.
Results: A total of 6,700 differentially expressed genes were identified, including 161 shared genes enriched in immune and inflammatory pathways. Network analysis identified 14 hub genes, from which five candidate biomarkers (IL1B, CD3D, CXCL10, CXCR6, and IL10) were selected with stability scores of 60–100%. The ExtraTrees model achieved the best performance (accuracy = 0.818, ROC–AUC = 0.875), with consistent results in the external dataset (GSE92724). In silico analyses supported immune involvement, tissue-relevant expression, and potential therapeutic interactions.
Conclusions: This study presents a computational framework identifying shared immunometabolic features between T2D and CLL. The findings highlight candidate biomarkers supported by external and multi-level in silico validation, providing a basis for further experimental investigation.
This repository contains the computational workflow accompanying the research study on a shared immunometabolic axis between Type 2 Diabetes (T2D) and Chronic Lymphocytic Leukemia (CLL). The framework combines transcriptomic data harmonization, cross-cohort integration, feature selection, machine learning model development, and biomarker prioritization.
The primary objective is to identify robust, biologically interpretable gene signatures that are shared across T2D and CLL and to evaluate their predictive utility using reproducible machine learning pipelines.
- Multi-cohort preprocessing and harmonization of T2D and CLL gene expression datasets.
- AI-driven feature selection for identification of stable cross-disease biomarkers.
- Comparative machine learning modeling with cross-validation and external testing.
- Structured results export for downstream statistical analysis and figure generation.
- Notebook-based workflow designed for transparency and reproducibility.
git clone https://github.com/sazidshahjada/T2D-CLL-Biomarker-ML.git
cd T2D-CLL-Biomarker-ML
conda create -n mlenv python=3.10 -y
conda activate mlenv
Install core scientific and machine learning libraries used by the notebooks.
pip install -r requirements.txt
If additional packages are required in your local environment, install them as needed based on notebook imports.
The analysis uses publicly available transcriptomic datasets for T2D and CLL.
- T2D cohorts are stored under `Raw_Data/T2D/`.
- CLL cohort is stored under `Raw_Data/CLL/`.
- Harmonized outputs are stored under `Clean_Data/` and `ML_Data/`.
- Ensure raw count/expression files and metadata files remain in their expected directories.
- Confirm sample identifiers match between expression matrices and metadata tables.
- Run the preprocessing notebook first to generate cleaned and harmonized intermediate files.
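
The identifier check in the list above can be sketched as follows. This is a minimal illustration, not the notebooks' own code: it assumes samples are the columns of the expression matrix and the index of the metadata table, and the function name `check_sample_ids` and the toy identifiers are hypothetical.

```python
import pandas as pd


def check_sample_ids(expr: pd.DataFrame, meta: pd.DataFrame) -> dict:
    """Compare expression columns (samples) with the metadata index (samples)."""
    expr_ids = set(expr.columns)
    meta_ids = set(meta.index)
    return {
        "missing_in_metadata": sorted(expr_ids - meta_ids),
        "missing_in_expression": sorted(meta_ids - expr_ids),
        "shared": sorted(expr_ids & meta_ids),
    }


# Toy tables standing in for a real cohort (hypothetical identifiers and values).
expr = pd.DataFrame(
    {"S1": [5.1, 2.0], "S2": [4.8, 2.2], "S3": [5.0, 1.9]},
    index=["GENE_1", "GENE_2"],
)
meta = pd.DataFrame({"condition": ["case", "control", "case"]}, index=["S1", "S2", "S4"])

report = check_sample_ids(expr, meta)
print(report)
```

Any non-empty `missing_in_*` entry is worth resolving before running the preprocessing notebook, since silently dropped samples change downstream results.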
Run the workflow in the following order.
jupyter notebook
01_Data_Preprocessing_and_Harmonization.ipynb
02_Feature_Selection_and_Importance_Analysis.ipynb
03_Model_Comparison_and_Performance_Metrics.ipynb
- Processed datasets: `Clean_Data/`, `ML_Data/`
- Model artifacts: `Saved_Models/`
- Benchmark and evaluation metrics: `Results/`
- Visual outputs: `Figures/`
- Ingestion of raw transcriptomic matrices and sample metadata.
- Quality checks, normalization handling, and cohort harmonization.
- Construction of analysis-ready matrices for downstream modeling.
- Integration of disease-relevant molecular signals into a unified analysis matrix.
- Standardization and alignment of features across cohorts/studies.
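
One possible way to align and standardize features across cohorts is sketched below. It assumes genes are rows and samples are columns, keeps only genes present in every cohort, and z-scores each gene within its cohort. Per-cohort z-scoring is an illustrative choice here, not necessarily the harmonization method used in the notebooks, and `harmonize` is a hypothetical helper.

```python
import pandas as pd


def harmonize(cohorts: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Keep genes common to all cohorts, z-score per gene within each cohort,
    and stack the result as a samples x genes matrix with a cohort label."""
    common = sorted(set.intersection(*(set(df.index) for df in cohorts.values())))
    parts = []
    for name, df in cohorts.items():
        sub = df.loc[common]
        mean = sub.mean(axis=1)
        # Population standard deviation; constant genes get std 1 so they map to 0.
        std = sub.std(axis=1, ddof=0).replace(0, 1.0)
        z = sub.sub(mean, axis=0).div(std, axis=0)
        parts.append(z.T.assign(cohort=name))
    return pd.concat(parts)


# Toy cohorts with partially overlapping genes (hypothetical values).
cohort_a = pd.DataFrame(
    {"A1": [1.0, 2.0, 3.0], "A2": [3.0, 2.0, 5.0]},
    index=["GENE_1", "GENE_2", "GENE_3"],
)
cohort_b = pd.DataFrame({"B1": [10.0, 0.0], "B2": [14.0, 4.0]}, index=["GENE_2", "GENE_1"])

matrix = harmonize({"T2D": cohort_a, "CLL": cohort_b})
print(matrix)
```

Restricting to the gene intersection avoids imputing values for genes a platform never measured, at the cost of discarding cohort-specific features.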
- Identification of discriminative and stable features via statistical/ML-based methods.
- Ranking of candidate biomarkers by importance and stability criteria.
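
The stability criterion can be illustrated with a simple resampling loop around an L1-penalized (LASSO-style) logistic regression: a feature's stability score is the fraction of subsamples in which it receives a non-zero coefficient. This is a sketch on synthetic data. The number of rounds, the 80% subsample fraction, and `C=0.5` are assumptions, and the notebooks may combine several selectors differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def stability_scores(X, y, n_rounds=30, frac=0.8, C=0.5, seed=0):
    """Fraction of subsampling rounds in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        # Draw a subsample without replacement and scale it on its own.
        idx = rng.choice(n_samples, size=int(frac * n_samples), replace=False)
        X_sub = StandardScaler().fit_transform(X[idx])
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X_sub, y[idx])
        counts += model.coef_.ravel() != 0
    return counts / n_rounds


# Synthetic stand-in for an expression matrix (samples x genes).
X, y = make_classification(
    n_samples=100, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
scores = stability_scores(X, y)
ranking = np.argsort(scores)[::-1]  # feature indices, most stable first
print(scores.round(2))
```

A score of 1.0 corresponds to a feature selected in every round, which matches the percentage scale used for the stability scores reported in the abstract.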
- Training and comparison of multiple classification models.
- Cross-validation and hyperparameter optimization.
- External dataset validation when available.
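
A compact way to compare classifiers under the same cross-validation splits is sketched below with scikit-learn. The data are synthetic, the hyperparameters are fixed defaults rather than the Bayesian-optimized settings described in the paper, and the model list is only an example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a samples x biomarkers matrix.
X, y = make_classification(
    n_samples=120, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

models = {
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Reusing one seeded splitter gives every model identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in sorted(cv_auc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean ROC-AUC = {auc:.3f}")
```

An external cohort, when available, should be scored once with the final fitted model and never used during tuning.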
- Extraction of shared high-priority genes associated with both T2D and CLL.
- Compilation of candidate biomarkers in structured results files.
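
Extracting shared genes can be as simple as filtering each disease's differential expression table and joining on gene name. The sketch below assumes columns named `gene`, `log2FC`, and `padj`, and uses cutoffs of adjusted p < 0.05 and |log2FC| >= 1. The column names, cutoffs, and toy values are all hypothetical and do not come from the study.

```python
import pandas as pd


def significant(deg: pd.DataFrame, padj_max: float = 0.05, lfc_min: float = 1.0) -> pd.DataFrame:
    """Return rows passing both cutoffs, indexed by gene."""
    keep = (deg["padj"] < padj_max) & (deg["log2FC"].abs() >= lfc_min)
    return deg.loc[keep].set_index("gene")


# Toy differential expression tables (placeholder genes and values).
t2d_deg = pd.DataFrame(
    {"gene": ["GENE_1", "GENE_2", "GENE_3"], "log2FC": [1.8, 1.2, 0.3], "padj": [0.001, 0.01, 0.2]}
)
cll_deg = pd.DataFrame(
    {"gene": ["GENE_1", "GENE_2", "GENE_4"], "log2FC": [2.1, -1.5, 1.4], "padj": [0.002, 0.03, 0.04]}
)

shared = significant(t2d_deg).join(
    significant(cll_deg), how="inner", lsuffix="_t2d", rsuffix="_cll"
)
print(shared)
```

Keeping both fold changes side by side makes it easy to see genes that are significant in both diseases but regulated in opposite directions, as `GENE_2` is in this toy example.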
- Performance metrics across CV, test, and external test settings.
- Model-level comparison and robustness assessment.
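
The metrics reported for each setting (CV, test, external test) can be computed from true labels and predicted probabilities as sketched below. The 0.5 decision threshold and the helper name `summarize` are assumptions, and the labels and probabilities are toy values.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score


def summarize(y_true, y_prob, threshold=0.5):
    """Accuracy, ROC-AUC, sensitivity, and specificity for a binary classifier."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),  # threshold-independent
        "sensitivity": tp / (tp + fn),  # assumes at least one positive sample
        "specificity": tn / (tn + fp),  # assumes at least one negative sample
    }


# Toy labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.3, 0.7, 0.9]

metrics = summarize(y_true, y_prob)
print({k: round(float(v), 3) for k, v in metrics.items()})
```

Calling the same function on every split keeps the comparison across CV, test, and external settings consistent.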
- Use the same notebook execution order defined in the Usage section.
- Keep file paths and folder names unchanged.
- Run analyses in a clean environment and record package versions.
- Fix random seeds where available in the notebooks to reduce run-to-run variability.
- Store generated outputs under existing project directories for traceability.
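
Two of the points above, fixing seeds and recording package versions, can be handled in a first notebook cell along these lines. The helper names and the package list are illustrative, so adjust the list to match `requirements.txt`.

```python
import importlib.metadata as md
import os
import random

import numpy as np


def set_seeds(seed: int = 42) -> None:
    """Seed the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)


def package_versions(names):
    """Map each distribution name to its installed version, if any."""
    versions = {}
    for name in names:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


set_seeds(42)
first = np.random.rand(3)
set_seeds(42)
second = np.random.rand(3)  # identical to `first` because the seed was reset

versions = package_versions(["numpy", "pandas", "scikit-learn"])
print(versions)
```

Scikit-learn estimators and splitters take their own `random_state` argument, so the global seed alone does not make them deterministic.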
This section should summarize key figures, tables, and quantitative findings from the paper.
Suggested additions:
- Main model performance table (CV/test/external).
- Top-ranked shared biomarkers table.
- Biological interpretation figures (pathway/network-level).
- Sensitivity/specificity and calibration plots.
Configuration is currently notebook-driven.
- Update dataset paths only if your local directory layout differs.
- Adjust model parameters inside notebook cells dedicated to tuning.
- Keep outputs directed to `Results/`, `Figures/`, and `Saved_Models/` for consistency.
For long-term maintainability, consider adding a centralized config.yaml file in future revisions.
Contributions are welcome for method improvements, code optimization, documentation refinement, and extended validation analyses.
- Fork the repository.
- Create a feature branch.
- Commit clear, well-documented changes.
- Submit a pull request describing the motivation and validation.
This project is distributed under the terms specified in the LICENSE file.
We acknowledge the public repositories and research consortia that generated and shared the transcriptomic datasets used in this study, as well as the open-source scientific Python ecosystem that enabled this analysis.