Shalev Yaacov — M.Sc. Researcher, Tabach Lab, Hebrew University of Jerusalem. Computational pipeline development for novel gene discovery in Inherited Retinal Diseases, integrating evolutionary profiling, phenotypic clustering, and multi-omics data.
This repository is a curated public sample of the computational methods and tools I develop as part of an ongoing M.Sc. thesis at the Hebrew University of Jerusalem, aimed at identifying novel disease-causing genes in Inherited Retinal Diseases (IRD) that current diagnostic panels miss.
The research partitions 450+ IRD-associated genes into co-evolved functional modules by combining Normalized Phylogenetic Profiling (NPP) across ~2,000 genomes with Human Phenotype Ontology (HPO)-based phenotypic clustering. It also builds a bidirectional phenotype-to-gene inference engine for clinical prioritization and derives a genome-scale candidate ranking model validated against 4,254+ real-world cases in collaboration with Hadassah Medical Center. The approach deliberately combines phenotypic, evolutionary, and functional signals into a unified discovery pipeline rather than relying on any single one. The findings remain unpublished pending the methodology manuscript in preparation.
What is here is a structured window into the methods: five focused demonstrations and tools drawn from the same framework, presented roughly in the order the corresponding stages are used, each running on fully synthetic data at a fraction of the real scale. They are not the pipeline, and not its results — they are illustrations of the reasoning and implementation behind it. A reader who works through them will understand the approach; a reader looking for the findings should wait for the manuscript.
An interactive, narrative walkthrough of the research (NPP methodology, project case studies, and contact) is available in Visual_Portfolio/index.html.
portfolio/
├── README.md
├── Visual_Portfolio/
│   └── index.html
└── projects/
    ├── IRD_Phenotype_Clustering/
    │   ├── README.md
    │   ├── RATIONALE.md
    │   └── scripts/ird_phenotype_clustering_demo.ipynb
    ├── Cilia_Module_Validation/
    │   ├── README.md
    │   └── scripts/cilia_clusters_demo.ipynb
    ├── LPP_NPP_Heatmap_Visualization/
    │   ├── README.md
    │   ├── scripts/gene_list_to_lpp_heatmap.R
    │   ├── scripts/gene_list_to_npp_heatmap.R
    │   └── scripts/lpp_multi_cluster_heatmap_with_inclusion.R
    ├── LBS_Consensus_Profiling/
    │   ├── README.md
    │   └── notebook/consensus_profile_demo.ipynb
    └── IRD_HPO_Anatomogram/
        ├── README.md
        └── scripts/hpo_organ_mapping_demo.ipynb
Each project contains a README with scientific rationale, workflow
description, and usage instructions. IRD_Phenotype_Clustering additionally
includes a dedicated RATIONALE.md with an extended discussion of method
design decisions.
The five projects fall into three categories reflecting their role in the full pipeline.
Core methodological steps — each demonstrated on synthetic data at reduced scale
IRD_Phenotype_Clustering: HPO-based semantic similarity clustering of IRD genes into functional disease modules
Computes pairwise semantic similarity across IRD gene HPO annotations using Lin similarity with Best-Match Average, constructs a gene–gene similarity graph, and applies Leiden community detection to produce phenotype-driven disease modules. Includes information content (IC)-based term filtering, perturbation stability scoring, and module characterization with Fisher's exact test.
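As a rough illustration of the similarity step described above, the sketch below computes Lin similarity between terms and a Best-Match Average between genes, then keeps gene pairs above a cutoff as graph edges. Everything in it is a hypothetical stand-in: the toy ontology, the IC values, the gene names, and the 0.4 threshold are assumptions made for the example, and the Leiden step is left out. It uses the BMA variant that averages the two directional best-match means; the notebook may differ in these details.

```python
from itertools import combinations

# Hypothetical toy ontology (child -> parents); these are not real HPO terms.
PARENTS = {
    "root": [],
    "eye": ["root"],
    "ear": ["root"],
    "retina": ["eye"],
    "rod_loss": ["retina"],
    "cone_loss": ["retina"],
    "hearing_loss": ["ear"],
}
# Hypothetical information content values; in practice IC = -log p(term).
IC = {"root": 0.0, "eye": 1.0, "ear": 1.2, "retina": 2.0,
      "rod_loss": 3.0, "cone_loss": 3.2, "hearing_loss": 2.5}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def lin(t1, t2):
    """Lin similarity: 2 * IC(most informative common ancestor) / (IC(t1) + IC(t2))."""
    denom = IC[t1] + IC[t2]
    if denom == 0:
        return 0.0
    mica_ic = max(IC[a] for a in ancestors(t1) & ancestors(t2))
    return 2 * mica_ic / denom


def bma(terms_a, terms_b):
    """Best-Match Average: mean of the two directional best-match means."""
    a_to_b = sum(max(lin(a, b) for b in terms_b) for a in terms_a) / len(terms_a)
    b_to_a = sum(max(lin(a, b) for a in terms_a) for b in terms_b) / len(terms_b)
    return (a_to_b + b_to_a) / 2


# Hypothetical gene annotations.
genes = {
    "GENE_A": ["rod_loss", "hearing_loss"],
    "GENE_B": ["cone_loss"],
    "GENE_C": ["hearing_loss"],
}

THRESHOLD = 0.4  # hypothetical edge cutoff
edges = []
for g1, g2 in combinations(sorted(genes), 2):
    score = bma(genes[g1], genes[g2])
    if score >= THRESHOLD:
        edges.append((g1, g2, round(score, 3)))
```

The resulting weighted edge list is the kind of input a community detection method such as Leiden would then partition into modules.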
LBS_Consensus_Profiling: Consensus-based evolutionary barcode detection for gene module signatures
Explores a complementary evolutionary approach: identifying a Local Barcode Segment — a species window where a gene group displays a distinctive, concentrated conservation pattern — and using consensus profiles built from that window as query signatures for genome-wide candidate retrieval. Compares five aggregation strategies with coherence validation and a noise-based negative control.
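The consensus idea above can be sketched in a few lines: aggregate member profiles column by column inside a species window, check how well members agree with the consensus, and rank other genes by correlation with it. This is only a minimal sketch under stated assumptions. The function names, the window, and the toy profiles are hypothetical, only mean and median aggregation are shown (the README does not name the five strategies compared), and Pearson correlation is an assumed choice of similarity measure.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def consensus(profiles, window, agg=mean):
    """Aggregate member profiles column by column inside the species window."""
    lo, hi = window
    columns = zip(*(p[lo:hi] for p in profiles))
    return [agg(col) for col in columns]


def coherence(profiles, window, agg=mean):
    """Mean correlation between each member and the consensus profile."""
    lo, hi = window
    query = consensus(profiles, window, agg)
    return mean(pearson(p[lo:hi], query) for p in profiles)


def rank_candidates(candidates, query, window):
    """Order candidate genes by correlation with the consensus query."""
    lo, hi = window
    scores = {g: pearson(p[lo:hi], query) for g, p in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical synthetic profiles: rows are genes, columns are species.
module = [
    [0.9, 0.8, 0.1, 0.2, 0.9, 0.5],
    [0.8, 0.9, 0.2, 0.1, 0.8, 0.4],
    [1.0, 0.7, 0.1, 0.1, 0.9, 0.6],
]
WINDOW = (0, 5)  # hypothetical species window (half-open column range)

candidates = {
    "CAND_1": [0.9, 0.9, 0.0, 0.1, 1.0, 0.2],  # follows the module pattern
    "CAND_2": [0.1, 0.2, 0.9, 0.8, 0.1, 0.9],  # opposite pattern
    "CAND_3": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # flat, uninformative
}

query = consensus(module, WINDOW, agg=mean)
ranking = rank_candidates(candidates, query, WINDOW)
```

Passing `agg=median` swaps the aggregation strategy without touching the retrieval code, which is one simple way to compare strategies on equal footing.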
Module structure tested against a well-characterized functional gene class
Cilia_Module_Validation: Cross-validation of module structure against curated ciliopathy gene evidence
Applies the clustering and annotation logic to ciliopathy genes — a well-characterized functional class with curated external evidence (CiliaCarta, literature). Serves as an interpretability check: module structure should recover known biology before it is trusted to reveal unknown biology.
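One simple way to express "module structure should recover known biology" is an over-representation test of a curated gene set within a module. The sketch below computes a one-sided Fisher's exact p-value through the hypergeometric tail. It is an illustration only: the gene identifiers, set sizes, and the function name `enrichment_p` are hypothetical, and the notebook's actual validation logic is not reproduced here.

```python
from math import comb


def enrichment_p(universe, module, curated):
    """One-sided Fisher's exact p-value: P(overlap >= observed) under random draws."""
    n_universe = len(universe)
    n_curated = len(curated & universe)
    n_module = len(module & universe)
    observed = len(module & curated & universe)
    total = comb(n_universe, n_module)
    tail = sum(
        comb(n_curated, i) * comb(n_universe - n_curated, n_module - i)
        for i in range(observed, min(n_curated, n_module) + 1)
    )
    return tail / total


# Hypothetical synthetic gene universe and curated evidence set.
universe = {f"G{i}" for i in range(20)}
curated_cilia = {"G0", "G1", "G2", "G3", "G4"}

module_hit = {"G0", "G1", "G2", "G3", "G10"}       # 4 of 5 members are curated
module_miss = {"G10", "G11", "G12", "G13", "G14"}  # no curated members

p_hit = enrichment_p(universe, module_hit, curated_cilia)
p_miss = enrichment_p(universe, module_miss, curated_cilia)
```

A small `p_hit` alongside a `p_miss` of 1.0 is the pattern expected when a module recovers the curated class and a control module does not.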
Visualization and translation layers used across multiple pipeline stages
LPP_NPP_Heatmap_Visualization: Publication-ready LPP and NPP phylogenetic profile heatmaps via ComplexHeatmap
R scripts for generating species-aligned, clade-annotated heatmaps from Local and Normalized Phylogenetic Profile matrices. The scripts support single gene lists, multi-cluster layouts with Inclusion Criterion annotation, and both LPP (presence/absence, 0–1) and NPP (z-score, diverging scale) profile types. Output figures were used directly in manuscript preparation.
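The two profile types call for different color scales: a sequential scale for 0–1 values and a diverging scale centered on zero for z-scores. The sketch below illustrates that difference in Python rather than R, to stay in one language across the examples here. It assumes, purely for illustration, that z-score values are obtained by standardizing each species column with the population standard deviation. The README states only that NPP values are on a z-score scale, so the normalization, the function names, and the toy matrix are all assumptions.

```python
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Standardize each column (species); constant columns map to 0.0."""
    columns = list(zip(*matrix))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [
        [(v - mu) / sd if sd > 0 else 0.0 for v, mu, sd in zip(row, mus, sds)]
        for row in matrix
    ]


def diverging_limits(matrix):
    """Symmetric (low, mid, high) breakpoints so that zero sits at the scale center."""
    bound = max(abs(v) for row in matrix for v in row)
    return (-bound, 0.0, bound)


# Hypothetical 0-1 profile matrix: rows are genes, columns are species.
lpp = [
    [1.0, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [1.0, 1.0, 0.5],
    [0.0, 1.0, 0.5],
]

z_matrix = zscore_columns(lpp)
limits = diverging_limits(z_matrix)
```

Breakpoints of this form are what a diverging color mapping (for example `circlize::colorRamp2` in R) would take as input, while the 0–1 matrix can use fixed limits of 0 and 1.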
IRD_HPO_Anatomogram: Translating gene module phenotype signatures into anatomical body maps
Maps HPO phenotype annotations from gene modules to organ systems and renders color-coded anatomograms for clinical and non-computational audiences. The non-trivial step — traversing the HPO DAG from fine-grained terms upward to anatomical categories — is what makes the output meaningful rather than arbitrary.
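The upward traversal described above can be sketched as a walk from each annotated term through its parents until a term designated as an organ-level category is reached. The example below is a minimal sketch on a hypothetical toy DAG: the term names, the `ORGAN_ROOTS` mapping, and the gene annotations are placeholders, not real HPO identifiers or the mapping used in the notebook.

```python
from collections import Counter

# Hypothetical toy DAG (child -> parents); one term has two parents.
PARENTS = {
    "phenotypic_abnormality": [],
    "abn_eye": ["phenotypic_abnormality"],
    "abn_ear": ["phenotypic_abnormality"],
    "abn_kidney": ["phenotypic_abnormality"],
    "abn_retina": ["abn_eye"],
    "retinal_dystrophy": ["abn_retina"],
    "sensorineural_hearing_loss": ["abn_ear"],
    "oculo_renal_finding": ["abn_eye", "abn_kidney"],
}
# Hypothetical mapping from organ-level terms to anatomogram regions.
ORGAN_ROOTS = {"abn_eye": "eye", "abn_ear": "ear", "abn_kidney": "kidney"}


def ancestors(term):
    """Return the term itself plus every ancestor reachable through any parent."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def organs_for_term(term):
    """Organs whose category term lies on some upward path from the term."""
    return {ORGAN_ROOTS[a] for a in ancestors(term) if a in ORGAN_ROOTS}


def organ_profile(gene_terms):
    """Fraction of genes in a module with at least one phenotype in each organ."""
    counts = Counter()
    for terms in gene_terms.values():
        hit = set()
        for term in terms:
            hit |= organs_for_term(term)
        counts.update(hit)  # each gene counts at most once per organ
    return {organ: n / len(gene_terms) for organ, n in counts.items()}


# Hypothetical module annotations.
module = {
    "GENE_A": ["retinal_dystrophy", "sensorineural_hearing_loss"],
    "GENE_B": ["retinal_dystrophy"],
    "GENE_C": ["oculo_renal_finding"],
}

profile = organ_profile(module)
```

Because a term can have several parents, one fine-grained phenotype may contribute to more than one organ, which is why the traversal collects a set of organs rather than a single label. The resulting fractions are the kind of values a color-coded anatomogram would display.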
- Python: Pandas, NumPy, SciPy, scikit-learn, NetworkX, pronto
- R: ComplexHeatmap, circlize, ggplot2, gganatogram
- Methods: semantic similarity, graph-based clustering, phylogenetic profiling, Naive Bayes probabilistic scoring, multi-omics integration
- Lab: Prof. Yuval Tabach Lab, Faculty of Medicine
- Institution: Hebrew University of Jerusalem
- Role: M.Sc. Candidate in Genomics & Bioinformatics
All datasets in this repository are fully synthetic. No real patient data, unpublished genomic results, or proprietary laboratory resources are included. The actual pipeline, its scale, and its findings remain confidential pending publication.
This repository evolves alongside my thesis research and the methodology manuscript currently in preparation, and will continue to expand with new analyses and demonstrations.