Code-only LSF pipeline (manual bsub) for applying gene-specific calibrated AlphaMissense thresholds Chen/Pejaver 2026 to 28 cancer-predisposition genes in BioMe Cohort I (Regeneron) and Cohort II (Sema4), comparing against an existing ACMG P/LP set, and producing regression-ready tables.
All paths below are also encoded in config/config.yaml — that file is the single source of truth. This table is a quick reference.
| Purpose | Path | Notes |
|---|---|---|
| Per-chr VCFs | /sc/arion/projects/rg_huangk06/variants_PLP_BioMe/Regeneron/cancerChr/SINAI_Freeze_Two.GL.pVCF.PASS.onTarget.biallelic.{chr}.cancerGenes.vcf.gz |
{chr} = chr1..chr22 (also chrX exists but no panel genes are on it, so we only iterate autosomes) |
| Phenotype TSV | /sc/arion/projects/rg_huangk06/variants_PLP_BioMe/Regeneron/metadata/RegenWXS_HX_Newgroups.250109.tsv |
ID column: SINAI_ID |
| ACMG P/LP set | /sc/arion/projects/rg_huangk06/variants_PLP_BioMe/AlphaMissense/2cohort/CarrFreq/Regen_VariantsInSamplesPLPorPTV.tsv |
PTV rows excluded (not comparable to AM-missense) |
| Purpose | Path | Notes |
|---|---|---|
| Combined VCF | /sc/private/regen/data/Sema4/BioMe_Sema4_WES.vcf.gz |
READ-ONLY; never write beside it |
Local .csi for the combined VCF |
intermediate/cohortII/BioMe_Sema4_WES.vcf.gz.csi |
Written by index_cohortII_source.lsf (the discrete pre-step). 01 reads via source##idx##local.csi |
| Phenotype TSV | /sc/arion/projects/rg_huangk06/variants_PLP_BioMe/metadata/Sema4WXS_HX_Newgroups.250109.tsv |
ID column: MASKED_MRN |
| ACMG P/LP set | /sc/arion/projects/rg_huangk06/variants_PLP_BioMe/AlphaMissense/2cohort/CarrFreq/Sema4_VariantsInSamplesPLPorPTV.tsv |
Same schema as Cohort I |
| Purpose | Path | Notes |
|---|---|---|
| AlphaMissense scores | /sc/arion/projects/rg_huangk06/variants_PLP_BioMe/AlphaMissense/data/AlphaMissense_hg38.tsv.gz |
Step 00 subsets to the 28 target genes |
| Gencode v32 transcript→gene map | /sc/arion/projects/rg_huangk06/variants_PLP_BioMe/AlphaMissense/data/gencode.v32.transcriptID_genename.tsv |
Step 00 picks the canonical transcript per gene |
| Ancestry PCs | /sc/arion/projects/rg_huangk06/variants_PLP_BioMe/GSA_GDA_PCA_V2.txt |
PC id column auto-discovered (max overlap with phenotype IDs) |
| Chen/Pejaver variant-level calibration table | refs/zenodo/newAM_calibration_table_20260206.csv |
Step 00 wgets from Zenodo record 18668684 and untars |
| Repo location (this code) | /sc/arion/projects/rg_huangk06/variants_PLP_BioMe/Biome-alphamissense-calibration |
Submit scripts compute this at runtime |
-
Prepare references — login node, internet required.
bash scripts/00_prepare_refs.sh
Builds the calibration table, gene→transcript map, exon BED, AM subset (bgzip+tabix), and writes
refs/REFERENCE_REPORT.md. This is a hard gate — review the 28-gene table and coverage counts before submitting anything else. If the Zenodo fetch failed, drop the published calibration TSV at the path you set inconfig.calibration.local_fileand rerun. Likewise for the gencode GTF if the exon-BED source falls back toAM_minmax_fallback. -
Local behavioral tests — no LSF.
bash tests/run_test.sh
Runs steps 01→04 in-process on a synthetic mini dataset, asserts 8 behavioral cases (the 7 from the spec plus an AB-filter-syntax check), and confirms the gVCF guard aborts with the prescribed message.
-
Cohort I (Regeneron, per-chr VCFs) — submit on a login node.
bash scripts/submit_cohortI.sh
Chains
01[1-22]→02[1-22]→02_gather→03→04viabsub -w "done(...)". Each step is submitted via a generated wrapper script that hardcodesBIOAM_REPO_ROOT,COHORT, andCONFIG_PATHasexportstatements inside the wrapper body — this is necessary because Minerva LSF does not propagate user env vars throughbsub -envreliably. -
Cohort II (Sema4, single combined VCF in
/sc/private/) — submit on a login node.bash scripts/submit_cohortII.sh
Chains
index_cohortII_source→01[1-22]→02[1-22]→02_gather→03→04. The pre-step writes a.csifor the read-only source underintermediate/cohortII/; the 22 array tasks thenbcftools view -r chr<N>againstsource##idx##local.csi.
For debugging or re-running a failed array task without re-submitting the whole pipeline:
bash scripts/submit_one_chr.sh cohortI 01_qc_missense 17
bash scripts/submit_one_chr.sh cohortI 02_annotate_am 17Both 01_qc_missense and 02_annotate_am are array steps and can be re-run per chromosome. Other steps (02_gather, 03, 04) are whole-cohort and have no array index.
- Collect deliverables.
results/<cohort>/A_variant_level_per_person.tsvresults/<cohort>/B_ACMG_vs_AM_comparison.tsv+B_summary_by_gene.tsvresults/<cohort>/C_regression_matrix.tsvresults/SUMMARY.md(per-cohort match rates, carrier counts per gene group, gene-coverage table)
| stage | script | input | output |
|---|---|---|---|
| 00 | scripts/00_prepare_refs.sh |
AM TSV, gencode map, Zenodo calibration, gencode v32 GTF | refs/ artifacts + REFERENCE_REPORT.md |
| 01 | scripts/01_qc_missense.lsf (array) |
per-chr VCFs (Cohort I) OR sliced combined VCF (Cohort II) | intermediate/<cohort>/chr<N>.qc.vcf.gz |
| 02 | scripts/02_annotate_am.lsf (array) |
per-chr filtered VCF, AM subset | intermediate/<cohort>/chr<N>.am_annotated.tsv |
| 02g | scripts/02_gather.lsf (one-shot) |
the 22 per-chr TSVs | intermediate/<cohort>/am_annotated.tsv (single header) |
| 03 | scripts/03_call_carriers.lsf |
gathered annotated TSV + thresholds | intermediate/<cohort>/A_variant_level_per_person.tsv |
| 04 | scripts/04_compare_and_tabulate.lsf |
Table A, phenotype, PCs, ACMG P/LP | results/<cohort>/{A,B,B_summary,C}.tsv, results/SUMMARY.md |
| pre | scripts/index_cohortII_source.lsf |
/sc/private/.../Sema4.vcf.gz |
intermediate/cohortII/<basename>.csi |
- Threshold policy. Primary call uses gene-specific calibrated thresholds only. Domain-aggregate thresholds are RECORDED in the output but NEVER used to promote a carrier. Genes with neither calibration are retained in Table A with their
am_pathogenicityandthreshold_source=uncovered, but excluded from the AM columns of Table C. No global 0.864 fallback anywhere. Changecalibration.primary_thresholdinconfig/config.yamlif (and only if) the policy needs to change — there is no other knob. - Missense scope = AM-join. A variant is in-scope iff it matches an AM row for one of the 28 target genes' transcripts. No separate consequence annotator (VEP/SnpEff/...) is invoked.
- SNV-only. Multiallelic records cause a hard abort; AM-matched variants must be single-base SNVs (also enforced).
- No AF filter. The VCFs have no site AF field and the design explicitly forbids one.
- QC.
FILTER==PASSsite filter; per-genotypeDP≥10,GQ≥20, het AB in[0.2, 0.8]; hom-ref and missing dropped. All knobs inconfig.yaml > qc.*. The filter expression itself is centralized inlib/common.sh::build_qc_expr(single source of truth). - gVCF guard. Aborts with the exact message
Input appears to be a combined gVCF, not genotyped calls; point at the GenotypeGVCFs output. - gather-and-verify gate. Step 03 independently re-verifies all 22 per-chr outputs even when submitted with
bsub -w "done(...)"— array tasks can silently fail. - Read-only Cohort II source. Nothing is ever written under
/sc/private/. The source is indexed via a discrete pre-step (NOT inside the array) so tasks 2–22 don't race the index. - ACMG comparison. Default filter:
annotationcolumn contains"P/LP". PTV rows are excluded with a warning that PTV is not comparable to AM-missense (config.acmg.annotation_filter_substr). - Per-cohort ID columns. Cohort I links phenotype on
SINAI_ID; Cohort II onMASKED_MRN. Configurable; match rate is reported and low rates trigger a warning that prints up to 5 unmatched IDs from each side.
- Paths / config:
config/config.yaml— everything is there. Do not hard-code paths in scripts. - QC parameters:
config.yaml > qc:. Reflected automatically intolib/common.sh::build_qc_expr. - Module names (Minerva-specific):
lib/load_modules.sh. Lines are markedVERIFYbecause Minerva module names are site-specific. - bcftools filter syntax (per-version differences):
lib/common.sh::build_qc_expr. The script's banner printsbcftools --versionand pokes the expression at job start so a mismatch fails fast. - Gene panel / Table S5 groups:
config.yaml > target_genesandgene_groups.
The phenotype TSVs, the masked-MRN bridge, the PCA file, and any Tables A/B/C derived from real BioMe samples are patient-identifiable and must never land in git. Two layers of defense:
.gitignoreblocks the filename patterns directly (real-data file names plusintermediate/andresults/). Synthetic fixtures undertests/data/are intentionally allowed.- Install the pre-commit hook in each clone before committing:
The hook rejects staged paths matching the real-data patterns and scans staged content for
ln -sf ../../scripts/pre-commit.sh .git/hooks/pre-commit chmod +x .git/hooks/pre-commit
SINAI_*IDs and the MRN-map header. Runbash scripts/pre-commit.shdirectly to test.
If a sensitive file ever slips in, halt and rewrite history before pushing.
- CPU only. GPU not used: this is
bcftoolsstreaming + pandas joins; no model inference, no GPU-amenable workload here. AM scores are precomputed in the public table. - Modules expected on Minerva (VERIFY against your site names):
bcftools,htslib(providesbgzip+tabix),python(≥3.9 withpandas,pyyaml,requests). - Conda fallback when modules aren't available:
conda env create -f environment.yml && conda activate biome-am.
- VCFs are GenotypeGVCFs output (biallelic, real GTs). The gVCF guard catches the most common slip.
- AM transcript IDs match gencode v32 transcript IDs by exact versioned ID (with a version-stripped fallback). Any gene whose AM transcript can't be located is listed in
REFERENCE_REPORT.md. - PCs are joined on a column auto-discovered by maximum overlap with phenotype IDs; can be overridden via
config.yaml > pcs.id_column. - Calibration thresholds are fetched/parsed heuristically; if the Zenodo file format changes,
python/prepare_refs.py::_try_parse_calibration_tableis the only spot to update.
The pipeline is intentionally stage-wise and idempotent. To rerun a single chromosome: delete its intermediate/<cohort>/chr<N>.qc.vcf.gz (and the matching .am_annotated.tsv) and bsub only that array index (e.g. bsub -J biome_am_01_cohortI[5] after editing the array spec, or run the body directly with LSB_JOBINDEX=5 COHORT=cohortI bash scripts/01_qc_missense.lsf). To rerun a whole cohort, delete intermediate/<cohort>/ and rerun submit_<cohort>.sh.