name	IOBRskill
description	End-to-end bulk RNA-seq / microarray tumor microenvironment (TME) analysis using the IOBR R package. Covers data preprocessing, gene annotation, TME deconvolution (CIBERSORT/MCPcounter/EPIC/xCell/ESTIMATE/TIMER/quanTIseq/IPS), signature scoring (PCA/ssGSEA/Z-score), ligand-receptor analysis, pathway enrichment, survival modeling, and Nature-style visualization. Use this skill when the user mentions: tumor microenvironment analysis, TME deconvolution, immune infiltration analysis, 肿瘤微环境分析, 肿瘤微环境解析, 免疫浸润分析, ligand-receptor analysis, 受体配体分析, IOBR, immune cell deconvolution, CIBERSORT, MCPcounter, ESTIMATE, or any bulk RNA-seq immuno-oncology workflow. Also trigger on /IOBRskill.

name

IOBRskill

description

End-to-end bulk RNA-seq / microarray tumor microenvironment (TME) analysis using the IOBR R package. Covers data preprocessing, gene annotation, TME deconvolution (CIBERSORT/MCPcounter/EPIC/xCell/ESTIMATE/TIMER/quanTIseq/IPS), signature scoring (PCA/ssGSEA/Z-score), ligand-receptor analysis, pathway enrichment, survival modeling, and Nature-style visualization. Use this skill when the user mentions: tumor microenvironment analysis, TME deconvolution, immune infiltration analysis, 肿瘤微环境分析, 肿瘤微环境解析, 免疫浸润分析, ligand-receptor analysis, 受体配体分析, IOBR, immune cell deconvolution, CIBERSORT, MCPcounter, ESTIMATE, or any bulk RNA-seq immuno-oncology workflow. Also trigger on /IOBRskill.

IOBRskill — Bulk Transcriptome Tumor Microenvironment Analysis

IOBR Citation (MUST include in every R script and log)

Every R script generated by this skill MUST begin with this citation header:

###############################################################################
# IOBR Citation:
# Zeng DQ, Fang YR, … , Yu GC, Liao WJ.
# Enhancing Immuno-Oncology Investigations Through Multidimensional Decoding of
# Tumour Microenvironment with IOBR 2.0, Cell Reports Methods, 2024
# https://doi.org/10.1016/j.crmeth.2024.100910
###############################################################################

The same citation header MUST appear at the top of every log file in 06-log/.

Prerequisites

Environment Check (first step)

Verify IOBR is installed in the base conda environment:

source /home/bio/miniconda3/etc/profile.d/conda.sh
conda activate base
Rscript -e 'packageVersion("IOBR")'

If IOBR is not installed:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IOBR/IOBR")

Required Dependencies

IOBR relies on several packages that are NOT auto-installed. Install these before starting:

# For CIBERSORT
BiocManager::install("preprocessCore")

# For quanTIseq deconvolution
install.packages("limSolve", repos = "https://cloud.r-project.org")

# For tme_cluster()
install.packages("NbClust", repos = "https://cloud.r-project.org")

# For iobr_pca()
install.packages("FactoMineR", repos = "https://cloud.r-project.org")

# For sig_heatmap() — only if using tidyHeatmap approach (pheatmap preferred)
# install.packages("tidyHeatmap", repos = "https://cloud.r-project.org")

IOBR Built-in Data

IOBR ships with rich built-in data — you rarely need external annotation files. For full details, read references/iobr_built_in_data.md. Key resources:

Resource	Access	What It Contains
Gene length / ID mapping	`count2tpm()` auto-loads	111 genes with HUGO, ENTREZID, ENSEMBL
Microarray probe sets	`extdata/probesets.txt`	176 probes mapped to cell types
TME signatures (186)	`IOBR:::signature_tme`	Immune, stromal, EMT, checkpoint signatures
Full collection (323)	`data("signature_collection")`	All signatures + Hallmark + Metabolism
Signature groupings	`data("sig_group")`	35+ categories for filtering signatures
Pathway data (on-demand)	`IOBR::load_data()`	hallmark (50), go_bp (7658), kegg (186), reactome (1615)
Example datasets	`data("pdata_stad")` etc.	TCGA-STAD, IMvigor210 toy data

On-demand pathway data via `load_data()`

IOBR provides IOBR::load_data() to download and cache pathway gene sets from GitHub on first use. These are essential for comprehensive pathway scoring:

hallmark  <- IOBR::load_data("hallmark")   # 50 pathways
go_bp     <- IOBR::load_data("go_bp")      # 7,658 gene sets
go_cc     <- IOBR::load_data("go_cc")
go_mf     <- IOBR::load_data("go_mf")
kegg      <- IOBR::load_data("kegg")       # 186 pathways
reactome  <- IOBR::load_data("reactome")   # 1,615 pathways

Project Directory Structure

Create this structure before starting analysis:

project/
├── 01-script/         # R scripts (numbered, each with IOBR citation header)
├── 02-input/          # Normalized and annotated expression matrices
│   └── pdata_summary.csv          # Phenotype variable summary
├── 03-tme/            # TME deconvolution and signature scoring results
├── 04-figs/
│   ├── *.png/pdf      # All figures (PNG 300dpi + PDF, numbered naming)
│   ├── kmplot/        # KM survival plots (auto-saved by sig_surv_plot)
│   └── data/          # Statistical result tables matching figures
├── 05-note/
│   ├── IOBR-pipeline.md           # Analysis plan (generated in Phase 0)
│   └── IOBR-analysis-README.md    # Final analysis summary
└── 06-log/            # R execution logs (one per script, with citation header)

Naming conventions:

Scripts: 01-data_preprocessing.R, 02-tme_deconvolution.R, ...
Figures: Fig01-barplot_cibersort.png + Fig01-barplot_cibersort.pdf (always dual format)
Statistical tables: 04-figs/data/01-tme_pdata_merged.csv, 04-figs/data/04-wilcoxon_results.csv, ...
Logs: 01-data_preprocessing.log, 02-tme_deconvolution.log, ...
Note: IOBR-analysis-README.md (tree diagram + brief interpretation)

When running a script, capture the log:

Rscript 01-script/01-data_preprocessing.R > 06-log/01-data_preprocessing.log 2>&1

Analysis Pipeline

6 sequential phases (Phase 0–5). Pause at decision points to let the user choose.

Phase 0: Plan Generation

Before writing any analysis code, generate an analysis plan document.

After confirming data type / species / deconvolution methods with the user (Phase 1 questions), create 05-note/IOBR-pipeline.md with:

Project metadata: project name, data source, data type, species
ASCII tree diagram of all scripts, their inputs/outputs, and expected results:

01-data_preprocessing.R
  ├─ Input:  raw expression matrix + phenotype data
  ├─ Output: 02-input/annotated_eset.csv, 02-input/pdata.csv
  └─ Notes:  log2 transformation, gene ID conversion

02-tme_deconvolution.R
  ├─ Input:  02-input/annotated_eset.csv
  ├─ Output: 03-tme/tme_cibersort.csv, 03-tme/tme_mcpcounter.csv, ...
  └─ Notes:  CIBERSORT + MCPcounter + ESTIMATE (linear-scale data)

03-signature_analysis.R
  ├─ Input:  02-input/annotated_eset.csv
  ├─ Output: 03-tme/sig_score_tme.csv, 03-tme/sig_score_pathway.RData
  └─ Notes:  ssGSEA for TME signatures, Hallmark + KEGG pathways

04-statistical_analysis.R
  ├─ Input:  03-tme/*.csv + 02-input/pdata.csv
  ├─ Output: 04-figs/data/01-tme_pdata_merged.csv
  │          04-figs/data/02-tme_scaled.csv
  │          04-figs/data/03-tme_subtype.csv
  │          04-figs/data/04-wilcoxon_results.csv
  │          04-figs/data/05-surv_results.csv
  │          04-figs/data/06-cor_matrix.csv
  └─ Notes:  merge → scale → cluster → wilcoxon → survival → correlation

05-visualization.R
  ├─ Input:  04-figs/data/*.csv + 03-tme/*.csv
  ├─ Output: 04-figs/Fig01-09 (PNG + PDF)
  └─ Notes:  barplots → heatmaps → correlation → boxplots → forest → KM

Present to user: "开始分析" or "调整计划"
Only after approval, begin Phase 1

Phase 1: Data Preprocessing and Annotation

Script: 01-data_preprocessing.R

Confirm with the user:

Data source: User matrix, GEO (GSE#####), or TCGA
Data type: "Count" (raw RNA-seq), "Tpm" (already normalized), or "Array" (microarray)
Species: "hsa" (human) or "mmu" (mouse)

Sample matching (if both eset and pdata exist):

Before any analysis, verify sample overlap between expression matrix and phenotype data:

# Report sample counts
eset_samples  <- colnames(eset)
pdata_samples <- pdata$ID   # or rownames(pdata)
intersection  <- intersect(eset_samples, pdata_samples)

cat("Expression matrix samples:", length(eset_samples), "\n")
cat("Phenotype data samples:",    length(pdata_samples), "\n")
cat("Matched samples:",           length(intersection), "\n")

# Ask user: use all samples, intersection only, or check data?
# Then filter both to agreed sample set
eset  <- eset[, intersection]
pdata <- pdata[pdata$ID %in% intersection, ]

pdata variable summary:

After loading pdata, generate a summary report:

# Generate pdata_summary.csv
pdata_summary <- data.frame(
  var       = character(),
  type      = character(),
  n_levels  = integer(),
  levels    = character(),
  stringsAsFactors = FALSE
)

for (col in colnames(pdata)) {
  vals <- pdata[[col]]
  if (is.numeric(vals)) {
    tp <- ifelse(length(unique(vals)) <= 2, "binary", "numeric")
    lvls <- paste(range(vals, na.rm = TRUE), collapse = " - ")
    n_lv <- length(unique(vals))
  } else {
    tp <- "factor"
    lvls <- paste(unique(vals)[1:min(10, length(unique(vals)))], collapse = ", ")
    n_lv <- length(unique(vals))
  }
  pdata_summary <- rbind(pdata_summary, data.frame(
    var = col, type = tp, n_levels = n_lv, levels = lvls
  ))
}

write.csv(pdata_summary, file = "02-input/pdata_summary.csv", row.names = FALSE)
cat(">>>>> pdata summary saved to 02-input/pdata_summary.csv\n")

This summary helps decide which statistical tests to apply in Phase 4.

Key functions:

Function	Purpose	When to Use
`count2tpm()`	Count → TPM + gene ID mapping	Raw RNA-seq counts
`log2eset()`	Auto-detect and apply log2 transform	Ensures log2 scale for downstream analysis
`anno_eset()`	Probe → gene symbol annotation	Microarray data or unannotated matrices
`remove_batcheffect()`	ComBat / ComBat-seq batch correction	Multi-batch datasets
`find_outlier_samples()`	WGCNA-based outlier detection	QC step
`iobr_pca()`	PCA visualization for batch assessment	QC step
`mouse2human_eset()`	Mouse → human gene ortholog conversion	Mouse data

# RNA-seq: counts → TPM
eset <- count2tpm(countMat = expr_matrix, idType = "Ensembl", org = "hsa")

# Microarray: probe annotation
eset <- anno_eset(eset = expr_matrix, annotation = anno_df,
                  symbol = "gene_symbol", probe = "probe_id", method = "mean")

# Ensure log2 scale (auto-detects — only transforms if not already log2)
eset <- log2eset(eset)

Save output:

write.csv(eset,  file = "02-input/annotated_eset.csv")
write.csv(pdata, file = "02-input/pdata.csv")

Phase 2: TME Deconvolution

Script: 02-tme_deconvolution.R

Ask the user which method(s) to use.

Method	Cell Types	Best For
`"cibersort"`	22 (LM22)	Gold standard, comprehensive
`"cibersort_abs"`	22 (LM22)	CIBERSORT absolute mode (fraction-independent)
`"mcpcounter"`	8 populations	Quick, stromal + immune
`"epic"`	8 incl. cancer cells	Tumor purity
`"xcell"`	64 types	Broadest coverage (slow)
`"estimate"`	4 scores	Stromal/Immune/Tumor purity
`"timer"`	6 types	TCGA cancer-specific
`"quantiseq"`	10 incl. M1/M2	Macrophage polarization
`"ips"`	4 axes	Immunotherapy response

Recommended: CIBERSORT + MCPcounter + ESTIMATE (most cited). For comprehensive analysis, run all 9 methods and merge.

Single-method usage:

tme_result <- deconvo_tme(eset = eset, method = "cibersort", arrays = FALSE, perm = 1000)
write.csv(tme_result, file = "03-tme/tme_cibersort.csv")

Full 9-method deconvolution + merge (production workflow):

# Each method may take 1-5 minutes depending on sample size
# Use tryCatch for each method — some may fail on certain datasets
cibersort <- tryCatch(deconvo_tme(eset = eset, method = "cibersort",
                                   arrays = FALSE, perm = 1000), error = function(e) NULL)
cibersort_abs <- tryCatch(deconvo_tme(eset = eset, method = "cibersort_abs",
                                       arrays = FALSE, perm = 1000), error = function(e) NULL)
epic      <- tryCatch(deconvo_tme(eset = eset, method = "epic",
                                   arrays = FALSE), error = function(e) NULL)
mcp       <- tryCatch(deconvo_tme(eset = eset, method = "mcpcounter"), error = function(e) NULL)
xcell     <- tryCatch(deconvo_tme(eset = eset, method = "xcell",
                                   arrays = FALSE), error = function(e) NULL)
estimate  <- tryCatch(deconvo_tme(eset = eset, method = "estimate"), error = function(e) NULL)
timer     <- tryCatch(deconvo_tme(eset = eset, method = "timer",
                                   group_list = rep(tumor_type, ncol(eset))), error = function(e) NULL)
quantiseq <- tryCatch(deconvo_tme(eset = eset, method = "quantiseq",
                                   tumor = TRUE, arrays = FALSE, scale_mrna = TRUE), error = function(e) NULL)
ips       <- tryCatch(deconvo_tme(eset = eset, method = "ips", plot = FALSE), error = function(e) NULL)

# Merge all results by sample ID using base R merge()
results_list <- list(cibersort, cibersort_abs, mcp, xcell, epic, estimate, quantiseq, timer, ips)
results_list <- results_list[!sapply(results_list, is.null)]

tme_combine <- results_list[[1]]
for (i in seq_along(results_list)[-1]) {
  tme_combine <- merge(tme_combine, results_list[[i]], by = "ID")
}

save(tme_combine, file = "03-tme/tme_combine.RData")
write.csv(tme_combine, file = "03-tme/tme_combine.csv", row.names = FALSE)

Critical caveats:

All deconvolution methods (except IPS) need linear-scale data — Anti-log first: eset_linear <- 2^eset if log2-transformed. CIBERSORT, MCPcounter, ESTIMATE, EPIC, quanTIseq, xCell all warn "Data values appear small (<50)" when fed log2 data. Only IPS works with log2 data directly.
Column names have method suffix — T_cells_CD8_CIBERSORT, T_cells_MCPcounter, etc.
Output CSV has ID column — Set rownames: rownames(df) <- df$ID; df$ID <- NULL.
arrays = TRUE for microarray; perm = 1000 recommended for CIBERSORT.
Required packages — preprocessCore (CIBERSORT), limSolve (quanTIseq). Install before running.
TIMER requires group_list — Provide cancer type per sample: rep(tumor_type, ncol(eset)).
quanTIseq parameters — Use tumor = TRUE, arrays = FALSE, scale_mrna = TRUE for tumor samples.
Use tryCatch for each method — some methods may fail on certain datasets. Wrap each call to avoid losing partial results.

Phase 3: Signature Scoring and Pathway Analysis

Script: 03-signature_analysis.R

Scoring methods (ask user to choose):

Method	Best For
`"ssgsea"`	Default, robust
`"pca"`	Continuous scores, large gene sets
`"zscore"`	Simple, fast
`"integration"`	Combined approach

Step 3a: TME / IO biomarker signature scoring:

# Option A: TME signatures (186 sets, internal — use :::)
signature_tme <- IOBR:::signature_tme
sig_tme <- calculate_sig_score(pdata = NULL, eset = eset,
                                signature = signature_tme,
                                method = "ssgsea",
                                mini_gene_count = 3,
                                adjust_eset = TRUE)
write.csv(sig_tme, file = "03-tme/sig_score_tme.csv")

# Option B: Full collection (323 sets) — for comprehensive analysis
data("signature_collection")
sig_res <- calculate_sig_score(pdata = NULL, eset = eset,
                                signature = signature_collection,
                                method = "pca",
                                mini_gene_count = 2,
                                adjust_eset = TRUE)
write.csv(sig_res, file = "03-tme/sig_score_collection.csv")

Step 3b: Pathway scoring (Hallmark + GO + KEGG + Reactome):

hallmark  <- IOBR::load_data("hallmark")    # 50 pathways
go_bp     <- IOBR::load_data("go_bp")       # 7,658 gene sets
go_cc     <- IOBR::load_data("go_cc")
go_mf     <- IOBR::load_data("go_mf")
kegg      <- IOBR::load_data("kegg")        # 186 pathways
reactome  <- IOBR::load_data("reactome")    # 1,615 pathways

sig_pathway <- calculate_sig_score(pdata = NULL, eset = eset,
                                    signature = c(hallmark, go_bp, go_cc, go_mf, kegg, reactome),
                                    method = "ssgsea",
                                    mini_gene_count = 3)
write.csv(sig_pathway, file = "03-tme/sig_score_pathway.csv")

Step 3c: Merge all results into master table:

tme_sig_combine <- merge(tme_combine, sig_res, by = "ID")
tme_sig_combine <- merge(tme_sig_combine, sig_pathway, by = "ID")
write.csv(tme_sig_combine, file = "03-tme/tme_sig_combine.csv", row.names = FALSE)

Key notes:

pdata = NULL, adjust_eset = TRUE — This is the correct production pattern for calculate_sig_score(). Do NOT set adjust_eset = FALSE unless data is already properly scaled.
Method names are lowercase — Use "pca", "ssgsea", "zscore", "integration" (NOT uppercase "PCA").
signature_tme is internal — Access via IOBR:::signature_tme (triple colon).
signature_collection needs data() — Call data("signature_collection") before use.
IOBR::load_data() downloads on first use — Caches locally for subsequent calls.
Pathway scoring is computationally heavy — Consider running Hallmark (50) and KEGG (186) first for quick results, then GO/Reactome for comprehensive analysis.
Use base R merge() for joining — Not all environments have tidyverse. Prefer merge(df1, df2, by = "ID") over %>% inner_join().
tme_cluster() requires NbClust — Install: install.packages("NbClust").

Use references/iobr_built_in_data.md for the full signature catalog and data("sig_group") for filtering by category.

TME subtype clustering:

tme_clusters <- tme_cluster(input = tme_result, features = cell_types,
                             id = "sample_id", method = "kmeans")

Phase 4: Statistical Analysis

Script: 04-statistical_analysis.R

This phase produces structured output: every statistical result is saved to 04-figs/data/ as a numbered CSV file for traceability and reuse in Phase 5 visualization.

4a. Data Preparation

Merge TME results with pdata, then scale all TME cell columns:

# Load data
pdata <- read.csv("02-input/pdata.csv")
tme   <- read.csv("03-tme/tme_combine.csv")

# Merge pdata with TME results
tme_pdata <- merge(pdata, tme, by = "ID")
write.csv(tme_pdata, file = "04-figs/data/01-tme_pdata_merged.csv", row.names = FALSE)

# Collect TME cell columns — MUST match actual column names in merged data
# After merge, cibersort + cibersort_abs shared names get .x/.y suffixes
# So match by checking actual colname OR colname-without-suffix against original list
qc_patterns <- c("^P[.]", "^Correlation", "^RMSE")
all_tme_cell_cols <- colnames(tme_pdata)[
  colnames(tme_pdata) %in% original_cell_cols |
  gsub("\\.[xy]$", "", colnames(tme_pdata)) %in% original_cell_cols
]
all_tme_cell_cols <- setdiff(all_tme_cell_cols, "ID")

# Z-score scale all TME cell columns (per column = per variable)
tme_scaled <- tme_pdata
tme_scaled[, all_tme_cell_cols] <- scale(tme_scaled[, all_tme_cell_cols])
write.csv(tme_scaled, file = "04-figs/data/02-tme_scaled.csv", row.names = FALSE)

4b. TME Subtyping

Cluster samples by TME cell composition using CIBERSORT fractions:

# Extract CIBERSORT cell columns for clustering
cb_cols <- grep("_CIBERSORT$", colnames(tme_pdata), value = TRUE)
cb_cols <- setdiff(cb_cols, grep("P[.]value|Correlation|RMSE", cb_cols, value = TRUE))

# K-means clustering on scaled CIBERSORT data
set.seed(123)
tme_subtype <- tme_cluster(input = tme_scaled, features = cb_cols,
                            id = "ID", method = "kmeans", max.nc = 6)

# tme_cluster returns data.frame with columns: ID, cluster, <cell_cols>
# Extract cluster assignments and label as TME subtypes
subtype_df <- data.frame(ID = tme_subtype$ID, TME_subtype = paste0("TME", tme_subtype$cluster))
write.csv(subtype_df, file = "04-figs/data/03-tme_subtype.csv", row.names = FALSE)

# Find representative TME variables per subtype
markers <- find_markers_in_bulk(pdata = tme_scaled, eset = tme_scaled,
                                 group = "TME_subtype", nfeatures = 50)
write.csv(markers, file = "04-figs/data/03-tme_subtype_markers.csv", row.names = FALSE)

4c. Statistical Testing (conditional)

Run statistical tests based on what's available in pdata:

# --- Wilcoxon test (IF categorical variables exist in pdata) ---
# Check pdata_summary.csv for factor/binary variables
pdata_summary <- read.csv("02-input/pdata_summary.csv")
cat_vars <- pdata_summary$var[pdata_summary$type %in% c("factor", "binary")]

if (length(cat_vars) > 0) {
  for (v in cat_vars) {
    lvls <- unique(tme_pdata[[v]])
    lvls <- lvls[!is.na(lvls)]
    if (length(lvls) == 2) {
      wilcox_res <- batch_wilcoxon(data = tme_pdata, target = v, feature = cell_cols)
      write.csv(wilcox_res,
                file = paste0("04-figs/data/04-wilcoxon_", v, ".csv"),
                row.names = FALSE)
    }
  }
}

# --- Survival analysis (IF survival data exists) ---
# Check for time + status columns
time_cols   <- grep("time|Time|OS_time|PFS_time", colnames(tme_pdata), value = TRUE)
status_cols <- grep("status|Status|OS_status|PFS_status", colnames(tme_pdata), value = TRUE)

if (length(time_cols) > 0 && length(status_cols) > 0) {
  time_col   <- time_cols[1]
  status_col <- status_cols[1]

  # Batch Cox regression for all TME cell columns
  surv_res <- batch_surv(pdata = tme_pdata, variable = cell_cols,
                          time = time_col, status = status_col)
  write.csv(surv_res, file = "04-figs/data/05-surv_results.csv", row.names = FALSE)

  # Per-method survival (if multiple deconvolution methods were run)
  methods_run <- unique(gsub(".*_", "", cell_cols))
  for (m in methods_run) {
    m_cols <- grep(paste0("_", m, "$"), cell_cols, value = TRUE)
    if (length(m_cols) > 0) {
      m_surv <- batch_surv(pdata = tme_pdata, variable = m_cols,
                           time = time_col, status = status_col)
      write.csv(m_surv,
                file = paste0("04-figs/data/05-surv_", m, ".csv"),
                row.names = FALSE)
    }
  }
}

Key notes for survival:

batch_surv() requires merged pdata — All variables must be columns in pdata. Use merge() first.
batch_surv() parameter is variable — A vector of column names, NOT a separate data frame.
batch_surv() output columns: P, HR, CI_low_0.95, CI_up_0.95, ID.

4d. Correlation

Correlation matrix of all TME variables:

# batch_cor(data, target, feature, method) — correlates one target vs multiple features
# NOTE: batch_cor() returns target-vs-each-feature, NOT a pairwise matrix.
# For a full pairwise correlation matrix, use base R cor() instead (see below).
# Key immune cells as target, all TME cells as features
key_cells <- c("T_cells_CD8_CIBERSORT", "Macrophages_M1_CIBERSORT")
cor_all <- data.frame()
for (tc in key_cells) {
  cor_res <- batch_cor(data = tme_pdata, target = tc, feature = cell_cols, method = "spearman")
  cor_res$target <- tc
  cor_all <- rbind(cor_all, cor_res)
}
write.csv(cor_all, file = "04-figs/data/06-cor_matrix.csv", row.names = FALSE)

# For full pairwise correlation matrix (for heatmap visualization), use base R:
cor_mat <- cor(tme_pdata[, cell_cols], method = "spearman", use = "pairwise.complete.obs")
save(cor_mat, file = "04-figs/data/06-cor_matrix_full.RData")

4e. Other Statistical Functions

Function	Purpose	Output
`batch_pcc()`	Batch partial correlation controlling for a third variable	Statistics table
`iobr_deg()`	DEG using DESeq2 or limma, with built-in volcano plot and heatmap	Statistics + figures
`sig_gsea()`	Gene Set Enrichment Analysis via fgsea, with built-in plots	Statistics + figures
`subgroup_survival()`	Subgroup Cox analysis by covariates	Statistics table
`LR_cal()`	Ligand-receptor interaction analysis	Statistics + figures
`find_mutations()`	Mutation-TME interaction (TCGA only)	Statistics table
`feature_manipulation()`	Clean features: remove NA, outliers, zero-variance	Cleaned data

Phase 5: Visualization

Script: 05-visualization.R

ALL TME data MUST be z-score scaled before visualization. Statistical result tables are in 04-figs/data/.

Every figure MUST:

Use IOBR's built-in visualization functions whenever available
Export as both PNG (300dpi) + PDF
Follow the strict ordering below

IMPORTANT: Add pdf(NULL) at the top of the visualization script to suppress stray Rplots.pdf from IOBR internal plot calls. Also clean up at the end: if (file.exists("Rplots.pdf")) file.remove("Rplots.pdf").

sig_box Standard Parameters (MUST use for ALL boxplots)

Every sig_box() call in this pipeline MUST use these fixed parameters:

sig_box(data = data, signature = feature_col, variable = group_col,
        palette = "paired1", show_pairwise_p = TRUE,
        angle_x_text = 60, hjust = 1,
        size_of_pvalue = 3, size_of_font = 5)

For multi-panel layouts, combine individual sig_box() plots with patchwork::wrap_plots().

Fig 01: CIBERSORT Barplot (22 cell types)

IMPORTANT: Only methods that produce cell FRACTION data (CIBERSORT, EPIC) should use cell_bar_plot(). ESTIMATE scores are NOT fractions.

tme_cb <- read.csv("03-tme/tme_cibersort.csv")

# Filter QC columns
qc_cols <- grep("^[Pp][.]", colnames(tme_cb), value = TRUE)
qc_cols <- c(qc_cols, grep("^Correlation", colnames(tme_cb), value = TRUE),
                  grep("^RMSE", colnames(tme_cb), value = TRUE))
cell_cols_cb <- setdiff(grep("_CIBERSORT$", colnames(tme_cb), value = TRUE), qc_cols)

p1 <- cell_bar_plot(input = tme_cb, id = "ID", features = cell_cols_cb,
                     title = "CIBERSORT TME Composition",
                     coord_flip = TRUE, palette = 4, legend.position = "right")
p1 <- p1 + guides(fill = guide_legend(ncol = 1)) +
     theme(legend.text = element_text(size = 5),
           legend.key.size = unit(2.5, "mm"),
           legend.title = element_blank(),
           plot.title = element_text(size = 10, hjust = 0.5))
if (nrow(tme_cb) > 50) {
  p1 <- p1 + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
}

ggsave("04-figs/Fig01-barplot_cibersort.png", p1, width = 180, height = 250, units = "mm", dpi = 300)
ggsave("04-figs/Fig01-barplot_cibersort.pdf", p1, width = 180, height = 250, units = "mm")

Fig 02: EPIC Barplot (8 cell types)

tme_ep <- read.csv("03-tme/tme_epic.csv")
epic_cols <- grep("_EPIC$", colnames(tme_ep), value = TRUE)

p2 <- cell_bar_plot(input = tme_ep, id = "ID", features = epic_cols,
                     title = "EPIC TME Composition",
                     coord_flip = TRUE, palette = 4, legend.position = "right")
p2 <- p2 + guides(fill = guide_legend(ncol = 1)) +
     theme(legend.text = element_text(size = 5),
           legend.key.size = unit(2.5, "mm"),
           legend.title = element_blank(),
           plot.title = element_text(size = 10, hjust = 0.5))

ggsave("04-figs/Fig02-barplot_epic.png", p2, width = 180, height = 250, units = "mm", dpi = 300)
ggsave("04-figs/Fig02-barplot_epic.pdf", p2, width = 180, height = 250, units = "mm")

Fig 03a-c: TME Heatmaps WITHOUT Subtype (z-score)

Cluster both rows and columns, no subtype annotation. Use pheatmap.

plot_heatmap_no_subtype <- function(data, cols, title, prefix,
                                     width = 8, height = 6) {
  mat <- t(as.matrix(data[, cols]))  # variables as rows, samples as columns
  mat[mat > 2] <- 2; mat[mat < -2] <- -2
  rownames(mat) <- sub("\\.[xy]$", "", rownames(mat))
  colnames(mat) <- data$ID

  for (fmt in c("pdf", "png")) {
    pheatmap::pheatmap(mat,
      color = colorRampPalette(c("#2166AC", "white", "#B2182B"))(100),
      show_colnames = FALSE, cluster_rows = TRUE, cluster_cols = TRUE,
      treeheight_row = 15, treeheight_col = 15, fontsize_row = 8,
      main = title,
      filename = paste0("04-figs/", prefix, ".", fmt),
      width = width, height = height, dpi = 300)
  }
}

# Fig03a: CIBERSORT
plot_heatmap_no_subtype(tme_scaled, cb_sc, "CIBERSORT TME Landscape",
                        "Fig03a-heatmap_cibersort")

# Fig03b: MCPcounter
mcp_sc <- grep("MCPcounter$", colnames(tme_scaled), value = TRUE)
plot_heatmap_no_subtype(tme_scaled, mcp_sc, "MCPcounter TME Landscape",
                        "Fig03b-heatmap_mcpcounter", width = 6, height = 5)

# Fig03c: xCell
xc_sc <- grep("xCell$", colnames(tme_scaled), value = TRUE)
plot_heatmap_no_subtype(tme_scaled, xc_sc, "xCell TME Landscape",
                        "Fig03c-heatmap_xcell", width = 10, height = 8)

Fig 04a-d: Survival Forest Plots (IF survival data exists)

USE sig_forest() — IOBR's built-in forest plot function. MUST filter extreme HR before plotting: remove variables with abs(HR) > 10 to prevent sig_forest() from producing blank plots (its coord_trans(x = "log2") only has 3 fixed breaks that break with extreme HR ranges).

if (length(time_cols) > 0 && length(status_cols) > 0) {
  time_col   <- time_cols[1]
  status_col <- status_cols[1]

  # Fig04a: CIBERSORT survival forest
  cb_surv <- batch_surv(pdata = tme_pdata, variable = cb_sc,
                        time = time_col, status = status_col)
  write.csv(cb_surv, file = "04-figs/data/05-surv_cibersort.csv", row.names = FALSE)
  cb_surv_filt <- cb_surv[abs(as.numeric(cb_surv$HR)) <= 10, ]
  if (nrow(cb_surv_filt) > 0) {
    p04a <- sig_forest(cb_surv_filt, signature = "ID", n = 20)
    ggsave("04-figs/Fig04a-forest_cibersort.png", p04a, width = 10, height = 6, units = "in", dpi = 300)
    ggsave("04-figs/Fig04a-forest_cibersort.pdf", p04a, width = 10, height = 6)
  }

  # Fig04b: MCPcounter + EPIC + xCell survival forest
  other_cols <- c(mcp_sc, epic_sc, xc_sc)
  other_surv <- batch_surv(pdata = tme_pdata, variable = other_cols,
                           time = time_col, status = status_col)
  write.csv(other_surv, file = "04-figs/data/05-surv_other_tme.csv", row.names = FALSE)
  other_surv_filt <- other_surv[abs(as.numeric(other_surv$HR)) <= 10, ]
  if (nrow(other_surv_filt) > 0) {
    p04b <- sig_forest(other_surv_filt, signature = "ID", n = 20)
    ggsave("04-figs/Fig04b-forest_other_tme.png", p04b, width = 10, height = 6, units = "in", dpi = 300)
    ggsave("04-figs/Fig04b-forest_other_tme.pdf", p04b, width = 10, height = 6)
  }

  # Fig04c: TME signatures survival forest
  tme_sig_surv <- batch_surv(pdata = tme_pdata, variable = tme_sig_cols,
                             time = time_col, status = status_col)
  write.csv(tme_sig_surv, file = "04-figs/data/05-surv_tme_signatures.csv", row.names = FALSE)
  tme_sig_surv_filt <- tme_sig_surv[abs(as.numeric(tme_sig_surv$HR)) <= 10, ]
  if (nrow(tme_sig_surv_filt) > 0) {
    p04c <- sig_forest(tme_sig_surv_filt, signature = "ID", n = 20)
    ggsave("04-figs/Fig04c-forest_tme_signature.png", p04c, width = 10, height = 6, units = "in", dpi = 300)
    ggsave("04-figs/Fig04c-forest_tme_signature.pdf", p04c, width = 10, height = 6)
  }

  # Fig04d: GO/KEGG/Hallmark survival forest
  pathway_surv <- batch_surv(pdata = tme_pdata, variable = pathway_cols,
                             time = time_col, status = status_col)
  write.csv(pathway_surv, file = "04-figs/data/05-surv_pathways.csv", row.names = FALSE)
  pathway_surv_filt <- pathway_surv[abs(as.numeric(pathway_surv$HR)) <= 10, ]
  if (nrow(pathway_surv_filt) > 0) {
    p04d <- sig_forest(pathway_surv_filt, signature = "ID", n = 20)
    ggsave("04-figs/Fig04d-forest_go_kegg.png", p04d, width = 10, height = 6, units = "in", dpi = 300)
    ggsave("04-figs/Fig04d-forest_go_kegg.pdf", p04d, width = 10, height = 6)
  }
}

Key notes:

MUST filter abs(HR) > 10 before sig_forest() — extreme HR values (e.g., 0.001–190) cause sig_forest() to produce blank plots because its internal coord_trans(x = "log2") only has 3 fixed axis breaks.
Save full (unfiltered) survival results to 04-figs/data/ — filtering is only for visualization.
Split into 4 groups: CIBERSORT, MCPcounter+EPIC+xCell, TME Signatures, GO/KEGG/Hallmark.

Fig 05: TME Cell Correlation Matrix Heatmap

Input: CIBERSORT + MCPcounter + ESTIMATE variables. Use pheatmap on Spearman correlation matrix.

load("04-figs/data/06-cor_matrix_full.RData")

# Deduplicate: keep .x (cibersort), drop .y (cibersort_abs)
keep <- !grepl("\\.[xy]$", rownames(cor_mat)) | grepl("\\.x$", rownames(cor_mat))
cor_mat <- cor_mat[keep, keep]
rownames(cor_mat) <- sub("\\.x$", "", rownames(cor_mat))
colnames(cor_mat) <- sub("\\.x$", "", colnames(cor_mat))

for (fmt in c("pdf", "png")) {
  pheatmap::pheatmap(cor_mat, fontsize = 6, fontsize_row = 5, fontsize_col = 5,
                      color = colorRampPalette(c("#2166AC", "white", "#B2182B"))(100),
                      border_color = NA, main = "TME Cell Correlation Matrix",
                      filename = paste0("04-figs/Fig05-cor_matrix.", fmt),
                      width = 8, height = 7, dpi = 300)
}

Fig 06a-d: TME Heatmaps BY Subtype + KM Plot

Heatmap dimension rules (MUST follow):

Element	Position	Parameter
Variables (cell types)	Rows (y-axis)	`rownames(mat)` — shows readable names
Samples	Columns (x-axis)	`show_colnames=FALSE`
Subtype color bar	Top (column annotation)	`annotation_col`
Subtype gaps	Column gaps	`gaps_col`
Clustering	Rows only	`cluster_rows=TRUE, cluster_cols=FALSE`

plot_subtype_heatmap <- function(data, cols, subtype_col, title, prefix,
                                  width = 8, height = 6) {
  data <- data[order(data[[subtype_col]]), ]
  mat <- t(as.matrix(data[, cols]))  # variables -> rows, samples -> columns
  mat[mat > 2] <- 2; mat[mat < -2] <- -2
  rownames(mat) <- sub("\\.[xy]$", "", rownames(mat))
  colnames(mat) <- data$ID

  ann_col <- data.frame(Subtype = data[[subtype_col]], row.names = data$ID)
  subtypes <- sort(unique(data[[subtype_col]]))
  sub_colors <- c("#E64B35", "#4DBBD5", "#00A087", "#3C5488", "#F39B7F", "#8491B6")
  ann_colors <- list(Subtype = setNames(sub_colors[seq_along(subtypes)], subtypes))
  gaps <- cumsum(table(data[[subtype_col]]))[-length(subtypes)]

  for (fmt in c("pdf", "png")) {
    pheatmap::pheatmap(mat, annotation_col = ann_col, annotation_colors = ann_colors,
      color = colorRampPalette(c("#2166AC", "white", "#B2182B"))(100),
      show_colnames = FALSE, cluster_rows = TRUE, cluster_cols = FALSE,
      treeheight_row = 15, gaps_col = gaps, fontsize_row = 8,
      main = title,
      filename = paste0("04-figs/", prefix, ".", fmt), width = width, height = height, dpi = 300)
  }
}

# Fig06a: CIBERSORT by subtype
hm_data_cb <- tme_scaled[, c("ID", cb_sc, "TME_subtype")]
plot_subtype_heatmap(hm_data_cb, cb_sc, "TME_subtype",
  "CIBERSORT TME by Subtype", "Fig06a-heatmap_cibersort_subtype")

# Fig06b: MCPcounter by subtype
hm_data_mcp <- tme_scaled[, c("ID", mcp_sc, "TME_subtype")]
plot_subtype_heatmap(hm_data_mcp, mcp_sc, "TME_subtype",
  "MCPcounter TME by Subtype", "Fig06b-heatmap_mcpcounter_subtype", width = 6, height = 5)

# Fig06c: xCell by subtype
xc_sc_sub <- intersect(xc_sc, colnames(tme_scaled))
hm_data_xc <- tme_scaled[, c("ID", xc_sc_sub, "TME_subtype")]
plot_subtype_heatmap(hm_data_xc, xc_sc_sub, "TME_subtype",
  "xCell TME by Subtype", "Fig06c-heatmap_xcell_subtype", width = 10, height = 8)

# Fig06d: KM plot by subtype (IF survival data exists)
# sig_surv_plot does NOT work with arbitrary categorical variables.
# Use survminer directly for KM by TME subtype.
if (length(time_cols) > 0 && length(status_cols) > 0) {
  library(survival)
  library(survminer)
  surv_data <- read.csv("04-figs/data/01-tme_pdata_merged.csv")
  dir.create("04-figs/kmplot", showWarnings = FALSE, recursive = TRUE)

  surv_obj <- Surv(time = surv_data[[time_col]], event = surv_data[[status_col]])
  km_fit <- survfit(surv_obj ~ TME_subtype, data = surv_data)
  p_km <- ggsurvplot(km_fit, data = surv_data,
    pval = TRUE, conf.int = FALSE,
    palette = c("#E64B35", "#4DBBD5", "#00A087", "#3C5488", "#F39B7F", "#8491B6"),
    legend.title = "TME Subtype", legend.labs = sort(unique(surv_data$TME_subtype)),
    xlab = "Time (months)", ylab = "Overall Survival",
    risk.table = TRUE, fontsize = 3, surv.median.line = "hv")

  pdf("04-figs/Fig06d-km_subtype.pdf", width = 7, height = 6)
  print(p_km)
  dev.off()
  png("04-figs/Fig06d-km_subtype.png", width = 7, height = 6, units = "in", res = 300)
  print(p_km)
  dev.off()
}

pheatmap notes:

Do NOT re-scale — data is already z-scored in Phase 4. Just cap at ±2.
treeheight_row = 15 — shrink clustering tree.
Must transpose — pheatmap needs variables as rows for readable y-axis labels.
cluster_cols = FALSE — columns ordered by subtype, clustering would destroy ordering.
CRITICAL: Do NOT use grep("CIBERSORT") in merged data — cibersort + cibersort_abs share cell names. Read from original cibersort CSV instead.

Fig 07a-e: Most Significant Positive Variable Per Subtype (sig_box + patchwork)

For each method group, find the most significant and positive variable for each TME subtype, create one sig_box per subtype, and combine with patchwork.

library(patchwork)

find_top_positive_per_subtype <- function(data, cell_cols, subtype_col = "TME_subtype") {
  subtypes <- sort(unique(data[[subtype_col]]))
  result <- character(0)
  for (st in subtypes) {
    st_data <- data[data[[subtype_col]] == st, cell_cols, drop = FALSE]
    cell_means <- colMeans(st_data, na.rm = TRUE)
    result[st] <- names(sort(cell_means, decreasing = TRUE))[1]
  }
  return(result)
}

make_subtype_boxplot_panel <- function(data, method_cols, subtype_col, title, filename) {
  top_cells <- find_top_positive_per_subtype(data, method_cols, subtype_col)
  subtypes <- names(top_cells)

  box_list <- lapply(subtypes, function(st) {
    col <- top_cells[st]
    clean_name <- sub("\\.[xy]$", "", col)
    sig_box(data = data, signature = col, variable = subtype_col,
            palette = "paired1", show_pairwise_p = TRUE,
            angle_x_text = 60, hjust = 1,
            size_of_pvalue = 3, size_of_font = 5) +
      ggtitle(paste0(st, ": ", clean_name)) +
      theme(axis.title = element_text(size = 6),
            axis.text.y = element_text(size = 5),
            plot.title = element_text(size = 7),
            plot.margin = margin(12, 5, 5, 5))
  })

  p <- wrap_plots(box_list, ncol = length(subtypes))
  ggsave(paste0("04-figs/", filename, ".png"), p,
         width = 50 * length(subtypes), height = 100, units = "mm", dpi = 300)
  ggsave(paste0("04-figs/", filename, ".pdf"), p,
         width = 50 * length(subtypes), height = 100, units = "mm")
}

# Fig07a: CIBERSORT
make_subtype_boxplot_panel(tme_scaled, cb_sc, "TME_subtype",
                           "CIBERSORT Subtype Boxplots", "Fig07a-boxplot_cibersort_subtype")

# Fig07b: MCPcounter
make_subtype_boxplot_panel(tme_scaled, mcp_sc, "TME_subtype",
                           "MCPcounter Subtype Boxplots", "Fig07b-boxplot_mcpcounter_subtype")

# Fig07c: xCell
make_subtype_boxplot_panel(tme_scaled, xc_sc, "TME_subtype",
                           "xCell Subtype Boxplots", "Fig07c-boxplot_xcell_subtype")

# Fig07d: TME signatures
# Identify TME signature columns from Phase 3 output
tme_sig_sc <- colnames(tme_scaled)[colnames(tme_scaled) %in% sig_tme_names]
if (length(tme_sig_sc) > 0) {
  make_subtype_boxplot_panel(tme_scaled, tme_sig_sc, "TME_subtype",
                             "TME Signatures Subtype Boxplots", "Fig07d-boxplot_tme_signature_subtype")
}

# Fig07e: GO/KEGG/Hallmark
pathway_sc <- grep("^HALLMARK_|^KEGG_|^GO_|^REACTOME_", colnames(tme_scaled), value = TRUE)
if (length(pathway_sc) > 0) {
  make_subtype_boxplot_panel(tme_scaled, pathway_sc, "TME_subtype",
                             "Pathway Subtype Boxplots", "Fig07e-boxplot_pathway_subtype")
}

Fig 08a-b: Top 10 Differential Boxplots (IF categorical variables exist)

For each binary variable: batch_wilcoxon() → top 10 → sig_box + patchwork (2×5). For each multi-level categorical variable: batch_kruskal() → top 10 → sig_box + patchwork (2×5).

pdata_summary <- read.csv("02-input/pdata_summary.csv")
cat_vars <- pdata_summary$var[pdata_summary$type %in% c("factor", "binary")]
fig_idx <- 0

if (length(cat_vars) > 0) {
  # Collect ALL TME variables (deconvolution + signatures + pathways)
  # MUST filter to NUMERIC columns only — batch_wilcoxon/batch_kruskal require numeric features
  pdata_col_set <- colnames(pdata)
  all_tme_vars <- setdiff(colnames(tme_pdata), pdata_col_set)
  all_tme_vars <- setdiff(all_tme_vars, "ID")
  # Keep only columns that are numeric
  all_tme_vars <- all_tme_vars[sapply(tme_pdata[, all_tme_vars], is.numeric)]

  for (v in cat_vars) {
    lvls <- unique(tme_pdata[[v]])
    lvls <- lvls[!is.na(lvls)]

    if (length(lvls) >= 2) {
      fig_idx <- fig_idx + 1
      fig_letter <- letters[fig_idx]  # a, b, c, ...

      if (length(lvls) == 2) {
        # Binary: Wilcoxon
        diff_res <- batch_wilcoxon(data = tme_pdata, target = v, feature = all_tme_vars)
      } else {
        # Multi-level: Kruskal
        diff_res <- batch_kruskal(data = tme_pdata, target = v, feature = all_tme_vars)
      }

      diff_res <- diff_res[order(diff_res$p.value), ]
      top10 <- head(diff_res, 10)
      write.csv(diff_res, file = paste0("04-figs/data/04-diff_", v, ".csv"), row.names = FALSE)

      if (nrow(top10) > 0) {
        box_list <- lapply(seq_len(nrow(top10)), function(i) {
          sig_box(data = tme_pdata, signature = top10$sig_names[i],
                  variable = v, palette = "paired1", show_pairwise_p = TRUE,
                  angle_x_text = 60, hjust = 1,
                  size_of_pvalue = 3, size_of_font = 5) +
            theme(axis.title = element_text(size = 5),
                  axis.text.y = element_text(size = 5),
                  plot.title = element_text(size = 5),
                  plot.margin = margin(12, 5, 5, 5))
        })

        p_combined <- wrap_plots(box_list, ncol = 5, nrow = 2)
        ggsave(paste0("04-figs/Fig08", fig_letter, "-top10_", v, ".png"), p_combined,
               width = 250, height = 220, units = "mm", dpi = 300)
        ggsave(paste0("04-figs/Fig08", fig_letter, "-top10_", v, ".pdf"), p_combined,
               width = 250, height = 220, units = "mm")
      }
    }
  }
}

Key notes:

Binary variable (2 levels): use batch_wilcoxon().
Multi-level variable (≥3 levels): use batch_kruskal().
Both return tibble with columns: sig_names, p.value, statistic, p.adj, log10pvalue, stars.
One figure per categorical variable: Fig08a for variable1, Fig08b for variable2, etc.

Fig 09: 8 KM Plots (IF survival data exists)

Select most positive (lowest HR, protective) and most negative (highest HR, risk) variable from each of 4 groups:

CIBERSORT → a (most positive), b (most negative)
Other TME methods → c, d
TME signatures → e, f
GO/KEGG/Hallmark → g, h

Total: up to 8 KM plots saved to 04-figs/kmplot/.

if (length(time_cols) > 0 && length(status_cols) > 0) {
  dir.create("04-figs/kmplot", showWarnings = FALSE, recursive = TRUE)

  find_extreme_hr <- function(surv_res, group_cols) {
    if (is.null(surv_res) || nrow(surv_res) == 0) return(character(0))
    group_res <- surv_res[surv_res$ID %in% group_cols, ]
    if (nrow(group_res) == 0) return(character(0))
    group_res <- group_res[complete.cases(group_res[, c("HR", "P")]), ]
    if (nrow(group_res) == 0) return(character(0))
    # Most positive (lowest HR = protective), most negative (highest HR = risk)
    most_pos <- group_res$ID[which.min(group_res$HR)]
    most_neg <- group_res$ID[which.max(group_res$HR)]
    return(c(most_pos, most_neg))
  }

  # Collect all survival results
  surv_all <- rbind(cb_surv, other_surv, tme_sig_surv, pathway_surv)
  km_vars <- character(0)

  # Group 1: CIBERSORT
  km_vars <- c(km_vars, find_extreme_hr(surv_all, cb_sc))
  # Group 2: Other TME methods
  other_sc <- c(mcp_sc, epic_sc, xc_sc)
  km_vars <- c(km_vars, find_extreme_hr(surv_all, other_sc))
  # Group 3: TME signatures
  km_vars <- c(km_vars, find_extreme_hr(surv_all, tme_sig_cols))
  # Group 4: GO/KEGG/Hallmark
  km_vars <- c(km_vars, find_extreme_hr(surv_all, pathway_cols))

  km_vars <- unique(km_vars[nchar(km_vars) > 0])

  surv_data <- read.csv("04-figs/data/01-tme_pdata_merged.csv")

  for (i in seq_along(km_vars)) {
    sig_surv_plot(input_pdata = surv_data, signature = km_vars[i],
                  time = time_col, status = status_col, time_type = "month",
                  palette = "jama", mini_sig = "score",
                  fig.type = "pdf", save_path = "04-figs/kmplot",
                  project = paste0("KM_", i), index = i)
  }
}

Key notes:

sig_surv_plot() saves its own output — Do NOT wrap with ggsave().
KM plots saved to 04-figs/kmplot/ — each variable generates PDF + RData.
mini_sig = "score" — continuous variable, auto-split by median.
time_type: "day", "month", or "year".

Clean up stray Rplots.pdf (generated by some IOBR internal plot calls)

Suppress with `pdf(NULL)` at top of visualization script

if (file.exists("Rplots.pdf")) file.remove("Rplots.pdf")

Additional Visualization Functions

Use these as needed based on analysis requirements:

Function	Purpose	Output
`roc_time()`	Time-dependent ROC with AUC at multiple time points	Figure (self-saving)
`sig_roc()`	Multiple ROC curves for binary response prediction	Figure
`batch_sig_surv_plot()`	Batch KM curves for multiple signatures across projects	Figures (self-saving)
`surv_group()`	KM curve with risk table (High vs Low)	Figure
`subgroup_survival()`	Subgroup Cox analysis by covariates	Statistics table
`get_cor()`	Single pairwise correlation with scatter plot + regression line	Figure
`iobr_cor_plot()`	Batch correlation visualization (signature vs signature/phenotype)	Figure
`iobr_pca()`	PCA dimensionality reduction with scatter plot	Figure
`iobr_deg()`	DEG with built-in volcano plot and heatmap output	Statistics + figures
`sig_gsea()`	Gene Set Enrichment Analysis via fgsea, with built-in plots	Statistics + figures
`LR_cal()`	Ligand-receptor interaction analysis	Statistics + figures

# Time-dependent ROC — saves its own output
roc_time(input = tme_pdata, vars = c("T_cells_CD8_CIBERSORT", "TMEscoreA_CIR"),
         time = "OS_time", status = "OS_status",
         time_point = c(12, 36, 60), time_type = "month", palette = "jama",
         path = "04-figs", fig.type = "png", index = 1)

# PCA visualization
iobr_pca(data = eset, pdata = pdata, group = "TME_subtype", scale = TRUE)

# Single correlation scatter plot
get_cor(data = tme_pdata, variable1 = "T_cells_CD8_CIBERSORT",
        variable2 = "CD_8_T_effector", method = "spearman")

IOBR Palettes

Categorical (palette_group parameter):

Name	Best For
`"jama"`	Default, 2–5 groups
`"jco"`	Clinical oncology
`"nrc"`	Nature publications
`"aaas"`	Science journal
`"accent"`	Colorblind-friendly
`"set2"`	Many groups

Heatmap (palette parameter, numeric 1–6):

#	Style
2	Blue-white-red diverging (default)
1	Classic pheatmap
3	Purple-orange
4–6	Sequential gradients

Cell bar plot (cell_bar_plot() palette parameter, numeric 1–4):

#	Style
4	Recommended — rich random palette with many distinct colors
1–3	Alternative random palettes

If unsure which palette, present 3 options to the user and ask.

User Decision Points

Pause and ask at:

Phase 0: Review and approve analysis plan (IOBR-pipeline.md)
Phase 1: Data type (Count/TPM/Array) and species (hsa/mmu)
Phase 2: Which deconvolution method(s)
Phase 3: Scoring method and which signature collection
Phase 5: Color palette preference

Phase 6: Wrap-Up Note (MANDATORY — execute after ALL other phases complete)

Do NOT skip this phase. After Phase 5 visualization finishes and all log files are verified, you MUST:

List actual files in 04-figs/, 03-tme/, 02-input/, and 04-figs/data/ to build the tree from reality
Write 05-note/IOBR-analysis-README.md with:
- IOBR citation header at top
- ASCII tree diagram showing ACTUAL outputs (not template placeholders)
- Brief interpretation of key findings (deconvolution method count, subtype count, top survival hits, available figures)
Count figures, tables, and log files — report these numbers in the summary

Key Findings to Extract (auto-detect from output files)

Before writing the note, scan the output:

# Count figures generated
echo "=== Figures ===" && ls 04-figs/Fig*.png | wc -l

# Deconvolution methods succeeded
echo "=== TME CSVs ===" && ls 03-tme/tme_*.csv

# Survival data availability
echo "=== Survival ===" && ls 04-figs/data/05-surv_*.csv 2>/dev/null || echo "No survival data"

# Subtype info
echo "=== Subtypes ===" && head -5 04-figs/data/03-tme_subtype.csv

# Log file check
echo "=== Logs ===" && wc -l 06-log/*.log

Template for `05-note/IOBR-analysis-README.md`

# IOBR Citation:
# Zeng DQ, Fang YR, … , Yu GC, Liao WJ.
# Enhancing Immuno-Oncology Investigations Through Multidimensional Decoding of
# Tumour Microenvironment with IOBR 2.0, Cell Reports Methods, 2024
# https://doi.org/10.1016/j.crmeth.2024.100910

# <Project Name> — TME Analysis Summary

## Analysis Summary

- Dataset: <N> samples, <data_type>
- Deconvolution: <methods used, comma-separated>
- Signature scoring: <method>
- TME subtypes: <N> subtypes identified (K-means on CIBERSORT)

## Key Findings

<3-5 bullet points summarizing the most important results>

## Output Directory Tree

project/ ├── 01-script/ │ ├── 01-data_preprocessing.R │ ├── 02-tme_deconvolution.R │ ├── 03-signature_analysis.R │ ├── 04-statistical_analysis.R │ └── 05-visualization.R ├── 02-input/ │ ├── annotated_eset.csv │ ├── pdata.csv │ └── pdata_summary.csv ├── 03-tme/ │ ├── tme_cibersort.csv # CIBERSORT: 22 cell types │ ├── tme_mcpcounter.csv # MCPcounter: 8 cell populations │ ├── tme_estimate.csv # ESTIMATE: 4 scores │ ├── sig_score_tme.csv # TME signature scores (ssGSEA) │ └── tme_sig_combine.csv # Merged master table ├── 04-figs/ │ ├── Fig01-barplot_cibersort.png/pdf │ ├── Fig02-barplot_epic.png/pdf │ ├── Fig03a-heatmap_cibersort.png/pdf │ ├── Fig03b-heatmap_mcpcounter.png/pdf │ ├── Fig03c-heatmap_xcell.png/pdf │ ├── Fig04a-forest_cibersort.png/pdf # (IF survival) │ ├── Fig04b-forest_other_tme.png/pdf # (IF survival) │ ├── Fig04c-forest_tme_signature.png/pdf # (IF survival) │ ├── Fig04d-forest_go_kegg.png/pdf # (IF survival) │ ├── Fig05-cor_matrix.png/pdf │ ├── Fig06a-heatmap_cibersort_subtype.png/pdf │ ├── Fig06b-heatmap_mcpcounter_subtype.png/pdf │ ├── Fig06c-heatmap_xcell_subtype.png/pdf │ ├── Fig06d-km_subtype.png/pdf # (IF survival) │ ├── Fig07a-boxplot_cibersort_subtype.png/pdf │ ├── Fig07b-boxplot_mcpcounter_subtype.png/pdf │ ├── Fig07c-boxplot_xcell_subtype.png/pdf │ ├── Fig07d-boxplot_tme_signature_subtype.png/pdf │ ├── Fig07e-boxplot_pathway_subtype.png/pdf │ ├── Fig08a-top10_.png/pdf # (IF categorical) │ ├── Fig08b-top10_.png/pdf # (IF categorical) │ ├── kmplot/ # KM plots (IF survival) │ └── data/ # Statistical result tables ├── 05-note/ │ ├── IOBR-pipeline.md │ └── IOBR-analysis-README.md └── 06-log/ ├── 01-data_preprocessing.log ├── 02-tme_deconvolution.log ├── 03-signature_analysis.log ├── 04-statistical_analysis.log └── 05-visualization.log

After writing, present a one-paragraph summary to the user listing figure count, log status, and key findings.

06-log Directory (MANDATORY)

Every R script MUST be executed with log capture. No exceptions.

One log file per script, named to match: 01-data_preprocessing.log, 02-tme_deconvolution.log, ...
Each log MUST begin with the IOBR citation header (same as scripts).
Logs capture both stdout and stderr for debugging and reproducibility.
After running all scripts, verify: ls -la 06-log/ should show 5 non-empty log files.
If a script fails, check the log file for the error message.

Run EVERY script with this pattern:

mkdir -p 06-log

# Phase 1
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/01-data_preprocessing.R; } > 06-log/01-data_preprocessing.log 2>&1

# Phase 2
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/02-tme_deconvolution.R; } > 06-log/02-tme_deconvolution.log 2>&1

# Phase 3
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/03-signature_analysis.R; } > 06-log/03-signature_analysis.log 2>&1

# Phase 4
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/04-statistical_analysis.R; } > 06-log/04-statistical_analysis.log 2>&1

# Phase 5
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/05-visualization.R; } > 06-log/05-visualization.log 2>&1

Quick Start Checklist

References (in this skill directory)

references/functions.md — IOBR function parameter reference
references/palettes.md — Color palette selection guide
references/iobr_built_in_data.md — Complete catalog of built-in signatures, annotations, and datasets
references/iobr_pipeline_template.R — Full production pipeline template (9-method deconvolution + signature + pathway scoring)

Read these when you need specific parameter details, palette options, want to know what signatures are available, or need a complete pipeline template to adapt.

Uh oh!

FilesExpand file tree

SKILL.md

Latest commit

History

SKILL.md

File metadata and controls

IOBRskill — Bulk Transcriptome Tumor Microenvironment Analysis

IOBR Citation (MUST include in every R script and log)

Prerequisites

Environment Check (first step)

Required Dependencies

IOBR Built-in Data

On-demand pathway data via load_data()

Project Directory Structure

Analysis Pipeline

Phase 0: Plan Generation

Phase 1: Data Preprocessing and Annotation

Confirm with the user:

Sample matching (if both eset and pdata exist):

pdata variable summary:

Key functions:

Phase 2: TME Deconvolution

Single-method usage:

Full 9-method deconvolution + merge (production workflow):

Critical caveats:

Phase 3: Signature Scoring and Pathway Analysis

Scoring methods (ask user to choose):

Step 3a: TME / IO biomarker signature scoring:

Step 3b: Pathway scoring (Hallmark + GO + KEGG + Reactome):

Step 3c: Merge all results into master table:

Key notes:

TME subtype clustering:

Phase 4: Statistical Analysis

4a. Data Preparation

4b. TME Subtyping

4c. Statistical Testing (conditional)

4d. Correlation

4e. Other Statistical Functions

Phase 5: Visualization

sig_box Standard Parameters (MUST use for ALL boxplots)

Fig 01: CIBERSORT Barplot (22 cell types)

Fig 02: EPIC Barplot (8 cell types)

Fig 03a-c: TME Heatmaps WITHOUT Subtype (z-score)

Fig 04a-d: Survival Forest Plots (IF survival data exists)

Fig 05: TME Cell Correlation Matrix Heatmap

Fig 06a-d: TME Heatmaps BY Subtype + KM Plot

Fig 07a-e: Most Significant Positive Variable Per Subtype (sig_box + patchwork)

Fig 08a-b: Top 10 Differential Boxplots (IF categorical variables exist)

Fig 09: 8 KM Plots (IF survival data exists)

Clean up stray Rplots.pdf (generated by some IOBR internal plot calls)

Suppress with pdf(NULL) at top of visualization script

Additional Visualization Functions

IOBR Palettes

User Decision Points

Phase 6: Wrap-Up Note (MANDATORY — execute after ALL other phases complete)

Key Findings to Extract (auto-detect from output files)

Template for 05-note/IOBR-analysis-README.md

06-log Directory (MANDATORY)

Quick Start Checklist

References (in this skill directory)

On-demand pathway data via `load_data()`

Suppress with `pdf(NULL)` at top of visualization script

Template for `05-note/IOBR-analysis-README.md`