Skip to content

Latest commit

 

History

History
1311 lines (1051 loc) · 55.4 KB

File metadata and controls

1311 lines (1051 loc) · 55.4 KB
name IOBRskill
description End-to-end bulk RNA-seq / microarray tumor microenvironment (TME) analysis using the IOBR R package. Covers data preprocessing, gene annotation, TME deconvolution (CIBERSORT/MCPcounter/EPIC/xCell/ESTIMATE/TIMER/quanTIseq/IPS), signature scoring (PCA/ssGSEA/Z-score), ligand-receptor analysis, pathway enrichment, survival modeling, and Nature-style visualization. Use this skill when the user mentions: tumor microenvironment analysis, TME deconvolution, immune infiltration analysis, 肿瘤微环境分析, 肿瘤微环境解析, 免疫浸润分析, ligand-receptor analysis, 受体配体分析, IOBR, immune cell deconvolution, CIBERSORT, MCPcounter, ESTIMATE, or any bulk RNA-seq immuno-oncology workflow. Also trigger on /IOBRskill.

IOBRskill — Bulk Transcriptome Tumor Microenvironment Analysis

IOBR Citation (MUST include in every R script and log)

Every R script generated by this skill MUST begin with this citation header:

###############################################################################
# IOBR Citation:
# Zeng DQ, Fang YR, … , Yu GC, Liao WJ.
# Enhancing Immuno-Oncology Investigations Through Multidimensional Decoding of
# Tumour Microenvironment with IOBR 2.0, Cell Reports Methods, 2024
# https://doi.org/10.1016/j.crmeth.2024.100910
###############################################################################

The same citation header MUST appear at the top of every log file in 06-log/.


Prerequisites

Environment Check (first step)

Verify IOBR is installed in the base conda environment:

source /home/bio/miniconda3/etc/profile.d/conda.sh
conda activate base
Rscript -e 'packageVersion("IOBR")'

If IOBR is not installed:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IOBR/IOBR")

Required Dependencies

IOBR relies on several packages that are NOT auto-installed. Install these before starting:

# For CIBERSORT
BiocManager::install("preprocessCore")

# For quanTIseq deconvolution
install.packages("limSolve", repos = "https://cloud.r-project.org")

# For tme_cluster()
install.packages("NbClust", repos = "https://cloud.r-project.org")

# For iobr_pca()
install.packages("FactoMineR", repos = "https://cloud.r-project.org")

# For sig_heatmap() — only if using tidyHeatmap approach (pheatmap preferred)
# install.packages("tidyHeatmap", repos = "https://cloud.r-project.org")

IOBR Built-in Data

IOBR ships with rich built-in data — you rarely need external annotation files. For full details, read references/iobr_built_in_data.md. Key resources:

Resource Access What It Contains
Gene length / ID mapping count2tpm() auto-loads 111 genes with HUGO, ENTREZID, ENSEMBL
Microarray probe sets extdata/probesets.txt 176 probes mapped to cell types
TME signatures (186) IOBR:::signature_tme Immune, stromal, EMT, checkpoint signatures
Full collection (323) data("signature_collection") All signatures + Hallmark + Metabolism
Signature groupings data("sig_group") 35+ categories for filtering signatures
Pathway data (on-demand) IOBR::load_data() hallmark (50), go_bp (7658), kegg (186), reactome (1615)
Example datasets data("pdata_stad") etc. TCGA-STAD, IMvigor210 toy data

On-demand pathway data via load_data()

IOBR provides IOBR::load_data() to download and cache pathway gene sets from GitHub on first use. These are essential for comprehensive pathway scoring:

hallmark  <- IOBR::load_data("hallmark")   # 50 pathways
go_bp     <- IOBR::load_data("go_bp")      # 7,658 gene sets
go_cc     <- IOBR::load_data("go_cc")
go_mf     <- IOBR::load_data("go_mf")
kegg      <- IOBR::load_data("kegg")       # 186 pathways
reactome  <- IOBR::load_data("reactome")   # 1,615 pathways

Project Directory Structure

Create this structure before starting analysis:

project/
├── 01-script/         # R scripts (numbered, each with IOBR citation header)
├── 02-input/          # Normalized and annotated expression matrices
│   └── pdata_summary.csv          # Phenotype variable summary
├── 03-tme/            # TME deconvolution and signature scoring results
├── 04-figs/
│   ├── *.png/pdf      # All figures (PNG 300dpi + PDF, numbered naming)
│   ├── kmplot/        # KM survival plots (auto-saved by sig_surv_plot)
│   └── data/          # Statistical result tables matching figures
├── 05-note/
│   ├── IOBR-pipeline.md           # Analysis plan (generated in Phase 0)
│   └── IOBR-analysis-README.md    # Final analysis summary
└── 06-log/            # R execution logs (one per script, with citation header)

Naming conventions:

  • Scripts: 01-data_preprocessing.R, 02-tme_deconvolution.R, ...
  • Figures: Fig01-barplot_cibersort.png + Fig01-barplot_cibersort.pdf (always dual format)
  • Statistical tables: 04-figs/data/01-tme_pdata_merged.csv, 04-figs/data/04-wilcoxon_results.csv, ...
  • Logs: 01-data_preprocessing.log, 02-tme_deconvolution.log, ...
  • Note: IOBR-analysis-README.md (tree diagram + brief interpretation)

When running a script, capture the log:

Rscript 01-script/01-data_preprocessing.R > 06-log/01-data_preprocessing.log 2>&1

Analysis Pipeline

6 sequential phases (Phase 0–5). Pause at decision points to let the user choose.

Phase 0: Plan Generation

Before writing any analysis code, generate an analysis plan document.

After confirming data type / species / deconvolution methods with the user (Phase 1 questions), create 05-note/IOBR-pipeline.md with:

  1. Project metadata: project name, data source, data type, species
  2. ASCII tree diagram of all scripts, their inputs/outputs, and expected results:
01-data_preprocessing.R
  ├─ Input:  raw expression matrix + phenotype data
  ├─ Output: 02-input/annotated_eset.csv, 02-input/pdata.csv
  └─ Notes:  log2 transformation, gene ID conversion

02-tme_deconvolution.R
  ├─ Input:  02-input/annotated_eset.csv
  ├─ Output: 03-tme/tme_cibersort.csv, 03-tme/tme_mcpcounter.csv, ...
  └─ Notes:  CIBERSORT + MCPcounter + ESTIMATE (linear-scale data)

03-signature_analysis.R
  ├─ Input:  02-input/annotated_eset.csv
  ├─ Output: 03-tme/sig_score_tme.csv, 03-tme/sig_score_pathway.RData
  └─ Notes:  ssGSEA for TME signatures, Hallmark + KEGG pathways

04-statistical_analysis.R
  ├─ Input:  03-tme/*.csv + 02-input/pdata.csv
  ├─ Output: 04-figs/data/01-tme_pdata_merged.csv
  │          04-figs/data/02-tme_scaled.csv
  │          04-figs/data/03-tme_subtype.csv
  │          04-figs/data/04-wilcoxon_results.csv
  │          04-figs/data/05-surv_results.csv
  │          04-figs/data/06-cor_matrix.csv
  └─ Notes:  merge → scale → cluster → wilcoxon → survival → correlation

05-visualization.R
  ├─ Input:  04-figs/data/*.csv + 03-tme/*.csv
  ├─ Output: 04-figs/Fig01-09 (PNG + PDF)
  └─ Notes:  barplots → heatmaps → correlation → boxplots → forest → KM
  1. Present to user: "开始分析" or "调整计划"
  2. Only after approval, begin Phase 1

Phase 1: Data Preprocessing and Annotation

Script: 01-data_preprocessing.R

Confirm with the user:

  • Data source: User matrix, GEO (GSE#####), or TCGA
  • Data type: "Count" (raw RNA-seq), "Tpm" (already normalized), or "Array" (microarray)
  • Species: "hsa" (human) or "mmu" (mouse)

Sample matching (if both eset and pdata exist):

Before any analysis, verify sample overlap between expression matrix and phenotype data:

# Report sample counts
eset_samples  <- colnames(eset)
pdata_samples <- pdata$ID   # or rownames(pdata)
intersection  <- intersect(eset_samples, pdata_samples)

cat("Expression matrix samples:", length(eset_samples), "\n")
cat("Phenotype data samples:",    length(pdata_samples), "\n")
cat("Matched samples:",           length(intersection), "\n")

# Ask user: use all samples, intersection only, or check data?
# Then filter both to agreed sample set
eset  <- eset[, intersection]
pdata <- pdata[pdata$ID %in% intersection, ]

pdata variable summary:

After loading pdata, generate a summary report:

# Generate pdata_summary.csv
pdata_summary <- data.frame(
  var       = character(),
  type      = character(),
  n_levels  = integer(),
  levels    = character(),
  stringsAsFactors = FALSE
)

for (col in colnames(pdata)) {
  vals <- pdata[[col]]
  if (is.numeric(vals)) {
    tp <- ifelse(length(unique(vals)) <= 2, "binary", "numeric")
    lvls <- paste(range(vals, na.rm = TRUE), collapse = " - ")
    n_lv <- length(unique(vals))
  } else {
    tp <- "factor"
    lvls <- paste(unique(vals)[1:min(10, length(unique(vals)))], collapse = ", ")
    n_lv <- length(unique(vals))
  }
  pdata_summary <- rbind(pdata_summary, data.frame(
    var = col, type = tp, n_levels = n_lv, levels = lvls
  ))
}

write.csv(pdata_summary, file = "02-input/pdata_summary.csv", row.names = FALSE)
cat(">>>>> pdata summary saved to 02-input/pdata_summary.csv\n")

This summary helps decide which statistical tests to apply in Phase 4.

Key functions:

Function Purpose When to Use
count2tpm() Count → TPM + gene ID mapping Raw RNA-seq counts
log2eset() Auto-detect and apply log2 transform Ensures log2 scale for downstream analysis
anno_eset() Probe → gene symbol annotation Microarray data or unannotated matrices
remove_batcheffect() ComBat / ComBat-seq batch correction Multi-batch datasets
find_outlier_samples() WGCNA-based outlier detection QC step
iobr_pca() PCA visualization for batch assessment QC step
mouse2human_eset() Mouse → human gene ortholog conversion Mouse data
# RNA-seq: counts → TPM
eset <- count2tpm(countMat = expr_matrix, idType = "Ensembl", org = "hsa")

# Microarray: probe annotation
eset <- anno_eset(eset = expr_matrix, annotation = anno_df,
                  symbol = "gene_symbol", probe = "probe_id", method = "mean")

# Ensure log2 scale (auto-detects — only transforms if not already log2)
eset <- log2eset(eset)

Save output:

write.csv(eset,  file = "02-input/annotated_eset.csv")
write.csv(pdata, file = "02-input/pdata.csv")

Phase 2: TME Deconvolution

Script: 02-tme_deconvolution.R

Ask the user which method(s) to use.

Method Cell Types Best For
"cibersort" 22 (LM22) Gold standard, comprehensive
"cibersort_abs" 22 (LM22) CIBERSORT absolute mode (fraction-independent)
"mcpcounter" 8 populations Quick, stromal + immune
"epic" 8 incl. cancer cells Tumor purity
"xcell" 64 types Broadest coverage (slow)
"estimate" 4 scores Stromal/Immune/Tumor purity
"timer" 6 types TCGA cancer-specific
"quantiseq" 10 incl. M1/M2 Macrophage polarization
"ips" 4 axes Immunotherapy response

Recommended: CIBERSORT + MCPcounter + ESTIMATE (most cited). For comprehensive analysis, run all 9 methods and merge.

Single-method usage:

tme_result <- deconvo_tme(eset = eset, method = "cibersort", arrays = FALSE, perm = 1000)
write.csv(tme_result, file = "03-tme/tme_cibersort.csv")

Full 9-method deconvolution + merge (production workflow):

# Each method may take 1-5 minutes depending on sample size
# Use tryCatch for each method — some may fail on certain datasets
cibersort <- tryCatch(deconvo_tme(eset = eset, method = "cibersort",
                                   arrays = FALSE, perm = 1000), error = function(e) NULL)
cibersort_abs <- tryCatch(deconvo_tme(eset = eset, method = "cibersort_abs",
                                       arrays = FALSE, perm = 1000), error = function(e) NULL)
epic      <- tryCatch(deconvo_tme(eset = eset, method = "epic",
                                   arrays = FALSE), error = function(e) NULL)
mcp       <- tryCatch(deconvo_tme(eset = eset, method = "mcpcounter"), error = function(e) NULL)
xcell     <- tryCatch(deconvo_tme(eset = eset, method = "xcell",
                                   arrays = FALSE), error = function(e) NULL)
estimate  <- tryCatch(deconvo_tme(eset = eset, method = "estimate"), error = function(e) NULL)
timer     <- tryCatch(deconvo_tme(eset = eset, method = "timer",
                                   group_list = rep(tumor_type, ncol(eset))), error = function(e) NULL)
quantiseq <- tryCatch(deconvo_tme(eset = eset, method = "quantiseq",
                                   tumor = TRUE, arrays = FALSE, scale_mrna = TRUE), error = function(e) NULL)
ips       <- tryCatch(deconvo_tme(eset = eset, method = "ips", plot = FALSE), error = function(e) NULL)

# Merge all results by sample ID using base R merge()
results_list <- list(cibersort, cibersort_abs, mcp, xcell, epic, estimate, quantiseq, timer, ips)
results_list <- results_list[!sapply(results_list, is.null)]

tme_combine <- results_list[[1]]
for (i in seq_along(results_list)[-1]) {
  tme_combine <- merge(tme_combine, results_list[[i]], by = "ID")
}

save(tme_combine, file = "03-tme/tme_combine.RData")
write.csv(tme_combine, file = "03-tme/tme_combine.csv", row.names = FALSE)

Critical caveats:

  1. All deconvolution methods (except IPS) need linear-scale data — Anti-log first: eset_linear <- 2^eset if log2-transformed. CIBERSORT, MCPcounter, ESTIMATE, EPIC, quanTIseq, xCell all warn "Data values appear small (<50)" when fed log2 data. Only IPS works with log2 data directly.
  2. Column names have method suffixT_cells_CD8_CIBERSORT, T_cells_MCPcounter, etc.
  3. Output CSV has ID column — Set rownames: rownames(df) <- df$ID; df$ID <- NULL.
  4. arrays = TRUE for microarray; perm = 1000 recommended for CIBERSORT.
  5. Required packagespreprocessCore (CIBERSORT), limSolve (quanTIseq). Install before running.
  6. TIMER requires group_list — Provide cancer type per sample: rep(tumor_type, ncol(eset)).
  7. quanTIseq parameters — Use tumor = TRUE, arrays = FALSE, scale_mrna = TRUE for tumor samples.
  8. Use tryCatch for each method — some methods may fail on certain datasets. Wrap each call to avoid losing partial results.

Phase 3: Signature Scoring and Pathway Analysis

Script: 03-signature_analysis.R

Scoring methods (ask user to choose):

Method Best For
"ssgsea" Default, robust
"pca" Continuous scores, large gene sets
"zscore" Simple, fast
"integration" Combined approach

Step 3a: TME / IO biomarker signature scoring:

# Option A: TME signatures (186 sets, internal — use :::)
signature_tme <- IOBR:::signature_tme
sig_tme <- calculate_sig_score(pdata = NULL, eset = eset,
                                signature = signature_tme,
                                method = "ssgsea",
                                mini_gene_count = 3,
                                adjust_eset = TRUE)
write.csv(sig_tme, file = "03-tme/sig_score_tme.csv")

# Option B: Full collection (323 sets) — for comprehensive analysis
data("signature_collection")
sig_res <- calculate_sig_score(pdata = NULL, eset = eset,
                                signature = signature_collection,
                                method = "pca",
                                mini_gene_count = 2,
                                adjust_eset = TRUE)
write.csv(sig_res, file = "03-tme/sig_score_collection.csv")

Step 3b: Pathway scoring (Hallmark + GO + KEGG + Reactome):

hallmark  <- IOBR::load_data("hallmark")    # 50 pathways
go_bp     <- IOBR::load_data("go_bp")       # 7,658 gene sets
go_cc     <- IOBR::load_data("go_cc")
go_mf     <- IOBR::load_data("go_mf")
kegg      <- IOBR::load_data("kegg")        # 186 pathways
reactome  <- IOBR::load_data("reactome")    # 1,615 pathways

sig_pathway <- calculate_sig_score(pdata = NULL, eset = eset,
                                    signature = c(hallmark, go_bp, go_cc, go_mf, kegg, reactome),
                                    method = "ssgsea",
                                    mini_gene_count = 3)
write.csv(sig_pathway, file = "03-tme/sig_score_pathway.csv")

Step 3c: Merge all results into master table:

tme_sig_combine <- merge(tme_combine, sig_res, by = "ID")
tme_sig_combine <- merge(tme_sig_combine, sig_pathway, by = "ID")
write.csv(tme_sig_combine, file = "03-tme/tme_sig_combine.csv", row.names = FALSE)

Key notes:

  1. pdata = NULL, adjust_eset = TRUE — This is the correct production pattern for calculate_sig_score(). Do NOT set adjust_eset = FALSE unless data is already properly scaled.
  2. Method names are lowercase — Use "pca", "ssgsea", "zscore", "integration" (NOT uppercase "PCA").
  3. signature_tme is internal — Access via IOBR:::signature_tme (triple colon).
  4. signature_collection needs data() — Call data("signature_collection") before use.
  5. IOBR::load_data() downloads on first use — Caches locally for subsequent calls.
  6. Pathway scoring is computationally heavy — Consider running Hallmark (50) and KEGG (186) first for quick results, then GO/Reactome for comprehensive analysis.
  7. Use base R merge() for joining — Not all environments have tidyverse. Prefer merge(df1, df2, by = "ID") over %>% inner_join().
  8. tme_cluster() requires NbClust — Install: install.packages("NbClust").

Use references/iobr_built_in_data.md for the full signature catalog and data("sig_group") for filtering by category.

TME subtype clustering:

tme_clusters <- tme_cluster(input = tme_result, features = cell_types,
                             id = "sample_id", method = "kmeans")

Phase 4: Statistical Analysis

Script: 04-statistical_analysis.R

This phase produces structured output: every statistical result is saved to 04-figs/data/ as a numbered CSV file for traceability and reuse in Phase 5 visualization.

4a. Data Preparation

Merge TME results with pdata, then scale all TME cell columns:

# Load data
pdata <- read.csv("02-input/pdata.csv")
tme   <- read.csv("03-tme/tme_combine.csv")

# Merge pdata with TME results
tme_pdata <- merge(pdata, tme, by = "ID")
write.csv(tme_pdata, file = "04-figs/data/01-tme_pdata_merged.csv", row.names = FALSE)

# Collect TME cell columns — MUST match actual column names in merged data
# After merge, cibersort + cibersort_abs shared names get .x/.y suffixes
# So match by checking actual colname OR colname-without-suffix against original list
qc_patterns <- c("^P[.]", "^Correlation", "^RMSE")
all_tme_cell_cols <- colnames(tme_pdata)[
  colnames(tme_pdata) %in% original_cell_cols |
  gsub("\\.[xy]$", "", colnames(tme_pdata)) %in% original_cell_cols
]
all_tme_cell_cols <- setdiff(all_tme_cell_cols, "ID")

# Z-score scale all TME cell columns (per column = per variable)
tme_scaled <- tme_pdata
tme_scaled[, all_tme_cell_cols] <- scale(tme_scaled[, all_tme_cell_cols])
write.csv(tme_scaled, file = "04-figs/data/02-tme_scaled.csv", row.names = FALSE)

4b. TME Subtyping

Cluster samples by TME cell composition using CIBERSORT fractions:

# Extract CIBERSORT cell columns for clustering
cb_cols <- grep("_CIBERSORT$", colnames(tme_pdata), value = TRUE)
cb_cols <- setdiff(cb_cols, grep("P[.]value|Correlation|RMSE", cb_cols, value = TRUE))

# K-means clustering on scaled CIBERSORT data
set.seed(123)
tme_subtype <- tme_cluster(input = tme_scaled, features = cb_cols,
                            id = "ID", method = "kmeans", max.nc = 6)

# tme_cluster returns data.frame with columns: ID, cluster, <cell_cols>
# Extract cluster assignments and label as TME subtypes
subtype_df <- data.frame(ID = tme_subtype$ID, TME_subtype = paste0("TME", tme_subtype$cluster))
write.csv(subtype_df, file = "04-figs/data/03-tme_subtype.csv", row.names = FALSE)

# Find representative TME variables per subtype
markers <- find_markers_in_bulk(pdata = tme_scaled, eset = tme_scaled,
                                 group = "TME_subtype", nfeatures = 50)
write.csv(markers, file = "04-figs/data/03-tme_subtype_markers.csv", row.names = FALSE)

4c. Statistical Testing (conditional)

Run statistical tests based on what's available in pdata:

# --- Wilcoxon test (IF categorical variables exist in pdata) ---
# Check pdata_summary.csv for factor/binary variables
pdata_summary <- read.csv("02-input/pdata_summary.csv")
cat_vars <- pdata_summary$var[pdata_summary$type %in% c("factor", "binary")]

if (length(cat_vars) > 0) {
  for (v in cat_vars) {
    lvls <- unique(tme_pdata[[v]])
    lvls <- lvls[!is.na(lvls)]
    if (length(lvls) == 2) {
      wilcox_res <- batch_wilcoxon(data = tme_pdata, target = v, feature = cell_cols)
      write.csv(wilcox_res,
                file = paste0("04-figs/data/04-wilcoxon_", v, ".csv"),
                row.names = FALSE)
    }
  }
}

# --- Survival analysis (IF survival data exists) ---
# Check for time + status columns
time_cols   <- grep("time|Time|OS_time|PFS_time", colnames(tme_pdata), value = TRUE)
status_cols <- grep("status|Status|OS_status|PFS_status", colnames(tme_pdata), value = TRUE)

if (length(time_cols) > 0 && length(status_cols) > 0) {
  time_col   <- time_cols[1]
  status_col <- status_cols[1]

  # Batch Cox regression for all TME cell columns
  surv_res <- batch_surv(pdata = tme_pdata, variable = cell_cols,
                          time = time_col, status = status_col)
  write.csv(surv_res, file = "04-figs/data/05-surv_results.csv", row.names = FALSE)

  # Per-method survival (if multiple deconvolution methods were run)
  methods_run <- unique(gsub(".*_", "", cell_cols))
  for (m in methods_run) {
    m_cols <- grep(paste0("_", m, "$"), cell_cols, value = TRUE)
    if (length(m_cols) > 0) {
      m_surv <- batch_surv(pdata = tme_pdata, variable = m_cols,
                           time = time_col, status = status_col)
      write.csv(m_surv,
                file = paste0("04-figs/data/05-surv_", m, ".csv"),
                row.names = FALSE)
    }
  }
}

Key notes for survival:

  • batch_surv() requires merged pdata — All variables must be columns in pdata. Use merge() first.
  • batch_surv() parameter is variable — A vector of column names, NOT a separate data frame.
  • batch_surv() output columns: P, HR, CI_low_0.95, CI_up_0.95, ID.

4d. Correlation

Correlation matrix of all TME variables:

# batch_cor(data, target, feature, method) — correlates one target vs multiple features
# NOTE: batch_cor() returns target-vs-each-feature, NOT a pairwise matrix.
# For a full pairwise correlation matrix, use base R cor() instead (see below).
# Key immune cells as target, all TME cells as features
key_cells <- c("T_cells_CD8_CIBERSORT", "Macrophages_M1_CIBERSORT")
cor_all <- data.frame()
for (tc in key_cells) {
  cor_res <- batch_cor(data = tme_pdata, target = tc, feature = cell_cols, method = "spearman")
  cor_res$target <- tc
  cor_all <- rbind(cor_all, cor_res)
}
write.csv(cor_all, file = "04-figs/data/06-cor_matrix.csv", row.names = FALSE)

# For full pairwise correlation matrix (for heatmap visualization), use base R:
cor_mat <- cor(tme_pdata[, cell_cols], method = "spearman", use = "pairwise.complete.obs")
save(cor_mat, file = "04-figs/data/06-cor_matrix_full.RData")

4e. Other Statistical Functions

Function Purpose Output
batch_pcc() Batch partial correlation controlling for a third variable Statistics table
iobr_deg() DEG using DESeq2 or limma, with built-in volcano plot and heatmap Statistics + figures
sig_gsea() Gene Set Enrichment Analysis via fgsea, with built-in plots Statistics + figures
subgroup_survival() Subgroup Cox analysis by covariates Statistics table
LR_cal() Ligand-receptor interaction analysis Statistics + figures
find_mutations() Mutation-TME interaction (TCGA only) Statistics table
feature_manipulation() Clean features: remove NA, outliers, zero-variance Cleaned data

Phase 5: Visualization

Script: 05-visualization.R

ALL TME data MUST be z-score scaled before visualization. Statistical result tables are in 04-figs/data/.

Every figure MUST:

  1. Use IOBR's built-in visualization functions whenever available
  2. Export as both PNG (300dpi) + PDF
  3. Follow the strict ordering below

IMPORTANT: Add pdf(NULL) at the top of the visualization script to suppress stray Rplots.pdf from IOBR internal plot calls. Also clean up at the end: if (file.exists("Rplots.pdf")) file.remove("Rplots.pdf").

sig_box Standard Parameters (MUST use for ALL boxplots)

Every sig_box() call in this pipeline MUST use these fixed parameters:

sig_box(data = data, signature = feature_col, variable = group_col,
        palette = "paired1", show_pairwise_p = TRUE,
        angle_x_text = 60, hjust = 1,
        size_of_pvalue = 3, size_of_font = 5)

For multi-panel layouts, combine individual sig_box() plots with patchwork::wrap_plots().

Fig 01: CIBERSORT Barplot (22 cell types)

IMPORTANT: Only methods that produce cell FRACTION data (CIBERSORT, EPIC) should use cell_bar_plot(). ESTIMATE scores are NOT fractions.

tme_cb <- read.csv("03-tme/tme_cibersort.csv")

# Filter QC columns
qc_cols <- grep("^[Pp][.]", colnames(tme_cb), value = TRUE)
qc_cols <- c(qc_cols, grep("^Correlation", colnames(tme_cb), value = TRUE),
                  grep("^RMSE", colnames(tme_cb), value = TRUE))
cell_cols_cb <- setdiff(grep("_CIBERSORT$", colnames(tme_cb), value = TRUE), qc_cols)

p1 <- cell_bar_plot(input = tme_cb, id = "ID", features = cell_cols_cb,
                     title = "CIBERSORT TME Composition",
                     coord_flip = TRUE, palette = 4, legend.position = "right")
p1 <- p1 + guides(fill = guide_legend(ncol = 1)) +
     theme(legend.text = element_text(size = 5),
           legend.key.size = unit(2.5, "mm"),
           legend.title = element_blank(),
           plot.title = element_text(size = 10, hjust = 0.5))
if (nrow(tme_cb) > 50) {
  p1 <- p1 + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
}

ggsave("04-figs/Fig01-barplot_cibersort.png", p1, width = 180, height = 250, units = "mm", dpi = 300)
ggsave("04-figs/Fig01-barplot_cibersort.pdf", p1, width = 180, height = 250, units = "mm")

Fig 02: EPIC Barplot (8 cell types)

tme_ep <- read.csv("03-tme/tme_epic.csv")
epic_cols <- grep("_EPIC$", colnames(tme_ep), value = TRUE)

p2 <- cell_bar_plot(input = tme_ep, id = "ID", features = epic_cols,
                     title = "EPIC TME Composition",
                     coord_flip = TRUE, palette = 4, legend.position = "right")
p2 <- p2 + guides(fill = guide_legend(ncol = 1)) +
     theme(legend.text = element_text(size = 5),
           legend.key.size = unit(2.5, "mm"),
           legend.title = element_blank(),
           plot.title = element_text(size = 10, hjust = 0.5))

ggsave("04-figs/Fig02-barplot_epic.png", p2, width = 180, height = 250, units = "mm", dpi = 300)
ggsave("04-figs/Fig02-barplot_epic.pdf", p2, width = 180, height = 250, units = "mm")

Fig 03a-c: TME Heatmaps WITHOUT Subtype (z-score)

Cluster both rows and columns, no subtype annotation. Use pheatmap.

plot_heatmap_no_subtype <- function(data, cols, title, prefix,
                                     width = 8, height = 6) {
  mat <- t(as.matrix(data[, cols]))  # variables as rows, samples as columns
  mat[mat > 2] <- 2; mat[mat < -2] <- -2
  rownames(mat) <- sub("\\.[xy]$", "", rownames(mat))
  colnames(mat) <- data$ID

  for (fmt in c("pdf", "png")) {
    pheatmap::pheatmap(mat,
      color = colorRampPalette(c("#2166AC", "white", "#B2182B"))(100),
      show_colnames = FALSE, cluster_rows = TRUE, cluster_cols = TRUE,
      treeheight_row = 15, treeheight_col = 15, fontsize_row = 8,
      main = title,
      filename = paste0("04-figs/", prefix, ".", fmt),
      width = width, height = height, dpi = 300)
  }
}

# Fig03a: CIBERSORT
plot_heatmap_no_subtype(tme_scaled, cb_sc, "CIBERSORT TME Landscape",
                        "Fig03a-heatmap_cibersort")

# Fig03b: MCPcounter
mcp_sc <- grep("MCPcounter$", colnames(tme_scaled), value = TRUE)
plot_heatmap_no_subtype(tme_scaled, mcp_sc, "MCPcounter TME Landscape",
                        "Fig03b-heatmap_mcpcounter", width = 6, height = 5)

# Fig03c: xCell
xc_sc <- grep("xCell$", colnames(tme_scaled), value = TRUE)
plot_heatmap_no_subtype(tme_scaled, xc_sc, "xCell TME Landscape",
                        "Fig03c-heatmap_xcell", width = 10, height = 8)

Fig 04a-d: Survival Forest Plots (IF survival data exists)

USE sig_forest() — IOBR's built-in forest plot function. MUST filter extreme HR before plotting: remove variables with abs(HR) > 10 to prevent sig_forest() from producing blank plots (its coord_trans(x = "log2") only has 3 fixed breaks that break with extreme HR ranges).

if (length(time_cols) > 0 && length(status_cols) > 0) {
  time_col   <- time_cols[1]
  status_col <- status_cols[1]

  # Fig04a: CIBERSORT survival forest
  cb_surv <- batch_surv(pdata = tme_pdata, variable = cb_sc,
                        time = time_col, status = status_col)
  write.csv(cb_surv, file = "04-figs/data/05-surv_cibersort.csv", row.names = FALSE)
  cb_surv_filt <- cb_surv[abs(as.numeric(cb_surv$HR)) <= 10, ]
  if (nrow(cb_surv_filt) > 0) {
    p04a <- sig_forest(cb_surv_filt, signature = "ID", n = 20)
    ggsave("04-figs/Fig04a-forest_cibersort.png", p04a, width = 10, height = 6, units = "in", dpi = 300)
    ggsave("04-figs/Fig04a-forest_cibersort.pdf", p04a, width = 10, height = 6)
  }

  # Fig04b: MCPcounter + EPIC + xCell survival forest
  other_cols <- c(mcp_sc, epic_sc, xc_sc)
  other_surv <- batch_surv(pdata = tme_pdata, variable = other_cols,
                           time = time_col, status = status_col)
  write.csv(other_surv, file = "04-figs/data/05-surv_other_tme.csv", row.names = FALSE)
  other_surv_filt <- other_surv[abs(as.numeric(other_surv$HR)) <= 10, ]
  if (nrow(other_surv_filt) > 0) {
    p04b <- sig_forest(other_surv_filt, signature = "ID", n = 20)
    ggsave("04-figs/Fig04b-forest_other_tme.png", p04b, width = 10, height = 6, units = "in", dpi = 300)
    ggsave("04-figs/Fig04b-forest_other_tme.pdf", p04b, width = 10, height = 6)
  }

  # Fig04c: TME signatures survival forest
  tme_sig_surv <- batch_surv(pdata = tme_pdata, variable = tme_sig_cols,
                             time = time_col, status = status_col)
  write.csv(tme_sig_surv, file = "04-figs/data/05-surv_tme_signatures.csv", row.names = FALSE)
  tme_sig_surv_filt <- tme_sig_surv[abs(as.numeric(tme_sig_surv$HR)) <= 10, ]
  if (nrow(tme_sig_surv_filt) > 0) {
    p04c <- sig_forest(tme_sig_surv_filt, signature = "ID", n = 20)
    ggsave("04-figs/Fig04c-forest_tme_signature.png", p04c, width = 10, height = 6, units = "in", dpi = 300)
    ggsave("04-figs/Fig04c-forest_tme_signature.pdf", p04c, width = 10, height = 6)
  }

  # Fig04d: GO/KEGG/Hallmark survival forest
  pathway_surv <- batch_surv(pdata = tme_pdata, variable = pathway_cols,
                             time = time_col, status = status_col)
  write.csv(pathway_surv, file = "04-figs/data/05-surv_pathways.csv", row.names = FALSE)
  pathway_surv_filt <- pathway_surv[abs(as.numeric(pathway_surv$HR)) <= 10, ]
  if (nrow(pathway_surv_filt) > 0) {
    p04d <- sig_forest(pathway_surv_filt, signature = "ID", n = 20)
    ggsave("04-figs/Fig04d-forest_go_kegg.png", p04d, width = 10, height = 6, units = "in", dpi = 300)
    ggsave("04-figs/Fig04d-forest_go_kegg.pdf", p04d, width = 10, height = 6)
  }
}

Key notes:

  • MUST filter abs(HR) > 10 before sig_forest() — extreme HR values (e.g., 0.001–190) cause sig_forest() to produce blank plots because its internal coord_trans(x = "log2") only has 3 fixed axis breaks.
  • Save full (unfiltered) survival results to 04-figs/data/ — filtering is only for visualization.
  • Split into 4 groups: CIBERSORT, MCPcounter+EPIC+xCell, TME Signatures, GO/KEGG/Hallmark.

Fig 05: TME Cell Correlation Matrix Heatmap

Input: CIBERSORT + MCPcounter + ESTIMATE variables. Use pheatmap on Spearman correlation matrix.

load("04-figs/data/06-cor_matrix_full.RData")

# Deduplicate: keep .x (cibersort), drop .y (cibersort_abs)
keep <- !grepl("\\.[xy]$", rownames(cor_mat)) | grepl("\\.x$", rownames(cor_mat))
cor_mat <- cor_mat[keep, keep]
rownames(cor_mat) <- sub("\\.x$", "", rownames(cor_mat))
colnames(cor_mat) <- sub("\\.x$", "", colnames(cor_mat))

for (fmt in c("pdf", "png")) {
  pheatmap::pheatmap(cor_mat, fontsize = 6, fontsize_row = 5, fontsize_col = 5,
                      color = colorRampPalette(c("#2166AC", "white", "#B2182B"))(100),
                      border_color = NA, main = "TME Cell Correlation Matrix",
                      filename = paste0("04-figs/Fig05-cor_matrix.", fmt),
                      width = 8, height = 7, dpi = 300)
}

Fig 06a-d: TME Heatmaps BY Subtype + KM Plot

Heatmap dimension rules (MUST follow):

Element Position Parameter
Variables (cell types) Rows (y-axis) rownames(mat) — shows readable names
Samples Columns (x-axis) show_colnames=FALSE
Subtype color bar Top (column annotation) annotation_col
Subtype gaps Column gaps gaps_col
Clustering Rows only cluster_rows=TRUE, cluster_cols=FALSE
plot_subtype_heatmap <- function(data, cols, subtype_col, title, prefix,
                                  width = 8, height = 6) {
  data <- data[order(data[[subtype_col]]), ]
  mat <- t(as.matrix(data[, cols]))  # variables -> rows, samples -> columns
  mat[mat > 2] <- 2; mat[mat < -2] <- -2
  rownames(mat) <- sub("\\.[xy]$", "", rownames(mat))
  colnames(mat) <- data$ID

  ann_col <- data.frame(Subtype = data[[subtype_col]], row.names = data$ID)
  subtypes <- sort(unique(data[[subtype_col]]))
  sub_colors <- c("#E64B35", "#4DBBD5", "#00A087", "#3C5488", "#F39B7F", "#8491B6")
  ann_colors <- list(Subtype = setNames(sub_colors[seq_along(subtypes)], subtypes))
  gaps <- cumsum(table(data[[subtype_col]]))[-length(subtypes)]

  for (fmt in c("pdf", "png")) {
    pheatmap::pheatmap(mat, annotation_col = ann_col, annotation_colors = ann_colors,
      color = colorRampPalette(c("#2166AC", "white", "#B2182B"))(100),
      show_colnames = FALSE, cluster_rows = TRUE, cluster_cols = FALSE,
      treeheight_row = 15, gaps_col = gaps, fontsize_row = 8,
      main = title,
      filename = paste0("04-figs/", prefix, ".", fmt), width = width, height = height, dpi = 300)
  }
}

# Fig06a: CIBERSORT by subtype
hm_data_cb <- tme_scaled[, c("ID", cb_sc, "TME_subtype")]
plot_subtype_heatmap(hm_data_cb, cb_sc, "TME_subtype",
  "CIBERSORT TME by Subtype", "Fig06a-heatmap_cibersort_subtype")

# Fig06b: MCPcounter by subtype
hm_data_mcp <- tme_scaled[, c("ID", mcp_sc, "TME_subtype")]
plot_subtype_heatmap(hm_data_mcp, mcp_sc, "TME_subtype",
  "MCPcounter TME by Subtype", "Fig06b-heatmap_mcpcounter_subtype", width = 6, height = 5)

# Fig06c: xCell by subtype
xc_sc_sub <- intersect(xc_sc, colnames(tme_scaled))
hm_data_xc <- tme_scaled[, c("ID", xc_sc_sub, "TME_subtype")]
plot_subtype_heatmap(hm_data_xc, xc_sc_sub, "TME_subtype",
  "xCell TME by Subtype", "Fig06c-heatmap_xcell_subtype", width = 10, height = 8)

# Fig06d: KM plot by subtype (IF survival data exists)
# sig_surv_plot does NOT work with arbitrary categorical variables.
# Use survminer directly for KM by TME subtype.
if (length(time_cols) > 0 && length(status_cols) > 0) {
  library(survival)
  library(survminer)
  surv_data <- read.csv("04-figs/data/01-tme_pdata_merged.csv")
  dir.create("04-figs/kmplot", showWarnings = FALSE, recursive = TRUE)

  surv_obj <- Surv(time = surv_data[[time_col]], event = surv_data[[status_col]])
  km_fit <- survfit(surv_obj ~ TME_subtype, data = surv_data)
  p_km <- ggsurvplot(km_fit, data = surv_data,
    pval = TRUE, conf.int = FALSE,
    palette = c("#E64B35", "#4DBBD5", "#00A087", "#3C5488", "#F39B7F", "#8491B6"),
    legend.title = "TME Subtype", legend.labs = sort(unique(surv_data$TME_subtype)),
    xlab = "Time (months)", ylab = "Overall Survival",
    risk.table = TRUE, fontsize = 3, surv.median.line = "hv")

  pdf("04-figs/Fig06d-km_subtype.pdf", width = 7, height = 6)
  print(p_km)
  dev.off()
  png("04-figs/Fig06d-km_subtype.png", width = 7, height = 6, units = "in", res = 300)
  print(p_km)
  dev.off()
}

pheatmap notes:

  • Do NOT re-scale — data is already z-scored in Phase 4. Just cap at ±2.
  • treeheight_row = 15 — shrink clustering tree.
  • Must transpose — pheatmap needs variables as rows for readable y-axis labels.
  • cluster_cols = FALSE — columns ordered by subtype, clustering would destroy ordering.
  • CRITICAL: Do NOT use grep("CIBERSORT") in merged data — cibersort + cibersort_abs share cell names. Read from original cibersort CSV instead.

Fig 07a-e: Most Significant Positive Variable Per Subtype (sig_box + patchwork)

For each method group, find the most significant and positive variable for each TME subtype, create one sig_box per subtype, and combine with patchwork.

library(patchwork)

find_top_positive_per_subtype <- function(data, cell_cols, subtype_col = "TME_subtype") {
  subtypes <- sort(unique(data[[subtype_col]]))
  result <- character(0)
  for (st in subtypes) {
    st_data <- data[data[[subtype_col]] == st, cell_cols, drop = FALSE]
    cell_means <- colMeans(st_data, na.rm = TRUE)
    result[st] <- names(sort(cell_means, decreasing = TRUE))[1]
  }
  return(result)
}

make_subtype_boxplot_panel <- function(data, method_cols, subtype_col, title, filename) {
  top_cells <- find_top_positive_per_subtype(data, method_cols, subtype_col)
  subtypes <- names(top_cells)

  box_list <- lapply(subtypes, function(st) {
    col <- top_cells[st]
    clean_name <- sub("\\.[xy]$", "", col)
    sig_box(data = data, signature = col, variable = subtype_col,
            palette = "paired1", show_pairwise_p = TRUE,
            angle_x_text = 60, hjust = 1,
            size_of_pvalue = 3, size_of_font = 5) +
      ggtitle(paste0(st, ": ", clean_name)) +
      theme(axis.title = element_text(size = 6),
            axis.text.y = element_text(size = 5),
            plot.title = element_text(size = 7),
            plot.margin = margin(12, 5, 5, 5))
  })

  p <- wrap_plots(box_list, ncol = length(subtypes))
  ggsave(paste0("04-figs/", filename, ".png"), p,
         width = 50 * length(subtypes), height = 100, units = "mm", dpi = 300)
  ggsave(paste0("04-figs/", filename, ".pdf"), p,
         width = 50 * length(subtypes), height = 100, units = "mm")
}

# Fig07a: CIBERSORT
make_subtype_boxplot_panel(tme_scaled, cb_sc, "TME_subtype",
                           "CIBERSORT Subtype Boxplots", "Fig07a-boxplot_cibersort_subtype")

# Fig07b: MCPcounter
make_subtype_boxplot_panel(tme_scaled, mcp_sc, "TME_subtype",
                           "MCPcounter Subtype Boxplots", "Fig07b-boxplot_mcpcounter_subtype")

# Fig07c: xCell
make_subtype_boxplot_panel(tme_scaled, xc_sc, "TME_subtype",
                           "xCell Subtype Boxplots", "Fig07c-boxplot_xcell_subtype")

# Fig07d: TME signatures
# Identify TME signature columns from Phase 3 output
tme_sig_sc <- colnames(tme_scaled)[colnames(tme_scaled) %in% sig_tme_names]
if (length(tme_sig_sc) > 0) {
  make_subtype_boxplot_panel(tme_scaled, tme_sig_sc, "TME_subtype",
                             "TME Signatures Subtype Boxplots", "Fig07d-boxplot_tme_signature_subtype")
}

# Fig07e: GO/KEGG/Hallmark
pathway_sc <- grep("^HALLMARK_|^KEGG_|^GO_|^REACTOME_", colnames(tme_scaled), value = TRUE)
if (length(pathway_sc) > 0) {
  make_subtype_boxplot_panel(tme_scaled, pathway_sc, "TME_subtype",
                             "Pathway Subtype Boxplots", "Fig07e-boxplot_pathway_subtype")
}

Fig 08a-b: Top 10 Differential Boxplots (IF categorical variables exist)

For each binary variable: batch_wilcoxon() → top 10 → sig_box + patchwork (2×5). For each multi-level categorical variable: batch_kruskal() → top 10 → sig_box + patchwork (2×5).

pdata_summary <- read.csv("02-input/pdata_summary.csv")
cat_vars <- pdata_summary$var[pdata_summary$type %in% c("factor", "binary")]
fig_idx <- 0

if (length(cat_vars) > 0) {
  # Collect ALL TME variables (deconvolution + signatures + pathways)
  # MUST filter to NUMERIC columns only — batch_wilcoxon/batch_kruskal require numeric features
  pdata_col_set <- colnames(pdata)
  all_tme_vars <- setdiff(colnames(tme_pdata), pdata_col_set)
  all_tme_vars <- setdiff(all_tme_vars, "ID")
  # Keep only columns that are numeric
  all_tme_vars <- all_tme_vars[sapply(tme_pdata[, all_tme_vars], is.numeric)]

  for (v in cat_vars) {
    lvls <- unique(tme_pdata[[v]])
    lvls <- lvls[!is.na(lvls)]

    if (length(lvls) >= 2) {
      fig_idx <- fig_idx + 1
      fig_letter <- letters[fig_idx]  # a, b, c, ...

      if (length(lvls) == 2) {
        # Binary: Wilcoxon
        diff_res <- batch_wilcoxon(data = tme_pdata, target = v, feature = all_tme_vars)
      } else {
        # Multi-level: Kruskal
        diff_res <- batch_kruskal(data = tme_pdata, target = v, feature = all_tme_vars)
      }

      diff_res <- diff_res[order(diff_res$p.value), ]
      top10 <- head(diff_res, 10)
      write.csv(diff_res, file = paste0("04-figs/data/04-diff_", v, ".csv"), row.names = FALSE)

      if (nrow(top10) > 0) {
        box_list <- lapply(seq_len(nrow(top10)), function(i) {
          sig_box(data = tme_pdata, signature = top10$sig_names[i],
                  variable = v, palette = "paired1", show_pairwise_p = TRUE,
                  angle_x_text = 60, hjust = 1,
                  size_of_pvalue = 3, size_of_font = 5) +
            theme(axis.title = element_text(size = 5),
                  axis.text.y = element_text(size = 5),
                  plot.title = element_text(size = 5),
                  plot.margin = margin(12, 5, 5, 5))
        })

        p_combined <- wrap_plots(box_list, ncol = 5, nrow = 2)
        ggsave(paste0("04-figs/Fig08", fig_letter, "-top10_", v, ".png"), p_combined,
               width = 250, height = 220, units = "mm", dpi = 300)
        ggsave(paste0("04-figs/Fig08", fig_letter, "-top10_", v, ".pdf"), p_combined,
               width = 250, height = 220, units = "mm")
      }
    }
  }
}

Key notes:

  • Binary variable (2 levels): use batch_wilcoxon().
  • Multi-level variable (≥3 levels): use batch_kruskal().
  • Both return tibble with columns: sig_names, p.value, statistic, p.adj, log10pvalue, stars.
  • One figure per categorical variable: Fig08a for variable1, Fig08b for variable2, etc.

Fig 09: 8 KM Plots (IF survival data exists)

Select most positive (lowest HR, protective) and most negative (highest HR, risk) variable from each of 4 groups:

  1. CIBERSORT → a (most positive), b (most negative)
  2. Other TME methods → c, d
  3. TME signatures → e, f
  4. GO/KEGG/Hallmark → g, h

Total: up to 8 KM plots saved to 04-figs/kmplot/.

if (length(time_cols) > 0 && length(status_cols) > 0) {
  dir.create("04-figs/kmplot", showWarnings = FALSE, recursive = TRUE)

  find_extreme_hr <- function(surv_res, group_cols) {
    if (is.null(surv_res) || nrow(surv_res) == 0) return(character(0))
    group_res <- surv_res[surv_res$ID %in% group_cols, ]
    if (nrow(group_res) == 0) return(character(0))
    group_res <- group_res[complete.cases(group_res[, c("HR", "P")]), ]
    if (nrow(group_res) == 0) return(character(0))
    # Most positive (lowest HR = protective), most negative (highest HR = risk)
    most_pos <- group_res$ID[which.min(group_res$HR)]
    most_neg <- group_res$ID[which.max(group_res$HR)]
    return(c(most_pos, most_neg))
  }

  # Collect all survival results
  surv_all <- rbind(cb_surv, other_surv, tme_sig_surv, pathway_surv)
  km_vars <- character(0)

  # Group 1: CIBERSORT
  km_vars <- c(km_vars, find_extreme_hr(surv_all, cb_sc))
  # Group 2: Other TME methods
  other_sc <- c(mcp_sc, epic_sc, xc_sc)
  km_vars <- c(km_vars, find_extreme_hr(surv_all, other_sc))
  # Group 3: TME signatures
  km_vars <- c(km_vars, find_extreme_hr(surv_all, tme_sig_cols))
  # Group 4: GO/KEGG/Hallmark
  km_vars <- c(km_vars, find_extreme_hr(surv_all, pathway_cols))

  km_vars <- unique(km_vars[nchar(km_vars) > 0])

  surv_data <- read.csv("04-figs/data/01-tme_pdata_merged.csv")

  for (i in seq_along(km_vars)) {
    sig_surv_plot(input_pdata = surv_data, signature = km_vars[i],
                  time = time_col, status = status_col, time_type = "month",
                  palette = "jama", mini_sig = "score",
                  fig.type = "pdf", save_path = "04-figs/kmplot",
                  project = paste0("KM_", i), index = i)
  }
}

Key notes:

  • sig_surv_plot() saves its own output — Do NOT wrap with ggsave().
  • KM plots saved to 04-figs/kmplot/ — each variable generates PDF + RData.
  • mini_sig = "score" — continuous variable, auto-split by median.
  • time_type: "day", "month", or "year".

Clean up stray Rplots.pdf (generated by some IOBR internal plot calls)

Suppress with pdf(NULL) at top of visualization script

if (file.exists("Rplots.pdf")) file.remove("Rplots.pdf")

Additional Visualization Functions

Use these as needed based on analysis requirements:

Function Purpose Output
roc_time() Time-dependent ROC with AUC at multiple time points Figure (self-saving)
sig_roc() Multiple ROC curves for binary response prediction Figure
batch_sig_surv_plot() Batch KM curves for multiple signatures across projects Figures (self-saving)
surv_group() KM curve with risk table (High vs Low) Figure
subgroup_survival() Subgroup Cox analysis by covariates Statistics table
get_cor() Single pairwise correlation with scatter plot + regression line Figure
iobr_cor_plot() Batch correlation visualization (signature vs signature/phenotype) Figure
iobr_pca() PCA dimensionality reduction with scatter plot Figure
iobr_deg() DEG with built-in volcano plot and heatmap output Statistics + figures
sig_gsea() Gene Set Enrichment Analysis via fgsea, with built-in plots Statistics + figures
LR_cal() Ligand-receptor interaction analysis Statistics + figures
# Time-dependent ROC — saves its own output
roc_time(input = tme_pdata, vars = c("T_cells_CD8_CIBERSORT", "TMEscoreA_CIR"),
         time = "OS_time", status = "OS_status",
         time_point = c(12, 36, 60), time_type = "month", palette = "jama",
         path = "04-figs", fig.type = "png", index = 1)

# PCA visualization
iobr_pca(data = eset, pdata = pdata, group = "TME_subtype", scale = TRUE)

# Single correlation scatter plot
get_cor(data = tme_pdata, variable1 = "T_cells_CD8_CIBERSORT",
        variable2 = "CD_8_T_effector", method = "spearman")

IOBR Palettes

Categorical (palette_group parameter):

Name Best For
"jama" Default, 2–5 groups
"jco" Clinical oncology
"nrc" Nature publications
"aaas" Science journal
"accent" Colorblind-friendly
"set2" Many groups

Heatmap (palette parameter, numeric 1–6):

# Style
2 Blue-white-red diverging (default)
1 Classic pheatmap
3 Purple-orange
4–6 Sequential gradients

Cell bar plot (cell_bar_plot() palette parameter, numeric 1–4):

# Style
4 Recommended — rich random palette with many distinct colors
1–3 Alternative random palettes

If unsure which palette, present 3 options to the user and ask.


User Decision Points

Pause and ask at:

  1. Phase 0: Review and approve analysis plan (IOBR-pipeline.md)
  2. Phase 1: Data type (Count/TPM/Array) and species (hsa/mmu)
  3. Phase 2: Which deconvolution method(s)
  4. Phase 3: Scoring method and which signature collection
  5. Phase 5: Color palette preference

Phase 6: Wrap-Up Note (MANDATORY — execute after ALL other phases complete)

Do NOT skip this phase. After Phase 5 visualization finishes and all log files are verified, you MUST:

  1. List actual files in 04-figs/, 03-tme/, 02-input/, and 04-figs/data/ to build the tree from reality
  2. Write 05-note/IOBR-analysis-README.md with:
    • IOBR citation header at top
    • ASCII tree diagram showing ACTUAL outputs (not template placeholders)
    • Brief interpretation of key findings (deconvolution method count, subtype count, top survival hits, available figures)
  3. Count figures, tables, and log files — report these numbers in the summary

Key Findings to Extract (auto-detect from output files)

Before writing the note, scan the output:

# Count figures generated
echo "=== Figures ===" && ls 04-figs/Fig*.png | wc -l

# Deconvolution methods succeeded
echo "=== TME CSVs ===" && ls 03-tme/tme_*.csv

# Survival data availability
echo "=== Survival ===" && ls 04-figs/data/05-surv_*.csv 2>/dev/null || echo "No survival data"

# Subtype info
echo "=== Subtypes ===" && head -5 04-figs/data/03-tme_subtype.csv

# Log file check
echo "=== Logs ===" && wc -l 06-log/*.log

Template for 05-note/IOBR-analysis-README.md

# IOBR Citation:
# Zeng DQ, Fang YR, … , Yu GC, Liao WJ.
# Enhancing Immuno-Oncology Investigations Through Multidimensional Decoding of
# Tumour Microenvironment with IOBR 2.0, Cell Reports Methods, 2024
# https://doi.org/10.1016/j.crmeth.2024.100910

# <Project Name> — TME Analysis Summary

## Analysis Summary

- Dataset: <N> samples, <data_type>
- Deconvolution: <methods used, comma-separated>
- Signature scoring: <method>
- TME subtypes: <N> subtypes identified (K-means on CIBERSORT)

## Key Findings

<3-5 bullet points summarizing the most important results>

## Output Directory Tree

project/ ├── 01-script/ │ ├── 01-data_preprocessing.R │ ├── 02-tme_deconvolution.R │ ├── 03-signature_analysis.R │ ├── 04-statistical_analysis.R │ └── 05-visualization.R ├── 02-input/ │ ├── annotated_eset.csv │ ├── pdata.csv │ └── pdata_summary.csv ├── 03-tme/ │ ├── tme_cibersort.csv # CIBERSORT: 22 cell types │ ├── tme_mcpcounter.csv # MCPcounter: 8 cell populations │ ├── tme_estimate.csv # ESTIMATE: 4 scores │ ├── sig_score_tme.csv # TME signature scores (ssGSEA) │ └── tme_sig_combine.csv # Merged master table ├── 04-figs/ │ ├── Fig01-barplot_cibersort.png/pdf │ ├── Fig02-barplot_epic.png/pdf │ ├── Fig03a-heatmap_cibersort.png/pdf │ ├── Fig03b-heatmap_mcpcounter.png/pdf │ ├── Fig03c-heatmap_xcell.png/pdf │ ├── Fig04a-forest_cibersort.png/pdf # (IF survival) │ ├── Fig04b-forest_other_tme.png/pdf # (IF survival) │ ├── Fig04c-forest_tme_signature.png/pdf # (IF survival) │ ├── Fig04d-forest_go_kegg.png/pdf # (IF survival) │ ├── Fig05-cor_matrix.png/pdf │ ├── Fig06a-heatmap_cibersort_subtype.png/pdf │ ├── Fig06b-heatmap_mcpcounter_subtype.png/pdf │ ├── Fig06c-heatmap_xcell_subtype.png/pdf │ ├── Fig06d-km_subtype.png/pdf # (IF survival) │ ├── Fig07a-boxplot_cibersort_subtype.png/pdf │ ├── Fig07b-boxplot_mcpcounter_subtype.png/pdf │ ├── Fig07c-boxplot_xcell_subtype.png/pdf │ ├── Fig07d-boxplot_tme_signature_subtype.png/pdf │ ├── Fig07e-boxplot_pathway_subtype.png/pdf │ ├── Fig08a-top10_.png/pdf # (IF categorical) │ ├── Fig08b-top10_.png/pdf # (IF categorical) │ ├── kmplot/ # KM plots (IF survival) │ └── data/ # Statistical result tables ├── 05-note/ │ ├── IOBR-pipeline.md │ └── IOBR-analysis-README.md └── 06-log/ ├── 01-data_preprocessing.log ├── 02-tme_deconvolution.log ├── 03-signature_analysis.log ├── 04-statistical_analysis.log └── 05-visualization.log

After writing, present a one-paragraph summary to the user listing figure count, log status, and key findings.


06-log Directory (MANDATORY)

Every R script MUST be executed with log capture. No exceptions.

  • One log file per script, named to match: 01-data_preprocessing.log, 02-tme_deconvolution.log, ...
  • Each log MUST begin with the IOBR citation header (same as scripts).
  • Logs capture both stdout and stderr for debugging and reproducibility.
  • After running all scripts, verify: ls -la 06-log/ should show 5 non-empty log files.
  • If a script fails, check the log file for the error message.

Run EVERY script with this pattern:

mkdir -p 06-log

# Phase 1
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/01-data_preprocessing.R; } > 06-log/01-data_preprocessing.log 2>&1

# Phase 2
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/02-tme_deconvolution.R; } > 06-log/02-tme_deconvolution.log 2>&1

# Phase 3
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/03-signature_analysis.R; } > 06-log/03-signature_analysis.log 2>&1

# Phase 4
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/04-statistical_analysis.R; } > 06-log/04-statistical_analysis.log 2>&1

# Phase 5
{ echo "# IOBR Citation:"; echo "# Zeng DQ et al., Cell Reports Methods, 2024"; echo "# https://doi.org/10.1016/j.crmeth.2024.100910"; echo ""; Rscript 01-script/05-visualization.R; } > 06-log/05-visualization.log 2>&1

Quick Start Checklist

  1. Check IOBR installation in base environment
  2. Create project directory structure (01-script through 06-log + 04-figs/data + 04-figs/kmplot)
  3. Confirm data source, type, and species with user
  4. Generate 05-note/IOBR-pipeline.md — present to user for approval
  5. Write 01-data_preprocessing.R (with citation header), run with log capture to 06-log/
  6. Verify sample matching (eset vs pdata), generate 02-input/pdata_summary.csv
  7. Ask user to choose deconvolution method(s)
  8. Write 02-tme_deconvolution.R, run with log capture to 06-log/
  9. Ask user about scoring method and gene sets
  10. Write 03-signature_analysis.R, run with log capture to 06-log/
  11. Write 04-statistical_analysis.R — merge, scale, cluster, test, correlate, save to 04-figs/data/
  12. Write 05-visualization.R — follow Fig01→09 ordering with conditional figures (survival, categorical)
  13. Run 05-visualization.R with log capture to 06-log/
  14. Phase 6 (MANDATORY): List actual outputs → count figures/tables/logs → extract key findings → write 05-note/IOBR-analysis-README.md with tree + summary
  15. Verify all 5 log files in 06-log/ are non-empty with citation headers

References (in this skill directory)

  • references/functions.md — IOBR function parameter reference
  • references/palettes.md — Color palette selection guide
  • references/iobr_built_in_data.md — Complete catalog of built-in signatures, annotations, and datasets
  • references/iobr_pipeline_template.R — Full production pipeline template (9-method deconvolution + signature + pathway scoring)

Read these when you need specific parameter details, palette options, want to know what signatures are available, or need a complete pipeline template to adapt.