| output |
|
|---|
Long-read assembly reveals vast isoform diversity in the placenta associated with metabolic and endocrine function
Overview: This section prepares the orthogonal datasets required for long-read transcriptome assembly validation and quality control. The pipeline integrates multiple data types to provide comprehensive support for transcript structure validation:
- Short-read RNA-seq (GUSTO, N=200): Provides splice junction validation and read coverage support
- CAGE-seq (FANTOM5): Identifies authentic transcription start sites (TSS)
- DNase-seq (ENCODE): Maps open chromatin regions associated with active promoters
- Poly(A) sites (APASdb + PolyA_DB): Validates transcription termination sites (TTS)
Short-read RNA-seq from term, villous placenta from GUSTO (N = 200) were aligned to hg38 using STAR with settings recommended by ENCODE. Splice junctions with < 10 supporting reads were filtered. CAGE-seq reads from placenta tissue (N = 1) and cells (pericyte, N = 2; epithelial, N = 4, trophoblast, N = 1) were retrieved from the FANTOM5 Project, trimmed with cutadapt v4.1 to remove linker adapters, EcoP15 and 5'poly-G sequences, and filtered reads were aligned to h38 using STAR with settings recommended by ENCODE. CAGE tags were counted and clustered using CAGEr in R. DNAse-seq peaks mapped to h38 were retrieved in BED format from the ENCODE consortium and intersected with bedtools. SAPAS peaks in placenta tissue mapped to hg38 were retrieved in BED format from APASdb, and human polyA motifs annotated in PolyA_DB were retrieved from PolyA-miner.
Note: all analyses performed here were repeated for the GTEx v9 long-read data, which are available under controlled access via dbGaP accession phs000424.v9 and on AnVIL. See the manuscript and orthogonal_for_GTEx.txt for details.
################################################################################
# Directory Structure and File Paths
################################################################################
# Primary data directories
DIR_RAW=/path/to/raw_reads # Raw sequencing data
DIR_TRIM=/path/to/trimmed_reads # Quality-trimmed reads
DIR_ALIGN=/path/to/alignment_files # STAR alignment outputs
DIR_CAGE=/path/to/processed_cage-seq_data # CAGE-seq processing results
DIR_DNAse=/path/to/ENCODE_DNase-seq_narrowPeaks # ENCODE DNase-seq peaks
DIR_ORTHOGONAL=/path/to/orthogonal_datasets_for_sqanti3 # Final SQANTI3 inputs
# Processing and temporary directories
DIR_INDEX=/path/to/preprocessing_indices_directory # Reference genome indices
DIR_FMLRC2=/path/to/corrected_reads_directory # FMLRC2 error correction
DIR_REPORTS=/path/to/reports_directory # QC and processing reports
DIR_METRICS=/path/to/metrics_directory # Alignment and trimming metrics
# Reference genome and annotation files
GENOME_FASTA=/path/to/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna # hg38 genome
REFERENCE_GTF=/path/to/gencode_v45_annotation.gtf # GENCODE v45Purpose: Generate STAR genome indices optimized for RNA-seq alignment to hg38 reference genome.
- star v2.7.4a
# Generate STAR genome index for hg38 reference
# Uses default parameters optimized for human genome size and complexity
STAR --runThreadN ${THREADS} --runMode genomeGenerate \
--genomeDir ${DIR_INDEX}/star-2.7.4a_GCA_000001405.15_GRCh38_no_alt_analysis_set \
--genomeFastaFiles ${GENOME_FASTA}
GENOME_INDEX_STAR=${DIR_INDEX}/star-2.7.4a_GCA_000001405.15_GRCh38_no_alt_analysis_setPurpose: Process short-read RNA-seq data from 200 placental samples to generate:
- High-quality splice junction annotations for long-read validation
- Read coverage data for transcript support validation
- Error-corrected reads for FMLRC2 long-read polishing
- samtools v1.9
- fastp v0.23.2
- star v2.7.4a
# Define sample identifiers
LIBIDs=() # SampleIDs as in .fastq file names, e.g. Sample01.fastq.gzFor each $LIBID n in ${LIBIDs[1...n]}
# Trim adapters and low-quality bases using fastp
# Automatic adapter detection and quality filtering
fastp \
-i ${DIR_RAW}/${LIBID}_1.fastq.gz -o ${DIR_TRIM}/${LIBID}_1.fastq.gz \
-I ${DIR_RAW}/${LIBID}_2.fastq.gz -O ${DIR_TRIM}/${LIBID}_2.fastq.gz \
-w ${THREADS} Two-pass mode benefits:
- First pass: Discovers novel splice junctions
- Second pass: Re-aligns reads using discovered junctions for improved accuracy
For each $LIBID n in ${LIBIDs[1...n]}
# STAR alignment with ENCODE-recommended parameters
# Optimized for splice junction discovery and accurate alignment
STAR --genomeDir ${GENOME_INDEX_STAR} \
--readFilesIn ${DIR_TRIM}/${LIBID}_1.fastq.gz ${DIR_TRIM}/${LIBID}_2.fastq.gz \
--readFilesCommand zcat --runThreadN ${THREADS} \
--genomeLoad NoSharedMemory --outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 --alignIntronMax 1000000 \
--alignMatesGapMax 1000000 --outSAMheaderHD @HD VN:1.4 SO:coordinate \
--outSAMunmapped Within --outFilterType BySJout \
--outSAMattributes NH HI AS NM MD --outSAMtype BAM SortedByCoordinate \
--sjdbScore 1 --outTmpDir ${DIR_TEMP}/${LIBID} \
--outFileNamePrefix ${DIR_ALIGN}/${LIBID} \
--outBAMsortingBinsN 200 \
--limitBAMsortRAM 80000000000 \
--twopassMode Basic
# Index BAM files for efficient downstream processing
samtools index ${DIR_ALIGN}/${LIBID}Aligned.sortedByCoord.out.bamExtract high-quality aligned reads for FMLRC2 error correction:
# Extract only properly mapped reads (FLAG 4 = unmapped)
# Convert to FASTQ for FMLRC2 BWT construction
samtools fastq -F 4 -@ 12 ${DIR_ALIGN}/GUSTO/${LIBID}Aligned.sortedByCoord.out.bam \
-1 ${DIR_ALIGN}/GUSTO/${LIBID}_1.fastq.gz \
-2 ${DIR_ALIGN}/GUSTO/${LIBID}_2.fastq.gzPurpose: Aggregate splice junction support across all samples for transcript validation.
📜 Performed in R with merge_SJs.R.
Purpose: Build Burrows-Wheeler Transform (BWT) index from short-read data for long-read error correction. FMLRC2 uses short-read k-mers to correct sequencing errors in long reads while preserving splice junction information.
- samtools v1.9
- fmlrc2 v0.1.8
- ropebwt2
# Extract all short-read sequences for BWT construction
echo "Decompressing, combining, and isolating short reads"
gunzip -c ${DIR_ALIGN}/GUSTO/*.fastq.gz | \
awk 'NR % 4 == 2' | \
tr NT TN > ${DIR_FMLRC2}/TEMP/short-reads.seq
# Build compressed BWT index for efficient k-mer lookup
echo "Building BWT"
ropebwt3 build -t ${THREADS} -L -R ${DIR_FMLRC2}/TEMP/short-reads.seq | \
tr NT TN | \
fmlrc2-convert ${DIR_INDEX}/msbwt_GUSTO.npyPurpose: Process CAGE-seq data to identify authentic transcription start sites. CAGE (Cap Analysis Gene Expression) specifically captures the 5' ends of capped mRNAs, providing precise TSS annotations.
Data sources: FANTOM5 Project placenta samples
- Tissue: N=1
- Cell types: pericyte (N=2), epithelial (N=4), trophoblast (N=1)
Technical note: DRR accessions were uploaded without quality scores, requiring separate processing from SRR accessions.
- cutadapt v4.1
- samtools v1.9
- star v2.7.4a
- bedtools v2.31.1
- sra-tools
# FANTOM5 placenta CAGE-seq accessions
CAGE_IDS=(DRR009582 SRR530926 SRR530927 SRR530924 SRR530925 \
DRR009276 DRR009277 DRR009278 DRR008721)
for ACCESSION in "${CAGE_IDS[@]}"
do
prefetch $ACCESSION
fasterq-dump $ACCESSION
doneDRR_IDS=(DRR009582 DRR009276 DRR009277 DRR009278 DRR008721)For each $FILE n in ${DRR_IDs[1...n]}
# Convert FASTQ to FASTA (no quality scores available)
sed -n '1~4s/^@/>/p;2~4p' ${DIR_RAW}/${FILE}.fastq > ${DIR_RAW}/${FILE}.fasta
# Trim EcoP15 linker sequences (CAGE-seq library preparation artifact)
# Pattern: CAGCAG...TCGTATGCCGTCTTC (variable length middle section)
cutadapt -a ^CAGCAG...TCGTATGCCGTCTTC \
--match-read-wildcards --minimum-length 15 \
--cores=${THREADS} \
-o ${DIR_TRIM}/${FILE}_atrimmed.fasta ${DIR_RAW}/${FILE}.fasta \
> ${DIR_METRICS}/${FILE}_atrimmed.metrics.txt
rm ${DIR_RAW}/${FILE}.fasta
# Trim 5'poly-G sequences (common CAGE-seq artifact)
# These can interfere with accurate TSS mapping
cutadapt -g ^G -e 0 --match-read-wildcards \
--cores=${THREADS} \
-o TRIM/${FILE}_gtrimmed.fasta TRIM/${FILE}_atrimmed.fasta \
> METRICS/${FILE}_gtrimmed.metrics.txt SRR_IDS=(SRR530926 SRR530927 SRR530924 SRR530925)For each $FILE n in ${SRR_IDs[1...n]}
# Trim EcoP15 linker sequences
cutadapt -a ^CAGCAG...TCGTATGCCGTCTTC \
--match-read-wildcards --minimum-length 15 \
--cores=${THREADS} \
-o ${DIR_TRIM}/${FILE}_atrimmed.fastq ${DIR_RAW}/${FILE}.fastq \
> ${DIR_METRICS}/${FILE}_atrimmed.metrics.txt
# Trim 5'poly-G sequences
cutadapt -g ^G -e 0 --match-read-wildcards \
--cores=${THREADS} \
-o ${DIR_TRIM}/${FILE}_gtrimmed.fastq ${DIR_TRIM}/${FILE}_atrimmed.fastq \
> ${DIR_METRICS}/${FILE}_gtrimmed.metrics.txt For each $FILE n in ${FILES[1...n]}
# STAR alignment optimized for short CAGE tags
# More permissive settings due to short read length (15-50bp typical)
STAR --genomeDir ${GENOME_INDEX_STAR} \
--readFilesIn ${DIR_TRIM}/${FILE}_gtrimmed.fasta \
--runThreadN ${THREADS} \
--outSAMtype BAM SortedByCoordinate \
--genomeLoad NoSharedMemory --outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 --alignIntronMax 1000000 \
--alignMatesGapMax 1000000 --outSAMheaderHD @HD VN:1.4 SO:coordinate \
--outSAMunmapped Within --outFilterType BySJout \
--outSAMattributes NH HI AS NM MD \
--sjdbScore 1 \
--outBAMsortingBinsN 200 \
--limitBAMsortRAM 80000000000 \
--twopassMode Basic \
--outTmpDir ${DIR_CAGE}/TEMP/${FILE} \
--outFileNamePrefix ${DIR_CAGE}/${FILE}Purpose: Convert aligned CAGE reads to precise TSS coordinates with read count support.
For each $FILE n in ${FILES[1...n]}
# Filter for high-quality alignments only
# Q30 = 99.9% accuracy, essential for precise TSS mapping
samtools view -@ 8 -F 4 -q 30 -u ${DIR_CAGE}/${FILE}Aligned.sortedByCoord.out.bam \
> ${DIR_CAGE}/${FILE}Aligned.sortedByCoord.out_filtered.bam
rm ${DIR_CAGE}/${FILE}Aligned.sortedByCoord.out.bam
# Convert BAM to BED format for easier processing
bamToBed -i ${DIR_CAGE}/${FILE}Aligned.sortedByCoord.out_filtered.bam \
> ${DIR_CAGE}/${FILE}Aligned.sortedByCoord.out_filtered.bed
# Generate CTSS on positive strand
# Use 5' end of read (start position) as TSS
mkdir ${DIR_CAGE}/CTSS
awk 'BEGIN{OFS="\t"}{if($6=="+"){print $1,$2,$5}}' \
${DIR_CAGE}/${FILE}Aligned.sortedByCoord.out_filtered.bed \
| sort -k1,1 -k2,2n \
| groupBy -i stdin -g 1,2 -c 3 -o count \
| awk -v x="$FILE" 'BEGIN{OFS="\t"}{print $1,$2,$2+1, x ,$3,"+"}' >> \
${DIR_CAGE}/CTSS/${FILE}.pos_ctss.bed
# Generate CTSS on negative strand
# Use 3' end of read (end position) as TSS for antisense transcripts
awk 'BEGIN{OFS="\t"}{if($6=="-"){print $1,$3,$5}}' \
${DIR_CAGE}/${FILE}Aligned.sortedByCoord.out_filtered.bed \
| sort -k1,1 -k2,2n \
| groupBy -i stdin -g 1,2 -c 3 -o count \
| awk -v x="$FILE" 'BEGIN{OFS="\t"}{print $1,$2-1,$2, x ,$3,"-"}' >> \
${DIR_CAGE}/CTSS/${FILE}.neg_ctss.bed
# Combine both strands and sort by genomic coordinates
cat ${DIR_CAGE}/CTSS/${FILE}.pos_ctss.bed ${DIR_CAGE}/CTSS/${FILE}.neg_ctss.bed \
| sort -k1,1 -k2,2n > ${DIR_CAGE}/CTSS/${FILE}.ctss.bed
# Clean up intermediate files
rm ${DIR_CAGE}/${FILE}Aligned.sortedByCoord.out_filtered.bed
${DIR_CAGE}/CTSS/${FILE}.pos_ctss.bed
${DIR_CAGE}/CTSS/${FILE}.neg_ctss.bedPurpose: Combine CTSS from all samples and cluster nearby tags into coherent TSS regions.
Performed with CAGEr in R to generate ${DIR_CAGE}/FANTOM5_placenta_CAGE-seq.bed.
Purpose: Integrate multiple lines of evidence for active transcription start sites:
- CAGE-seq: Direct measurement of capped mRNA 5' ends
- DNase-seq: Open chromatin regions indicating accessible promoters
Rationale: Combining both datasets increases sensitivity for TSS detection while maintaining specificity.
DNase-seq peaks of placental tissue (N = 22) retrieved from ENCODE
# Merge DNase-seq peaks across all 22 placental samples
# Remove overlapping regions to create non-redundant peak set
cat ${DIR_DNAse}/*.bed \
| bedtools sort -i - \
| bedtools merge -i - > ${DIR_DNAse}/ENCODE_placenta_DNase-seq.bed
# Combine CAGE-seq & DNase-seq peaks for comprehensive TSS annotation
# This creates a union of both evidence types
cat ${DIR_CAGE}/FANTOM5_placenta_CAGE-seq.bed \
${DIR_DNAse}/ENCODE_placenta_DNase-seq.bed \
| bedtools sort -i - \
| bedtools merge -i - > ${DIR_ORTHOGONAL}/placenta_TSS.bedPurpose: Integrate multiple sources of transcription termination site evidence:
- SAPAS: Sequencing-based poly(A) site identification from tissue samples
- PolyA_DB: Curated database of poly(A) motifs and cleavage sites
Rationale: Combining experimental and curated data provides comprehensive TTS annotation.
- SAPAS peaks in placenta tissue were retrieved from APASdb
- Human polyA motifs annotated in PolyA_DB were retrieved from PolyA-miner
# Combine SAPAS experimental data with curated poly(A) motifs
# Creates comprehensive TTS annotation for transcript 3' end validation
cat ${DIR_ORTHOGONAL}/APASdb_placenta_SAPAS.bed \
${DIR_ORTHOGONAL}/PolyA_DB_motifs.bed \
| bedtools sort -i - \
| bedtools merge -i - > ${DIR_ORTHOGONAL}/placenta_TTS.bed