RNA-seq Processing Pipeline

Overview

The 4DN RNA-seq data processing pipeline uses the ENCODE RNA-seq pipeline v1.1. We have modified the logistics of the pipeline execution without changing the content of the pipeline, except we have excluded the Kallisto run which is a dispensible addition to the full pipeline based on STAR/RSEM. In addition, we modified MAD QC to handle more than two biological/technical replicates.

We have run the pipeline individually for each biological/technical replicate, merging only sequencing replicates at the fastq level. The output bam, bw and expression tsv files are generated for each biological/technical replicate. The entire pipeline was run as a single execution unit.

A more detailed description of the 4DN RNA-seq pipeline can be found at :

For more detail, see description/documentation from ENCODE.

Alignment

The first step is run on the fastq files that correspond to a single biological or technical replicate (which may contain multiple sequencing replicates). The fastq files may be either single-ended or paired-ended.

Reads are first merged and then aligned to both the reference genome and transcriptome with STAR (version 2.5.1b). The genome-mapped output (bam) is kept and used in the next step to create a read coverage bw files. The transcriptome-mapped output (bam) is passed on to another step that uses RSEM to quantify gene and isoform expression levels.

Expression quantification

Gene and isoform expression quantification is performed using RSEM (version 1.2.31), based on the transcriptome-aligned bam file passed on from the previous step. The output consists of two tsv files - gene_expression and isoform_expression.

Read coverage

Read coverage tracks are generated in the BigWig (bw) format from the genome-mapped bam file. For stranded RNA-seq data, two bw files are generated, plus_bw and minus_bw, each corresponding to one of the two genomic strands. For unstranded data, a single bw file is generated (outbw) that covers both strands of the genome.

Quality Control Metrics

Quality control metrics are calculated and linked from two output files - the genome-mapped bam file and the gene_expression tsv file. The latter includes quality control metrics for transcriptome-mapped bam file since we do not keep this bam file which was used for expression quantification.

Median Absolute Deviation (MAD) Quality Control Metric

Created independently from other quality control metrics, the MAD quality control metric is generated using gene_expression or isoform_expression tsv files from two or more biological/technical replicates (produced in the expression quantification step). The output is an html file summary containing various measures of the expression values—Spearman correlations, Pearson correlations, standard deviations and MAD of log ratios, and MA plots—for all pairs of tsv files.

Source files

The pipeline components are pre-installed in a publicly available Docker image on Docker Hub (4dndcic/encode-rnaseq:v1.1_dcic_2), which was created from the ENCODE docker image (quay.io/encode-dcc/rna-seq-pipeline:v1.1). The pipeline structure is described in Workflow Description Language (WDL) and has been modified from the original ENCODE WDL. The source code for the Docker image and the WDL code can be found on GitHub.

The 4DN modifications include:

  • Exclusion of Kallisto
  • Use of global cpu/mem variables
  • Specifying Docker image explicitly for each task
  • Accepting more than 2 replicates for MAD QC

Latest runs

  • 4DN WDL/Docker : https://github.com/4dn-dcic/rna-seq-pipeline
  • Original ENCODE WDL/Docker : https://github.com/ENCODE-DCC/rna-seq-pipeline

Strandedness test

In addition to the ENCODE RNA-seq pipeline, we added a single-step pre-check workflow that determines strandedness of the RNA-seq data, to double-confirm the reported strandedness. This step was run before running the RNA-seq pipeline.

There are three possible values for strandedness:

  • unstranded
  • stranded forward
  • stranded reverse

We determine this based on the ratio of a beta-actin sense strand 21-mer (S) to the antisense 21-mer (A) the fastq file.

  • single-end
  • S / (S+A) > 0.8 : stranded forward
  • S / (S+A) < 0.2 : stranded reverse
  • 0.2 < S / (S+A) < 0.8 : unstranded

  • paired-end

  • S / (S+A) < 0.2 in R2 and S / (S+A) > 0.8 in R1 : stranded forward
  • S / (S+A) < 0.2 in R1 and S / (S+A) > 0.8 in R2 : stranded reverse
  • 0.2 < S / (S+A) < 0.8 in both R1 and R2 : unstranded

The pipeline components are pre-installed in a publicly available Docker image on Docker Hub (4dndcic/4dn-rna-strandness:v2). The source code for CWL/Docker can be found at : https://github.com/4dn-dcic/docker-4dn-rna-strandedness