ChIP-seq Processing Pipeline

Overview

The 4DN ChIP-seq data processing pipeline uses the ENCODE ChIP-seq pipeline v2.1.6. We have modified the logistics of the pipeline execution without changing the content of the pipeline.

We have split the pipeline into three sub-pipelines: 1) alignment and filtering for ChIP, 2) alignment and filtering for control input, and 3) peak calling. A quality control report is generated at each step.

For more detail, see the description and documentation from ENCODE.

Alignment and filtering for ChIP

The first step is run on the fastq files that correspond to a single technical replicate (a single technical replicate may contain multiple sequencing replicates).

Reads are aligned to the reference genome with bwa and filtered. The output (tagAlign) is a set of read positions in gzipped BED format.
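To make the tagAlign output concrete, the following is a minimal sketch of reading such a file in Python. It assumes the six-column BED layout (chromosome, start, end, name, score, strand) commonly used for ENCODE tagAlign files. The function and record names are hypothetical and are not part of the pipeline.

```python
import gzip
from typing import Iterator, NamedTuple


class TagAlignRead(NamedTuple):
    """One read position from a tagAlign file (illustrative record type)."""
    chrom: str
    start: int   # 0-based start, as in BED
    end: int     # end coordinate, exclusive
    strand: str  # "+" or "-"


def read_tagalign(path: str) -> Iterator[TagAlignRead]:
    """Yield one record per line of a gzipped, tab-separated tagAlign file.

    Assumes six BED columns: chrom, start, end, name, score, strand.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            chrom, start, end, _name, _score, strand = fields[:6]
            yield TagAlignRead(chrom, int(start), int(end), strand)
```

For example, `sum(1 for _ in read_tagalign("rep1.tagAlign.gz"))` would count the reads in a file; the file name here is a placeholder.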

Additionally, a tagAlign file is generated from unfiltered data using only R1 (treated as single-ended), for use in a later quality control step.

This step is equivalent to running the ENCODE ChIP-seq pipeline with parameter align_only=True, using only ChIP data without control input.
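As an illustration of what an alignment-only run might look like, here is a minimal sketch that builds an input JSON for one replicate. The key names ("chip.align_only", "chip.paired_end", "chip.genome_tsv", "chip.fastqs_rep1_R1", "chip.fastqs_rep1_R2") follow the ENCODE input JSON conventions as an assumption; they should be checked against the ENCODE input documentation for the version in use, and the modified 4DN WDL may name its inputs differently. The helper function and all file names are hypothetical.

```python
import json


def build_align_only_inputs(fastqs_r1, fastqs_r2=None, genome_tsv="genome.tsv"):
    """Build an illustrative alignment-only input dictionary for one replicate.

    Key names are assumed from ENCODE conventions, not taken from the 4DN WDL.
    """
    inputs = {
        "chip.align_only": True,                  # stop after alignment and filtering
        "chip.paired_end": fastqs_r2 is not None,
        "chip.genome_tsv": genome_tsv,            # placeholder path
        "chip.fastqs_rep1_R1": list(fastqs_r1),   # one entry per sequencing replicate
    }
    if fastqs_r2 is not None:
        inputs["chip.fastqs_rep1_R2"] = list(fastqs_r2)
    return inputs


if __name__ == "__main__":
    example = build_align_only_inputs(
        ["run1_R1.fastq.gz", "run2_R1.fastq.gz"],
        ["run1_R2.fastq.gz", "run2_R2.fastq.gz"],
    )
    print(json.dumps(example, indent=2))
```

Listing several fastq files under one replicate mirrors the case where a single technical replicate contains multiple sequencing replicates.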

A quality control report is linked from the main output tagAlign file.

A more detailed description of this step can be found at: Workflow graph and metadata

Alignment and filtering for control input

Reads are aligned to the reference genome with bwa and filtered. The output (tagAlign) is a set of read positions in gzipped BED format.

This step is equivalent to running the ENCODE ChIP-seq pipeline with parameter align_only=True, but with only control input data.

A quality control report is linked from the output tagAlign file.

A more detailed description of this step can be found at: Workflow graph and metadata

Peak calling and Quality Report Generation

Using tagAlign files from the ChIP and control input, a signal fold change track (in bigwig format) is calculated using MACS2. Peaks are called using either SPP (TF) or MACS2 (histone), and two final call sets (optimal peaks and conservative peaks, in bigbed format) are reported after applying either an IDR (TF) or an overlap (histone) method. When there is no control input to use, MACS2 is used for both TF and histone types.
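The choice of peak caller and replicate-consistency method described above can be summarized as a small decision function. This is only a sketch of the rules as stated in this section, not code from the pipeline. The function and type names are hypothetical, and it assumes that the IDR/overlap choice still follows the experiment type when no control is available, which this section does not state explicitly.

```python
from typing import NamedTuple


class PeakCallingPlan(NamedTuple):
    peak_caller: str   # "spp" or "macs2"
    thresholding: str  # "idr" or "overlap"


def plan_peak_calling(pipeline_type: str, has_control: bool) -> PeakCallingPlan:
    """Pick a peak caller and thresholding method for "tf" or "histone" data."""
    if pipeline_type not in ("tf", "histone"):
        raise ValueError(f"unknown pipeline type: {pipeline_type!r}")

    # Assumption: IDR for TF and overlap for histone, with or without control.
    thresholding = "idr" if pipeline_type == "tf" else "overlap"

    # SPP is used only for TF data with a control; otherwise MACS2.
    if pipeline_type == "tf" and has_control:
        peak_caller = "spp"
    else:
        peak_caller = "macs2"
    return PeakCallingPlan(peak_caller, thresholding)
```

For instance, `plan_peak_calling("tf", has_control=False)` selects MACS2, matching the no-control rule above.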

A third input tagAlign file, generated from unfiltered ChIP data using only R1 (treated as single-ended), is used to calculate cross-correlation. A quality control report is linked from the output signal fold change bigwig file.
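For readers unfamiliar with the cross-correlation metric, the sketch below illustrates the general idea of strand cross-correlation on a single chromosome: the Pearson correlation between plus-strand and minus-strand 5' end counts, computed at a range of strand shifts. It is a simplified illustration under stated assumptions, not the tool the pipeline runs; the function names and the toy positions in the usage note are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cross_correlation_profile(plus_positions, minus_positions, length, max_shift):
    """Return {shift: correlation} for shifts 0..max_shift on one chromosome.

    plus_positions / minus_positions are 0-based 5' end coordinates of reads
    on the + and - strands; length is the chromosome length (max_shift < length).
    """
    plus = [0] * length
    minus = [0] * length
    for pos in plus_positions:
        plus[pos] += 1
    for pos in minus_positions:
        minus[pos] += 1

    profile = {}
    for shift in range(max_shift + 1):
        # Compare plus[i] with minus[i + shift] over the overlapping region.
        profile[shift] = pearson(plus[: length - shift], minus[shift:])
    return profile
```

With toy data in which every minus-strand read sits 20 bases downstream of a plus-strand read, for example `cross_correlation_profile([10, 50, 90], [30, 70, 110], 200, 40)`, the profile peaks at a shift of 20, which is how the metric reflects the typical fragment length.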

A more detailed description of this step can be found at: Workflow graph and metadata

Source files

The pipeline components are pre-installed in a publicly available Docker image on Docker Hub (4dn-dcic/encode-chipseq:v2.1.6), which is adapted from the ENCODE Docker image (quay.io/encode-dcc/chip-seq-pipeline:v2.1.6). The pipeline structure is described in Workflow Description Language (WDL) and has been modified from the original ENCODE WDL. The source code for the Docker image and the WDL code can be found on GitHub.

Latest runs:

  • 4DN WDL/Docker : https://github.com/4dn-dcic/chip-seq-pipeline2
  • Original ENCODE WDL/Docker : https://github.com/ENCODE-DCC/chip-seq-pipeline2

The Docker image for the previous version of the pipeline (4dn-dcic/encode-chipseq:v1.1.1) is identical to the ENCODE Docker image (quay.io/encode-dcc/chip-seq-pipeline:v1.1.1). The WDL for this version has also been modified from the original ENCODE WDL.