ATAC-seq Processing Pipeline

Overview

The 4DN ATAC-seq data processing pipeline uses the ENCODE ATAC-seq pipeline v1.1.1. We have modified the logistics of the pipeline execution without changing the content of the pipeline.

We have split the pipeline into two sub-pipelines: 1) alignment and filtering and 2) peak calling. A quality control report is generated at each step.

In certain cases, we add a replicate-merging step between the two steps (this step is not part of the ENCODE pipeline).

  • If an experiment set has more than one biological replicate and more than one technical replicate within a biological replicate, we merge the technical replicates.

We perform the basic QC that comes with the ATAC-seq pipeline, but we do not perform ATAQC, an additional QC that requires a set of reference files that do not yet have an official release or documentation. That is, we run the pipeline with the flag atac.disable_ataqc=True.

For more detail, see the description and documentation from ENCODE.

Alignment and filtering

The first step is run on the fastq files that correspond to a single technical replicate (a single technical replicate may contain multiple sequencing replicates).

Reads are aligned to the reference genome with bowtie2 and filtered. The output (tagAlign) is a set of read positions in gzipped bed format.
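Because the tagAlign output is gzipped bed, it can be inspected with a few lines of Python. The following is a minimal sketch, not part of the pipeline itself: the function name count_reads_per_chrom is hypothetical, and it assumes tab-separated lines whose first three columns are chromosome, start, and end, as in standard bed files.

```python
import gzip
from collections import Counter


def count_reads_per_chrom(tagalign_path):
    """Count read positions per chromosome in a gzipped tagAlign (bed) file.

    Assumes tab-separated lines whose first three columns are
    chromosome, start, and end (standard bed layout).
    """
    counts = Counter()
    with gzip.open(tagalign_path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            chrom, start, end = line.split("\t")[:3]
            if int(end) <= int(start):
                raise ValueError(f"invalid interval: {line!r}")
            counts[chrom] += 1
    return counts
```

Calling count_reads_per_chrom on a tagAlign file returns a Counter keyed by chromosome name, which gives a quick sanity check on how many filtered reads survived for each chromosome.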

This step is equivalent to running the ENCODE ATAC-seq pipeline with the parameter align_only=True.
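As an illustration, the two settings named in this document could be collected into a WDL-style input JSON as sketched below. This is a hedged example rather than the actual 4DN input file: the helper name build_alignment_inputs is hypothetical, the "atac." prefix on align_only is an assumption modeled on the atac.disable_ataqc flag mentioned in the overview, and the fastq and genome inputs that a real run also needs are deliberately left out.

```python
import json


def build_alignment_inputs():
    """Return only the two settings this document names for the first step.

    The "atac." prefix on align_only is an assumption; fastq and genome
    inputs required by a real run are intentionally omitted.
    """
    return {
        "atac.align_only": True,     # stop after alignment and filtering
        "atac.disable_ataqc": True,  # skip the additional ATAQC step
    }


# Serialize to the JSON form that WDL runners typically accept as inputs.
print(json.dumps(build_alignment_inputs(), indent=2))
```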

A quality control report is linked from the main output tagAlign file.

A more detailed description of this step can be found at: Workflow graph and metadata

Merging

In some cases, replicates are merged after the first step. The merging rules are as follows.

  • If an experiment set has more than 1 biological replicate, and each biological replicate has more than 1 technical replicate, then the technical replicates are merged.
  • If an experiment set has 1 biological replicate, the technical replicates are treated as biological replicates in the subsequent step.
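
The two rules above can be sketched as a small planning function. This is an illustrative sketch under stated assumptions, not the 4DN implementation: the function name plan_replicates and the input layout (a dict mapping each biological replicate to its list of tagAlign files, one per technical replicate) are hypothetical. The rules do not spell out the case where some biological replicates have only one technical replicate; here such a file is simply passed through unmerged, which is an assumption.

```python
def plan_replicates(experiment_set):
    """Group tagAlign files into the replicates used for peak calling.

    experiment_set maps a biological replicate id to a list of tagAlign
    files, one per technical replicate (hypothetical layout).
    Returns a list of groups; a group with more than one file must be
    merged before peak calling.
    """
    if not experiment_set:
        raise ValueError("experiment set has no replicates")
    bioreps = sorted(experiment_set)
    if len(bioreps) == 1:
        # Single biological replicate: each technical replicate is
        # treated as its own replicate, so nothing is merged.
        return [[path] for path in experiment_set[bioreps[0]]]
    # Several biological replicates: technical replicates are merged
    # within each biological replicate (one group per biological replicate).
    return [list(experiment_set[biorep]) for biorep in bioreps]
```

For example, plan_replicates({"1": ["a.gz", "b.gz"], "2": ["c.gz", "d.gz"]}) yields two groups of two files each (two merges), whereas plan_replicates({"1": ["a.gz", "b.gz"]}) yields two single-file groups and no merging.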

A more detailed description of this step can be found at: Workflow graph and metadata

Peak calling and Quality Report Generation

Using the tagAlign files obtained from the earlier step, a signal fold change track (in bigwig format) is calculated using MACS2. Peaks are also called using MACS2, and two final call sets (optimal peaks and conservative peaks, in bigbed format) are reported after applying an overlap method. A quality control report is linked from the output signal fold change bigwig file.

A more detailed description of this step can be found at: Workflow graph and metadata

Source

The pipeline components are pre-installed in a publicly available Docker image on Docker Hub (4dn-dcic/encode-atacseq:v1), which is identical to the ENCODE Docker image (quay.io/encode-dcc/atac-seq-pipeline:v1.1.1). The pipeline structure is described in Workflow Description Language (WDL) and has been modified from the original ENCODE WDL. The source code for the Docker image and the WDL code can be found on GitHub.

Latest runs

  • 4DN WDL/Docker : https://github.com/4dn-dcic/atac-seq-pipeline
  • merging (CWL) : https://github.com/4dn-dcic/docker-4dn-mergebed/tree/v1
    • This step is 4DN-specific and is not a part of the ENCODE Docker/WDL. The corresponding Docker image is 4dn-dcic/4dn-mergebed:v1
  • Original ENCODE WDL/Docker : https://github.com/ENCODE-DCC/atac-seq-pipeline