Repli-seq Processing Pipeline

Overview

The 4DN Repli-seq data processing pipeline includes read clipping, alignment, filtering, and aggregation. Downstream normalization, smoothing and replicate merging steps will be implemented in the near future.

Read Clipping

Adaptor sequences are clipped from Repli-seq reads using cutadapt version 1.14. Specifically, we run:

cutadapt -q 0 -O 1 -m 0 -a <adaptor> <fastq>
  • The -q 0 turns off low-quality base removal, which would otherwise run before the adaptor search.
  • The -O 1 sets the minimum required overlap between the read end and the adaptor to 1 (default is 3), so that an adaptor that only partially overlaps the end of a read, rather than being fully contained in it, is still clipped.
  • The -m 0 means that empty reads are kept and will appear in the output.

AGATCGGAAGAGCACACGTCTG is used as the adaptor sequence.
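
To make the effect of -O 1 and -m 0 concrete, here is a minimal Python sketch of 3' adaptor clipping. It is not cutadapt's algorithm: it assumes exact matches only (cutadapt also tolerates a fraction of mismatches), and the function name clip_3prime is hypothetical.

```python
ADAPTOR = "AGATCGGAAGAGCACACGTCTG"  # adaptor sequence given in this document


def clip_3prime(read: str, adaptor: str = ADAPTOR, min_overlap: int = 1) -> str:
    """Remove a 3' adaptor and everything after it (exact matches only).

    The adaptor may be fully contained in the read, or only its first
    bases may overlap the read end; min_overlap plays the role of -O.
    """
    for start in range(len(read)):
        # Number of adaptor bases that can be compared at this position.
        overlap = min(len(adaptor), len(read) - start)
        if overlap < min_overlap:
            break  # remaining suffixes are too short to count as a match
        if read[start:start + overlap] == adaptor[:overlap]:
            return read[:start]
    return read


full = clip_3prime("ACGTACGT" + ADAPTOR + "TTT")   # adaptor inside the read
partial = clip_3prime("ACGTACGTAGAT")              # only 4 adaptor bases at the end
one_base = clip_3prime("CCCA", min_overlap=1)      # single-base overlap is clipped
default_o = clip_3prime("CCCA", min_overlap=3)     # left untouched with overlap 3
empty = clip_3prime(ADAPTOR)                       # read is all adaptor
```

With a minimum overlap of 1, even a single trailing base that matches the start of the adaptor is removed, and a read consisting only of adaptor becomes empty; under -m 0 such an empty read would still be written to the output.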

Alignment

Clipped Repli-seq reads are mapped to the GRCh38 (human) or mm10 (mouse) reference genome using bwa version 0.7.15. Specifically, we run bwa mem with default options:

bwa mem <genome_index> <fastq1> <fastq2>

Filtering

For filtering valid Repli-seq alignments, we use samtools. Specifically, the filtering workflow consists of the following steps:

  • MAPQ filtering: samtools view with -q 20 skips alignments with MAPQ smaller than 20.
  • Sorting: samtools sort sorts the alignments by genomic coordinate.
  • Removal of PCR duplicates: samtools rmdup removes duplicate alignments.
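
The three filtering steps can be illustrated with a small Python model. This is a sketch under simplifying assumptions, not a reimplementation of samtools: alignments are plain tuples, chromosomes are sorted lexicographically rather than in header order, and duplicates are taken to be alignments that share chromosome, position and strand, of which the one with the highest MAPQ is kept. The names Aln and filter_alignments are hypothetical.

```python
from typing import NamedTuple


class Aln(NamedTuple):
    chrom: str
    pos: int
    strand: str
    mapq: int


def filter_alignments(alns, min_mapq=20):
    """MAPQ filter, coordinate sort, then duplicate removal (simplified)."""
    # Step 1: drop alignments with MAPQ below the threshold (like view -q 20).
    kept = [a for a in alns if a.mapq >= min_mapq]
    # Step 2: sort by coordinate.
    kept.sort(key=lambda a: (a.chrom, a.pos))
    # Step 3: among alignments at the same position and strand, keep the best.
    best = {}
    for a in kept:
        key = (a.chrom, a.pos, a.strand)
        if key not in best or a.mapq > best[key].mapq:
            best[key] = a
    return sorted(best.values(), key=lambda a: (a.chrom, a.pos))


example = [
    Aln("chr1", 500, "+", 60),
    Aln("chr1", 100, "+", 10),   # removed: MAPQ < 20
    Aln("chr1", 100, "+", 30),   # removed: duplicate with lower MAPQ
    Aln("chr1", 100, "+", 40),
    Aln("chr1", 100, "-", 25),   # kept: opposite strand, not a duplicate
]
result = filter_alignments(example)
```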

Binning and Aggregation

Filtered reads are counted in each 5 kb window using bedtools coverage. Specifically, we run:

bedtools coverage -counts -sorted -a <BINFILE> -b <INPUT_BAM>
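
As an illustration of what per-window counting produces, the following Python sketch counts reads in fixed 5 kb windows. It assumes reads are given as 0-based, half-open (chrom, start, end) intervals as in BED, and that a read is counted in every window it overlaps by at least one base. The function name count_per_window is hypothetical.

```python
from collections import Counter

BIN_SIZE = 5000  # 5 kb windows, as used in this document


def count_per_window(reads, bin_size=BIN_SIZE):
    """Count reads overlapping each fixed-size window.

    Returns a Counter keyed by (chrom, window_start).
    """
    counts = Counter()
    for chrom, start, end in reads:
        first = start // bin_size
        last = (end - 1) // bin_size  # end is exclusive
        for b in range(first, last + 1):
            counts[(chrom, b * bin_size)] += 1
    return counts


reads = [
    ("chr1", 100, 150),     # falls entirely in window 0-5000
    ("chr1", 4990, 5040),   # spans the boundary, counted in both windows
    ("chr1", 5000, 5050),   # falls entirely in window 5000-10000
]
counts = count_per_window(reads)
```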

Output is provided in both gzipped bedgraph and bigwig formats and can be viewed using HiGlass.

As of v16.1, the pipeline output includes a raw counts file in addition to the default scaled counts (RPKM).
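
The exact scaling used by the pipeline is not spelled out here. The sketch below assumes the standard RPKM definition (reads per kilobase of window per million reads) and assumes that the total is the number of filtered reads; the function name rpkm is hypothetical.

```python
def rpkm(count, window_bp, total_reads):
    """Standard RPKM: count / (window length in kb) / (total reads in millions)."""
    return count * 1e9 / (window_bp * total_reads)


# 50 reads in a 5 kb window, out of 2 million filtered reads in total.
scaled = rpkm(50, 5000, 2_000_000)
```

Under this definition a window's scaled value does not depend on sequencing depth, which is what makes samples of different depth comparable, while the raw counts file preserves the unscaled numbers.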

Source files

The pipeline components are pre-installed in a publicly available Docker image (4dndcic/4dn-repliseq:v16.1) on Docker Hub. The source code for the Docker image and pipeline description in Common Workflow Language (CWL) can be found on GitHub.

  • Latest version (v16.1)
    • Workflow metadata : https://data.4dnucleome.org/workflows/622bdf75-2dd1-457f-ad78-d4cd128f8f5b/
    • CWL : https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v16.1/cwl
    • Docker : https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v16.1
  • Older versions
    • v16
      • Workflow metadata : https://data.4dnucleome.org/workflows/2a6807f1-93db-4c7b-b148-672534193974/
      • CWL : https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v16/cwl
      • Docker : https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v16
    • v14
      • Workflow metadata : https://data.4dnucleome.org/workflows/4459a4d8-1bd8-4b6a-b2cc-2506f4270a34/
      • CWL : https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v14/cwl
      • Docker : https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v14
    • v13.1
      • Workflow metadata : https://data.4dnucleome.org/workflows/146da22a-502d-4500-bf57-a7cf0b4b2364/
      • CWL : https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v13.1/cwl
      • Docker : https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v13.1