The 4DN Repli-seq data processing pipeline includes read clipping, alignment, filtering, and aggregation. Downstream normalization, smoothing and replicate merging steps will be implemented in the near future.
Overview
Read Clipping
Adaptor sequences are clipped from Repli-seq reads using cutadapt version 1.14. Specifically, we run:
cutadapt -q 0 -O 1 -m 0 -a <adaptor> <fastq>
- The -q 0 option turns off low-quality base removal before adaptor searching.
- The -O 1 option sets the minimum required overlap length between the read end and the adaptor to 1 (the default is 3), in case the adaptor sequence only partially overlaps the read end rather than being fully contained in the read.
- The -m 0 option means that empty reads are kept and appear in the output.
AGATCGGAAGAGCACACGTCTG is used as adaptor sequence.
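As an illustration, the clipping command with the adaptor sequence filled in could be wrapped as follows. This is a minimal sketch, not the pipeline's own script: the clip_reads function name, the DRY_RUN switch, and the file names are hypothetical.

```shell
# Adaptor sequence stated in this documentation.
ADAPTOR="AGATCGGAAGAGCACACGTCTG"

# clip_reads (hypothetical helper): clip adaptors from one FASTQ file and
# write the clipped reads to stdout.
# If DRY_RUN is set to a non-empty value, the command is printed, not run.
clip_reads() {
    fastq="$1"
    ${DRY_RUN:+echo} cutadapt -q 0 -O 1 -m 0 -a "$ADAPTOR" "$fastq"
}

# Usage (placeholder file names):
#   clip_reads sample.fastq > sample.clipped.fastq
```

Writing to stdout mirrors the command shown above, which does not name an output file.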
Alignment
Filtering
For filtering valid Repli-seq alignments, we use samtools. Specifically, the filtering workflow consists of the following steps:
- MAPQ filtering: the samtools view command with -q 20 is used to skip alignments with MAPQ smaller than 20.
- Sorting: the samtools sort command is used to sort alignments by genomic coordinates.
- Removal of PCR duplicates: the samtools rmdup command is used to remove duplicate alignments.
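The three filtering steps can be sketched as a small shell function. Only the -q 20 threshold comes from the description above; the samtools 1.x option syntax, the -b, -o and -s flags, the intermediate file names, and the run and filter_alignments helpers are assumptions made for illustration.

```shell
# run (hypothetical helper): print the command when DRY_RUN is non-empty,
# otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# filter_alignments (hypothetical helper).
# $1: input BAM, $2: output prefix (both placeholders).
filter_alignments() {
    in_bam="$1"
    prefix="$2"
    # 1. MAPQ filtering: skip alignments with MAPQ smaller than 20.
    run samtools view -b -q 20 -o "$prefix.q20.bam" "$in_bam" || return 1
    # 2. Sorting by genomic coordinates (samtools 1.x syntax assumed).
    run samtools sort -o "$prefix.sorted.bam" "$prefix.q20.bam" || return 1
    # 3. Duplicate removal; -s (single-end mode) is an assumption.
    run samtools rmdup -s "$prefix.sorted.bam" "$prefix.rmdup.bam" || return 1
}

# Usage (placeholder file names):
#   filter_alignments aligned.bam sample
```

Each step stops the function on failure, so a later step never runs on a missing or partial intermediate file.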
Binning and Aggregation
Filtered reads are aggregated for each 5 kb window using bedtools coverage. Specifically, the following command is used:
bedtools coverage -counts -sorted -a <BINFILE> -b <INPUT_BAM>
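The documentation does not say how <BINFILE> is produced. One possible way to build a BED file of fixed-width windows from a two-column chromosome sizes table is sketched below; the make_bins function and its input format are assumptions, not part of the pipeline.

```shell
# make_bins (hypothetical helper): read "<chrom> <size>" lines on stdin and
# print BED windows of a fixed width (default 5000 bp). The last window of
# each chromosome is truncated at the chromosome end.
make_bins() {
    awk -v w="${1:-5000}" 'BEGIN { OFS = "\t" }
    {
        for (start = 0; start < $2; start += w) {
            end = (start + w < $2) ? start + w : $2
            print $1, start, end
        }
    }'
}

# Usage (placeholder file names):
#   make_bins 5000 < chrom.sizes > bins.5kb.bed
```

Because the command uses -sorted, both the window file and the BAM file need to be position-sorted with chromosomes in a consistent order, so the sizes table should list chromosomes in the same order as the BAM file.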
Output is provided in both gzipped bedgraph and bigwig formats and can be viewed using HiGlass.
As of v16.1, the pipeline output includes a raw counts file in addition to the default scaled counts (RPKM).
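The relationship between raw and scaled counts can be illustrated with the standard RPKM definition (reads per kilobase of window per million reads). The sketch below assumes that the total read number is the sum of the counts in the file; the pipeline's exact scaling may differ, and the rpkm function name is hypothetical.

```shell
# rpkm (hypothetical helper): read "<chrom> <start> <end> <count>" lines on
# stdin and print the same windows with the count replaced by
#   count * 1e9 / (window length in bp * total count)
rpkm() {
    awk '
    { chrom[NR] = $1; s[NR] = $2; e[NR] = $3; c[NR] = $4; total += $4 }
    END {
        for (i = 1; i <= NR; i++) {
            len = e[i] - s[i]
            v = (total > 0 && len > 0) ? c[i] * 1e9 / (len * total) : 0
            printf "%s\t%d\t%d\t%.2f\n", chrom[i], s[i], e[i], v
        }
    }'
}

# Usage (placeholder file names):
#   rpkm < sample.counts.bedgraph > sample.rpkm.bedgraph
```

For example, a 5 kb window holding 10 of 40 total reads gets 10 * 1e9 / (5000 * 40) = 50000.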
Source files
The pipeline components are pre-installed in a publicly available Docker image (4dndcic/4dn-repliseq:v16.1) on Docker Hub. The source code for the Docker image and the pipeline description in Common Workflow Language (CWL) can be found on GitHub.
- Latest version (v16.1)
  - Workflow metadata: https://data.4dnucleome.org/workflows/622bdf75-2dd1-457f-ad78-d4cd128f8f5b/
  - CWL: https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v16.1/cwl
  - Docker: https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v16.1
- Older versions
  - v16
    - Workflow metadata: https://data.4dnucleome.org/workflows/2a6807f1-93db-4c7b-b148-672534193974/
    - CWL: https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v16/cwl
    - Docker: https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v16
  - v14
    - Workflow metadata: https://data.4dnucleome.org/workflows/4459a4d8-1bd8-4b6a-b2cc-2506f4270a34/
    - CWL: https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v14/cwl
    - Docker: https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v14
  - v13.1
    - Workflow metadata: https://data.4dnucleome.org/workflows/146da22a-502d-4500-bf57-a7cf0b4b2364/
    - CWL: https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v13.1/cwl
    - Docker: https://github.com/4dn-dcic/docker-4dn-repliseq/tree/v13.1