iMARGI Processing Pipeline

Overview

MARGI is a protocol for mapping RNA-DNA contacts genome-wide, analogous to Hi-C methods, which map DNA-DNA contacts. In situ MARGI (iMARGI) is the successor to MARGI and requires fewer input cells and less time.

The 4DN iMARGI data processing pipeline is adapted from the Zhong iMARGI pipeline. Its primary components are cleaning and alignment of reads, parsing of alignments into pairs, and merging and aggregation of pairs. To learn more about the original pipeline or experimental protocol, please refer to the iMARGI Pipeline documentation.

The primary modifications are:

  • An additional output pairs file from the parsing step
  • The addition of 4DN standard resolutions and modification of flag usage in the creation of cool files
  • Swapping of column order for DNA and RNA in cooler cload pairs
  • Additional test files and CWLs for running the pipeline

The iMARGI Docker, used in all steps of the pipeline, can be found at https://hub.docker.com/r/4dndcic/imargi/v1.1.1_dcic_4

Cleaning and Alignment

In iMARGI experiments, each RNA-end read begins with two random bases. To improve mapping, R1 reads are therefore cleaned using seqtk version 1.3. The command

seqtk trimfq -b 2

removes two bases (-b 2) from the left end of each read.
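
The effect of this trimming on a single read can be illustrated with a minimal Python sketch. It is not how seqtk is implemented; the function name trim_left and the example record are hypothetical, and any special handling seqtk applies to very short reads is ignored.

```python
def trim_left(record, n=2):
    """Drop the first n bases, and their quality scores, from a FASTQ record.

    `record` is a (header, sequence, separator, quality) tuple. This only
    mimics the effect of trimming n bases from the left end of a read.
    """
    header, seq, plus, qual = record
    return (header, seq[n:], plus, qual[n:])


# Hypothetical R1 read: the two leading bases stand in for the random bases.
record = ("@read1", "GTACGTACGT", "+", "!!IIIIIIII")
trimmed = trim_left(record)
print(trimmed[1])  # sequence without its first two bases
```

The sequence and quality strings are trimmed together so that they stay the same length, which FASTQ requires.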

Reads are then mapped to the GRCh38 (human) or mm10 (mouse) reference genome using bwa version 0.7.17. In particular, we run:

bwa mem -t <nthreads> -SP5M <genome_index> <fastq1> <fastq2>

  • The -SP option is used to ensure the results are equivalent to those obtained by running bwa mem on each mate separately, while retaining the right formatting for paired-end reads. This option skips a step in bwa mem that forces alignment of a poorly aligned read given an alignment of its mate, under the assumption that the two mates are part of a single genomic segment.
  • The -5 option is used to report the 5' portion of chimeric alignments as the primary alignment. For chimeric alignments, bwa mem reports two alignments: one is annotated as primary and soft-clipped, retaining the full length of the original sequence. The other is hard-clipped and marked as either 'supplementary' or 'secondary'.
  • The -M option is used to annotate the secondary/supplementary clipped reads as secondary rather than supplementary, for compatibility with some public software tools such as picard MarkDuplicates.
  • The -t option is used for multi-threading and should not affect the result.
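
The primary, secondary, and supplementary labels mentioned above are stored as bits in the FLAG field of each SAM record. The following Python sketch shows how those bits can be decoded; the bit values come from the SAM specification, while the function name alignment_kind and the example flags are illustrative only.

```python
SECONDARY = 0x100      # "not primary alignment" bit (SAM specification)
SUPPLEMENTARY = 0x800  # "supplementary alignment" bit (SAM specification)


def alignment_kind(flag):
    """Classify a SAM FLAG value as primary, secondary, or supplementary."""
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"


# 65 = paired (0x1) + first in pair (0x40); the other two add one bit each.
for flag in (65, 65 + SECONDARY, 65 + SUPPLEMENTARY):
    print(flag, alignment_kind(flag))
```

With -M, the clipped part of a chimeric alignment would carry the 0x100 bit rather than the 0x800 bit.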

Source files (v1.1.1_dcic_4):

  • Workflow: https://data.4dnucleome.org/workflows/4DNWFMRGIPA1/
  • CWL: https://github.com/4dn-dcic/iMARGI-Docker/blob/v1.1.1_dcic_4/src/cwl/imargi-processing-fastq.cwl

Parsing

Interaction pairs are parsed from the bam files using pairtools version 0.2.2. Filtering consists of several commands:

  • pairtools parse
      • Produces a pairsam file from an input bam file.
      • The pairsam file is a pairs file, listing one read pair per line, with additional columns that track the sam-file lines and a pairtools read classification.
      • These classifications indicate whether each read aligned to 0, 1, or multiple places in the genome, and whether it aligned end-to-end or was clipped.
      • This tool also upper-triangularizes the reads, i.e. if the coordinate of the first read is higher than that of the second, the two reads are flipped.
      • For more details, see the pairtools documentation.

  • pairtools sort
      • Produces a sorted pairsam file from an input pairsam file.
      • Note that the flipping order and the sort order of chromosomes are not identical. See the docs for more details.

  • pairtools dedup --mark-dups
      • (equivalent to pairtools markasdup)
      • Identifies duplicate alignments.
      • Arbitrarily designates which of two duplicate alignments is marked as the duplicate.

  • pairtools select
      • Removes duplicates, multi-mapped reads, and reads non-uniquely mapped at the 5' end.
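
The logic of these four steps can be sketched in plain Python. This is a simplified illustration, not the pairtools implementation. The Pair record, the CHROM_ORDER table, and the example pairs are hypothetical. Duplicates are matched on exact coordinates and strands only, and just three pair-type labels are used ("UU" for uniquely mapped, "MM" for multi-mapped, "DD" for duplicate).

```python
from collections import namedtuple

Pair = namedtuple(
    "Pair", "read_id chrom1 pos1 strand1 chrom2 pos2 strand2 pair_type"
)

# Hypothetical chromosome order used for flipping. It is deliberately kept
# separate from the lexicographic order used when sorting below.
CHROM_ORDER = {"chr1": 0, "chr2": 1, "chrX": 2}


def flip(pair):
    """Swap the two sides if side 1 lies after side 2 (upper-triangular form)."""
    side1 = (CHROM_ORDER[pair.chrom1], pair.pos1)
    side2 = (CHROM_ORDER[pair.chrom2], pair.pos2)
    if side1 > side2:
        return pair._replace(
            chrom1=pair.chrom2, pos1=pair.pos2, strand1=pair.strand2,
            chrom2=pair.chrom1, pos2=pair.pos1, strand2=pair.strand1,
        )
    return pair


def sort_pairs(pairs):
    """Sort by chrom1 and chrom2 (as strings), then by pos1 and pos2."""
    return sorted(pairs, key=lambda p: (p.chrom1, p.chrom2, p.pos1, p.pos2))


def mark_duplicates(pairs):
    """Relabel every repeat of an already-seen unique pair as 'DD'."""
    seen = set()
    marked = []
    for p in pairs:
        key = (p.chrom1, p.pos1, p.strand1, p.chrom2, p.pos2, p.strand2)
        if p.pair_type == "UU":
            if key in seen:
                p = p._replace(pair_type="DD")
            else:
                seen.add(key)
        marked.append(p)
    return marked


def select_unique(pairs):
    """Keep only pairs that are uniquely mapped and not duplicates."""
    return [p for p in pairs if p.pair_type == "UU"]


pairs = [
    Pair("r1", "chr2", 500, "+", "chr1", 100, "-", "UU"),
    Pair("r2", "chr1", 100, "-", "chr2", 500, "+", "UU"),
    Pair("r3", "chr1", 300, "+", "chr1", 900, "-", "MM"),
    Pair("r4", "chrX", 50, "+", "chr1", 700, "+", "UU"),
]
flipped = [flip(p) for p in pairs]
kept = select_unique(mark_duplicates(sort_pairs(flipped)))
print([p.read_id for p in kept])
```

In this example r1 and r2 describe the same contact once r1 is flipped, so only one of them survives, and the multi-mapped pair r3 is dropped at the selection step.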

Source files (v1.1.1_dcic_4):

  • Workflow: https://data.4dnucleome.org/workflows/4DNWFMRGIPB1/
  • CWL: https://github.com/4dn-dcic/iMARGI-Docker/blob/v1.1.1_dcic_4/src/cwl/imargi-processing-bam.cwl

Aggregation

Pairs are merged and aggregated with pairix version 0.3.3. Pairs files are then converted to mcool via cooler version 0.8.5.

Merging:

  • There is no merging of sequencing replicates. Processing is performed separately for each sequencing replicate.
  • Biological replicates are merged using the same method as the Hi-C processing pipeline. That is,
      • Biological replicates are merged after the duplicate removal step, since PCR duplication events happen independently in each replicate.
      • Merging is performed on pairs files using run-merge-pairs.sh.
  • 4DN DCIC provides a merged output as a merged pairs file.
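
Because each replicate is already sorted and deduplicated, merging amounts to interleaving sorted streams without any further duplicate removal. The sketch below illustrates that idea in Python. It does not reproduce run-merge-pairs.sh; it assumes tab-separated pairs lines with columns readID, chr1, pos1, chr2, pos2, strand1, strand2, a sort order of chr1, chr2, pos1, pos2, and no header lines. The names pair_key and merge_sorted_pairs are hypothetical.

```python
import heapq


def pair_key(line):
    """Sort key for one pairs line: (chr1, chr2, pos1, pos2)."""
    fields = line.rstrip("\n").split("\t")
    return (fields[1], fields[3], int(fields[2]), int(fields[4]))


def merge_sorted_pairs(*replicates):
    """Merge already-sorted replicates into one sorted list of lines."""
    return list(heapq.merge(*replicates, key=pair_key))


# Two hypothetical biological replicates, each sorted on its own.
rep1 = [
    "a\tchr1\t100\tchr1\t500\t+\t-",
    "b\tchr1\t200\tchr2\t50\t+\t+",
]
rep2 = [
    "c\tchr1\t150\tchr1\t400\t-\t-",
    "d\tchr2\t10\tchr2\t90\t+\t-",
]
merged = merge_sorted_pairs(rep1, rep2)
print([line.split("\t")[0] for line in merged])
```

Identical coordinates appearing in two replicates are both kept, consistent with duplicates being removed within each replicate before merging.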

File Format Conversion:

  • mcool files store contact matrices at multiple resolutions and can be visualized in HiGlass.
  • The 4DN standard resolutions for mcool files are: 1kb, 2kb, 5kb, 10kb, 25kb, 50kb, 100kb, 250kb, 500kb, 1Mb, 2.5Mb, 5Mb, 10Mb.
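
To make the resolution list concrete, the following Python sketch converts the labels above to bin sizes in base pairs and shows how a genomic position maps to a bin index at a given resolution. The helper names to_bp and bin_index are hypothetical, and positions are assumed to be 1-based, as in pairs files. This only illustrates fixed-width binning and is not the cooler implementation.

```python
LABELS = ["1kb", "2kb", "5kb", "10kb", "25kb", "50kb", "100kb",
          "250kb", "500kb", "1Mb", "2.5Mb", "5Mb", "10Mb"]
UNITS = {"kb": 1_000, "Mb": 1_000_000}


def to_bp(label):
    """Convert a label such as '2.5Mb' into a bin size in base pairs."""
    for unit, factor in UNITS.items():
        if label.endswith(unit):
            return int(float(label[:-len(unit)]) * factor)
    raise ValueError(f"unrecognized resolution label: {label}")


RESOLUTIONS = [to_bp(label) for label in LABELS]


def bin_index(pos, resolution):
    """Index of the fixed-width bin that contains a 1-based position."""
    return (pos - 1) // resolution


print(",".join(str(r) for r in RESOLUTIONS))
print(bin_index(12345, 1_000), bin_index(12345, 10_000))
```

The same position therefore falls into a different bin at each resolution, which is why the contact matrix is stored once per resolution.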

Source files (v1.1.1_dcic_4):

  • Workflow: https://data.4dnucleome.org/workflows/4DNWFMRGIPC1/
  • CWL: https://github.com/4dn-dcic/iMARGI-Docker/blob/v1.1.1_dcic_4/src/cwl/imargi-processing-pairs.cwl
  • Docker (for merging only): https://github.com/4dn-dcic/docker-4dn-hic/tree/v43

QC

The 4DN version of the iMARGI pipeline also produces a QC report and summary statistics from the output pairs files. See an example report here.

Source files (v1.1.1_dcic_4):

  • CWL: https://github.com/4dn-dcic/iMARGI-Docker/blob/v1.1.1_dcic_4/src/cwl/imargi_qc.cwl
  • Script: https://github.com/4dn-dcic/iMARGI-Docker/blob/v1.1.1_dcic_4/src/scripts/imargi_stats.sh