Cleaning and Alignment

Released October 4th, 2021 at 2:36pm


In iMARGI experiments, each RNA-end read (R1) begins with two random bases. To improve mapping, R1 reads are therefore cleaned using seqtk version 1.3. The command

seqtk trimfq -b 2

removes two bases (-b 2) from the left end of each read.
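To make the effect of the trimming step concrete, the sketch below mimics it on a single FASTQ record using awk. It is an illustration only, not how seqtk is implemented: the helper name trim_left and the example read are hypothetical, and the sketch assumes plain 4-line FASTQ records and ignores edge cases such as very short reads.

```shell
# Hypothetical helper (not part of seqtk): remove the first N characters from
# the sequence line and the quality line of every 4-line FASTQ record on stdin.
trim_left() {
  # $1 = number of bases to remove from the left end (defaults to 2)
  awk -v n="${1:-2}" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, n + 1); next }  # sequence / quality
    { print }                                                     # header / separator
  '
}

# Made-up record whose first two bases stand in for the random bases.
printf '@read1\nNNACGTACGT\n+\nIIIIIIIIII\n' | trim_left 2
# Prints:
# @read1
# ACGTACGT
# +
# IIIIIIII
```

The sequence and quality lines are shortened together, so the record stays a valid FASTQ entry after trimming.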

Reads are then mapped to the GRCh38 (human) or mm10 (mouse) reference genome using bwa version 0.7.17. In particular, we run:

bwa mem -t <nthreads> -SP5M <genome_index> <fastq1> <fastq2>
  • The -SP options are used to ensure that the results are equivalent to those obtained by running bwa mem on each mate separately, while retaining the correct formatting for paired-end reads. They skip the mate-rescue (-S) and pairing (-P) steps in bwa mem, which otherwise force alignment of a poorly aligned read given an alignment of its mate, on the assumption that the two mates are part of a single genomic segment.
  • The -5 option is used to report the 5' portion of chimeric alignments as the primary alignment. For chimeric alignments, bwa mem reports two alignments: one is annotated as primary and soft-clipped, retaining the full length of the original sequence; the other is hard-clipped and marked as either 'supplementary' or 'secondary'.
  • The -M option is used to annotate the secondary/supplementary clipped reads as secondary rather than supplementary, for compatibility with some public software tools such as Picard MarkDuplicates.
  • The -t option is used for multi-threading and should not affect the result.
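The primary, secondary, and supplementary annotations mentioned above are stored as bits in the SAM FLAG field (0x100 = 256 for secondary, 0x800 = 2048 for supplementary). The following is a minimal sketch of how a FLAG value could be classified in shell. The function name classify_flag and the example FLAG values are illustrative and not part of the pipeline.

```shell
# Hypothetical helper: classify a SAM FLAG value by its secondary (0x100)
# and supplementary (0x800) bits.
classify_flag() {
  flag=$1
  if [ $(( flag & 2048 )) -ne 0 ]; then
    echo supplementary
  elif [ $(( flag & 256 )) -ne 0 ]; then
    echo secondary
  else
    echo primary
  fi
}

classify_flag 65     # paired (1) + first in pair (64)  -> primary
classify_flag 321    # 65 + 256                         -> secondary
classify_flag 2113   # 65 + 2048                        -> supplementary
```

With the -M option described above, the clipped part of a chimeric alignment would be expected to carry the secondary bit rather than the supplementary bit.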

Source files (v1.1.1_dcic_4):

  • Workflow: https://data.4dnucleome.org/workflows/4DNWFMRGIPA1/
  • CWL: https://github.com/4dn-dcic/iMARGI-Docker/blob/v1.1.1_dcic_4/src/cwl/imargi-processing-fastq.cwl