Released October 4th, 2021 at 2:36pm
Cleaning and Alignment
In iMARGI experiments, each RNA-end read begins with two random bases. To improve mapping, R1 reads are therefore cleaned using seqtk version 1.3. The command
seqtk trimfq -b 2
removes two bases (-b 2) from the left end of each read.
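To make the trimming step concrete, the sketch below reproduces in plain Python what removing two bases from the left end does to a single FASTQ record. It is an illustration only and not the pipeline's code, which calls seqtk. The function name `trim_left` and the example record are hypothetical, and edge cases such as very short reads are not modeled.

```python
def trim_left(record, n=2):
    """Drop the first n bases, and their quality scores, from a FASTQ record.

    `record` is a (header, sequence, separator, quality) tuple, i.e. the
    four lines of one FASTQ entry. This is a hypothetical helper that only
    mimics the effect of left-end trimming; it does not call seqtk.
    """
    header, seq, plus, qual = record
    # Sequence and quality strings must stay the same length after trimming.
    return (header, seq[n:], plus, qual[n:])


# Hypothetical read whose first two bases are the random bases to discard.
example = ("@read1", "NNACGTACGT", "+", "!!IIIIIIII")
trimmed = trim_left(example)
print(trimmed[1])
```

Trimming the quality string together with the sequence keeps the record valid, since FASTQ requires both lines to have equal length.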
Reads are then mapped to the GRCh38 (human) or mm10 (mouse) reference genome using bwa version 0.7.17. In particular, we run:
bwa mem -t <nthreads> -SP5M <genome_index> <fastq1> <fastq2>
- The -SP option is used to ensure the results are equivalent to those obtained by running bwa mem on each mate separately, while retaining the right formatting for paired-end reads. This option skips a step in bwa mem that forces alignment of a poorly aligned read given an alignment of its mate, under the assumption that the two mates are part of a single genomic segment.
- The -5 option is used to report the 5' portion of chimeric alignments as the primary alignment. For chimeric alignments, bwa mem reports two alignments: one is annotated as primary and soft-clipped, retaining the full length of the original sequence; the other is hard-clipped and marked as either 'supplementary' or 'secondary'.
- The -M option is used to annotate the secondary/supplementary clipped reads as secondary rather than supplementary, for compatibility with some public software tools such as picard MarkDuplicates.
- The -t option is used for multi-threading and should not affect the result.
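As a rough illustration of how the options above fit together, the following Python sketch assembles the bwa mem argument list and classifies an alignment record from its SAM FLAG and CIGAR string. The helper names `build_bwa_command` and `classify_alignment` and the example file names are hypothetical and are not part of the pipeline. The FLAG bits 0x100 (secondary) and 0x800 (supplementary) and the CIGAR operators S (soft clip) and H (hard clip) are taken from the SAM format specification.

```python
SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment


def build_bwa_command(genome_index, fastq1, fastq2, nthreads=1):
    """Return the bwa mem argument list described above (hypothetical helper)."""
    return ["bwa", "mem", "-t", str(nthreads), "-SP5M",
            genome_index, fastq1, fastq2]


def classify_alignment(flag, cigar):
    """Label one SAM record as primary/secondary/supplementary and by clipping."""
    if flag & SUPPLEMENTARY:
        kind = "supplementary"
    elif flag & SECONDARY:
        kind = "secondary"
    else:
        kind = "primary"
    if "H" in cigar:
        clip = "hard-clipped"
    elif "S" in cigar:
        clip = "soft-clipped"
    else:
        clip = "unclipped"
    return kind, clip


# Example file names are placeholders, not real pipeline inputs.
cmd = build_bwa_command("genome_index", "R1.clean.fastq", "R2.fastq", nthreads=4)
print(" ".join(cmd))

# A chimeric read reported as two records (FLAG and CIGAR values are made up).
print(classify_alignment(65, "60M40S"))
print(classify_alignment(321, "60H40M"))
```

Passing the command as a list, for example to `subprocess.run`, avoids shell quoting issues with file paths. With -M in effect, the clipped partner of a chimeric alignment would be expected to carry the 0x100 bit rather than 0x800.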
Source files (v1.1.1_dcic_4):
- Workflow: https://data.4dnucleome.org/workflows/4DNWFMRGIPA1/
- CWL: https://github.com/4dn-dcic/iMARGI-Docker/blob/v1.1.1_dcic_4/src/cwl/imargi-processing-fastq.cwl