Alignments in bam format are sorted and duplicates marked with Picard version 2.20.7. This consists of two steps:
- Sorting bams with SortSam, with SORT_ORDER=coordinate to specify that the output should be sorted by coordinate (used in the second step).
- Marking duplicates for removal with MarkDuplicates.
In both steps, the flag VALIDATION_STRINGENCY=LENIENT is used to specify a more relaxed validation stringency (relative to STRICT). Refer to the Picard documentation for further details.
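The two Picard steps above can be sketched as a pair of shell functions. This is only an illustration under stated assumptions, not the pipeline's actual invocation: the jar location (PICARD_JAR), the function names, and the metrics file argument are hypothetical. MarkDuplicates needs a METRICS_FILE, and by default it only flags duplicates rather than removing them.

```shell
# Hypothetical path to the Picard jar; adjust for your installation.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

# Step 1: sort alignments by coordinate.
# Arguments: <input.bam> <sorted.bam>
sort_bam() {
    java -jar "$PICARD_JAR" SortSam \
        INPUT="$1" OUTPUT="$2" \
        SORT_ORDER=coordinate \
        VALIDATION_STRINGENCY=LENIENT
}

# Step 2: mark duplicates (reads are flagged, not removed, at this stage).
# Arguments: <sorted.bam> <marked.bam> <metrics.txt>
mark_duplicates() {
    java -jar "$PICARD_JAR" MarkDuplicates \
        INPUT="$1" OUTPUT="$2" \
        METRICS_FILE="$3" \
        VALIDATION_STRINGENCY=LENIENT
}

# Example usage (file names are placeholders):
#   sort_bam sample.bam sample.sorted.bam
#   mark_duplicates sample.sorted.bam sample.marked.bam sample.dup_metrics.txt
```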
Duplicates are then removed using samtools version 1.9. Specifically, the command is:
samtools view -F 1024 -f 2 -b <input.bam>
- -F 1024 omits reads marked as PCR or optical duplicates
- -f 2 restricts the results to reads mapped in a proper pair
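To make the flag logic concrete, the sketch below reproduces the bitwise test that -F and -f apply to the FLAG field of each read: -F drops a read if any of the given bits is set, and -f keeps a read only if all of the given bits are set. The function name passes_filter and the sample FLAG values are illustrative assumptions, not part of the pipeline.

```shell
# Return success (0) if a read with the given SAM FLAG would pass
# "samtools view -F 1024 -f 2", and failure (1) otherwise.
passes_filter() {
    flag=$1
    # -F 1024: the duplicate bit (0x400) must NOT be set
    [ $(( flag & 1024 )) -eq 0 ] || return 1
    # -f 2: the proper-pair bit (0x2) MUST be set
    [ $(( flag & 2 )) -eq 2 ] || return 1
    return 0
}

# 99   = paired + proper pair + mate reverse + first in pair
# 1123 = 99 plus the duplicate bit (1024)
# 65   = paired + first in pair, but not a proper pair
for flag in 99 1123 65; do
    if passes_filter "$flag"; then
        echo "$flag kept"
    else
        echo "$flag dropped"
    fi
done
# prints:
#   99 kept
#   1123 dropped
#   65 dropped
```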
The alignments are converted into bedpe format using bedtools version 2.29.0:
bedtools bamtobed -i <input.bam> -bedpe > <input.bedpe>
Finally, the files pass through a set of cleaning and sorting steps recommended for peak calling with SEACR (see section below):
awk '$1==$4 && $1!="." && $6-$2 < 1000 {print $0}' <input.bedpe> | cut -f 1-6 | sort -k1,1 -k2,2n -k3,3n
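In the awk filter, $1==$4 specifies that mates must be located on the same chromosome, $1!="." specifies that mates cannot be null (bedtools reports an unmapped mate with "." as its chromosome), and $6-$2 < 1000 specifies that the distance from the start of the first mate to the end of the second mate must be less than 1000 bases. cut -f 1-6 then retains only the first six columns of the bedpe file, and sort orders the records by chromosome, then numerically by start and end position.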
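As a quick check of the filter's behavior, the same awk, cut, and sort pipeline can be run on a few hand-made bedpe records. The records and the wrapper function clean_bedpe are made up for illustration; only the pipeline itself comes from the command above.

```shell
# Wrap the cleaning pipeline so it reads bedpe records from stdin.
clean_bedpe() {
    awk '$1==$4 && $1!="." && $6-$2 < 1000 {print $0}' \
        | cut -f 1-6 \
        | sort -k1,1 -k2,2n -k3,3n
}

# Five tab-separated toy records (chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2):
#   r1: same chromosome, span 250            -> kept
#   r2: mates on different chromosomes       -> dropped
#   r3: null mates (chromosome is ".")       -> dropped
#   r4: same chromosome, span 1950           -> dropped
#   r5: same chromosome, span 200            -> kept, sorts before r1
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr1 100 150 chr1 300 350 r1 30 + - \
    chr1 100 150 chr2 300 350 r2 30 + - \
    . -1 -1 . -1 -1 r3 0 . . \
    chr1 100 150 chr1 2000 2050 r4 30 + - \
    chr1 50 100 chr1 200 250 r5 30 + - \
    | clean_bedpe
# prints (tab-separated):
#   chr1  50   100  chr1  200  250
#   chr1  100  150  chr1  300  350
```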