Released October 6th, 2021 at 8:17pm
Filtering
Alignments in bam format are sorted and duplicates marked with Picard version 2.20.7. This consists of two steps:
- Sorting bams with SortSam, with SORT_ORDER=coordinate to specify that the alignments should be sorted by coordinate (the sorted output is used in the second step).
- Marking duplicates for removal with MarkDuplicates.
In both steps, the flag VALIDATION_STRINGENCY=LENIENT is used to specify a more relaxed validation stringency (relative to STRICT). Refer to the Picard documentation for further details.
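As a rough sketch, the two Picard steps described above might be invoked as follows. The jar location (PICARD_JAR), the file names, and the M= metrics file are assumptions made for illustration and are not taken from the original pipeline; only the tool names, SORT_ORDER=coordinate, and VALIDATION_STRINGENCY=LENIENT come from the text. The function prints the command lines instead of running them, so they can be reviewed first.

```shell
# Hypothetical jar location; override by setting PICARD_JAR in the environment.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

# Print the two Picard command lines for a sample prefix ($1).
# All file names derived from the prefix are illustrative placeholders.
picard_commands() {
    # Step 1: sort the alignments by coordinate.
    echo "java -jar $PICARD_JAR SortSam I=$1.bam O=$1.sorted.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT"
    # Step 2: mark duplicates; MarkDuplicates also writes a metrics file (M=).
    echo "java -jar $PICARD_JAR MarkDuplicates I=$1.sorted.bam O=$1.marked.bam M=$1.dup_metrics.txt VALIDATION_STRINGENCY=LENIENT"
}

picard_commands sample
```

Piping the output to a shell (for example, picard_commands sample | sh) would execute the two steps in order, assuming Java and the Picard jar are available.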
Duplicates are then removed using samtools version 1.9. Specifically, the command is:
samtools view -F 1024 -f 2 -b <input.bam>
- -F 1024 omits reads marked as PCR or optical duplicates
- -f 2 restricts the results to reads mapped in a proper pair
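To make the two flag filters concrete, the sketch below applies the same bit tests to a few example SAM FLAG values using shell arithmetic. The function name passes_filter and the example values are illustrative only; samtools performs this check internally.

```shell
# Succeed (exit status 0) when SAM FLAG value $1 would pass `-F 1024 -f 2`.
passes_filter() {
    # -F 1024: bit 1024 (PCR or optical duplicate) must be unset.
    # -f 2:    bit 2 (mapped in a proper pair) must be set.
    [ $(( $1 & 1024 )) -eq 0 ] && [ $(( $1 & 2 )) -ne 0 ]
}

# 99 and 147 are a typical proper pair; 1123 is 99 plus the duplicate bit;
# 65 is paired but not mapped in a proper pair.
for flag in 99 147 1123 65; do
    if passes_filter "$flag"; then
        echo "$flag kept"
    else
        echo "$flag dropped"
    fi
done
```

This prints "kept" for 99 and 147 and "dropped" for 1123 and 65.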
The alignments are converted into bedpe format using bedtools version 2.29.0:
bedtools bamtobed -i <input.bam> -bedpe > <input.bedpe>
Finally, the files pass through a set of cleaning and sorting steps recommended for peak calling with SEACR (see section below):
awk '$1==$4 && $1!="." && $6-$2 < 1000 {print $0}' <input.bedpe> | cut -f 1-6 | sort -k1,1 -k2,2n -k3,3n
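- $1==$4 specifies that mates must be located on the same chromosome,
- $1!="." specifies that mates cannot be null,
- $6-$2 < 1000 specifies that the fragment spanned by the pair (from the start of the first mate to the end of the second) must be shorter than 1000 bases,
- cut -f 1-6 retains only the first six columns of the bedpe file, and
- sort -k1,1 -k2,2n -k3,3n sorts the records by chromosome, then numerically by the start and end of the first mate.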
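The effect of the cleaning command can be checked on a small, made-up bedpe file. In the sketch below, the wrapper function filter_bedpe, the temporary file, and all five records are hypothetical; the awk, cut, and sort invocations are the ones given above.

```shell
# Apply the cleaning and sorting steps to the bedpe file given as $1.
filter_bedpe() {
    awk '$1==$4 && $1!="." && $6-$2 < 1000 {print $0}' "$1" |
        cut -f 1-6 |
        sort -k1,1 -k2,2n -k3,3n
}

# Made-up, tab-separated records with ten columns:
# chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2
tmp="$(mktemp)"
printf 'chr2\t500\t550\tchr2\t600\t650\tr1\t60\t+\t-\n'   >  "$tmp"  # kept: 650-500 = 150
printf 'chr1\t100\t150\tchr1\t1200\t1250\tr2\t60\t+\t-\n' >> "$tmp"  # dropped: 1250-100 = 1150
printf 'chr1\t100\t150\tchr3\t200\t250\tr3\t60\t+\t-\n'   >> "$tmp"  # dropped: different chromosomes
printf '.\t-1\t-1\t.\t-1\t-1\tr4\t0\t.\t.\n'              >> "$tmp"  # dropped: null mates
printf 'chr1\t100\t150\tchr1\t200\t250\tr5\t60\t+\t-\n'   >> "$tmp"  # kept: 250-100 = 150
filter_bedpe "$tmp"
rm -f "$tmp"
```

Only records r5 and r1 survive, in that order, each reduced to its first six columns.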