StaticSection Filtering

Released October 6th, 2021 at 8:17pm

Filtering


Alignments in bam format are sorted, and their duplicates are marked, with Picard version 2.20.7. This consists of two steps:

  • Sorting bams with SortSam, using SORT_ORDER=coordinate so that the output is sorted by coordinate (the order used in the second step).
  • Marking duplicates for later removal with MarkDuplicates.

In both steps, the flag VALIDATION_STRINGENCY=LENIENT is used to specify a more relaxed validation stringency (relative to STRICT). Refer to the Picard documentation for further details.
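As a rough illustration of the two steps above, the sketch below prints the corresponding Picard command lines instead of running them, so they can be reviewed first. The jar location (PICARD_JAR), the output file names, and the METRICS_FILE path are assumptions made for this example and are not taken from the pipeline; MarkDuplicates needs a metrics file, but its name here is hypothetical.

```shell
#!/bin/sh
# Print (do not run) the two Picard commands for one bam file.
# PICARD_JAR and all derived file names are placeholders for illustration.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

picard_commands() {
    in_bam="$1"
    base="${in_bam%.bam}"   # strip the .bam suffix to build output names
    # Step 1: sort by coordinate.
    echo "java -jar ${PICARD_JAR} SortSam INPUT=${in_bam} OUTPUT=${base}.sorted.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT"
    # Step 2: mark duplicates in the sorted file (metrics file name is hypothetical).
    echo "java -jar ${PICARD_JAR} MarkDuplicates INPUT=${base}.sorted.bam OUTPUT=${base}.marked.bam METRICS_FILE=${base}.dup_metrics.txt VALIDATION_STRINGENCY=LENIENT"
}

picard_commands sample.bam
```

Piping the printed lines to sh would execute them, provided Java and the Picard jar are available.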

Duplicates, along with reads that are not mapped in a proper pair, are then removed using samtools version 1.9. Specifically, the command is:

samtools view -F 1024 -f 2 -b <input.bam>
  • -F 1024 omits reads marked as PCR or optical duplicates
  • -f 2 restricts the results to reads mapped in a proper pair

The alignments are converted into bedpe format using bedtools version 2.29.0:

bedtools bamtobed -i <input.bam> -bedpe > <input.bedpe>
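Since the later filtering refers to columns by position, it may help to see the ten-column layout that bedtools writes with -bedpe. The sketch below labels the fields of a single made-up record; the function name label_bedpe and the record values are illustrative only.

```shell
#!/bin/sh
# label_bedpe: read bedpe records on stdin and print each field with its
# column number and conventional name.
label_bedpe() {
    awk -F'\t' '{
        split("chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2", names, " ")
        for (i = 1; i <= NF; i++) printf "$%d %s = %s\n", i, names[i], $i
    }'
}

# One made-up read pair on chr1.
printf 'chr1\t100\t150\tchr1\t300\t350\tread1\t60\t+\t-\n' | label_bedpe
```

Columns 1 to 3 describe the first mate and columns 4 to 6 the second, which is why expressions such as $1, $4, $2, and $6 appear in the cleaning step.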

Finally, the bedpe files go through a last round of cleaning and sorting, as recommended for peak calling with SEACR (see section below):

awk '$1==$4 && $1!="." && $6-$2 < 1000 {print $0}' <input.bedpe> | cut -f 1-6 | sort -k1,1 -k2,2n -k3,3n
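  • $1==$4 specifies that mates must be located on the same chromosome,
  • $1!="." specifies that mates cannot be null (bedpe uses "." for an unknown chromosome),
  • $6-$2 < 1000 specifies that the span from the start of the first listed mate ($2) to the end of the second ($6) must be less than 1000 bases,
  • cut -f 1-6 retains only the first six columns of the bedpe file, and
  • sort -k1,1 -k2,2n -k3,3n sorts the records by chromosome, then numerically by the start and end of the first mate.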
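The effect of each condition can be checked on a small made-up input. The sketch below wraps the same awk, cut, and sort pipeline in a function (the name clean_bedpe is hypothetical) and feeds it five invented records, three of which should be rejected.

```shell
#!/bin/sh
# clean_bedpe: the cleaning pipeline from the text, reading bedpe on stdin.
clean_bedpe() {
    awk '$1==$4 && $1!="." && $6-$2 < 1000 {print $0}' |
        cut -f 1-6 |
        sort -k1,1 -k2,2n -k3,3n
}

# Five made-up records, tab-separated, ten columns each:
#   r1, r2: same chromosome, span 250           -> kept
#   r3:     mates on different chromosomes      -> dropped
#   r4:     span 1250 - 100 = 1150              -> dropped
#   r5:     null record                         -> dropped
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr2 500 550 chr2 700 750 r1 60 + - \
    chr1 100 150 chr1 300 350 r2 60 + - \
    chr1 100 150 chr3 300 350 r3 60 + - \
    chr1 100 150 chr1 1200 1250 r4 60 + - \
    . -1 -1 . -1 -1 r5 0 . . |
clean_bedpe
```

Only r2 and r1 survive, printed in that order with six columns each, because sorting puts chr1 ahead of chr2.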