Several tools exist for removing PCR duplicates. Picard
MarkDuplicates compares sequences at the 5′ positions of both
reads and read pairs in a SAM/BAM file. After duplicate reads are
identified, the tool distinguishes the primary read from its
duplicates using an algorithm that ranks reads by the sum of their
base quality scores. However, this tool can cause unwanted removal
of tumor-derived mutated reads when they share mapping coordinates
with wild-type reads.
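The ranking step described above can be illustrated with a minimal sketch. It is not Picard's implementation: the `Read` record, its field names, and the grouping key (chromosome, 5′ coordinate, strand) are assumptions made for illustration, and read names are assumed to be unique.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical simplified read record (not a SAM/BAM record)."""
    name: str
    chrom: str
    five_prime_pos: int          # 5' mapping coordinate
    strand: str                  # "+" or "-"
    base_quals: Tuple[int, ...]  # per-base Phred quality scores


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates."""
    # Group reads that share the same 5' mapping coordinate and strand.
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.five_prime_pos, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        # The read with the highest summed base quality is kept as primary.
        primary = max(group, key=lambda r: sum(r.base_quals))
        duplicates.update(r.name for r in group if r.name != primary.name)
    return duplicates


reads = [
    Read("r1", "chr1", 100, "+", (30, 30, 30)),
    Read("r2", "chr1", 100, "+", (35, 35, 35)),
    Read("r3", "chr1", 250, "+", (20, 20, 20)),
]
print(sorted(mark_duplicates(reads)))  # ['r1']
```

Note that the sequence content plays no role in the grouping, which is why a mutated read sharing coordinates with a higher-quality wild-type read would be flagged in this sketch.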
Another approach was introduced by CAPP-seq [37]. It collapses
reads with completely identical sequences, except for reads with
ultralow quality scores. This method is less lossy, since it removes
fewer reads than Picard MarkDuplicates. However, it is sensitive to
sequencing errors, because a single erroneous base makes two true
duplicates non-identical, so the duplication level of the processed
data can remain very high.
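A minimal sketch of collapsing identical sequences follows. It is an illustration under stated assumptions, not the CAPP-seq pipeline itself: the `(start, sequence, quals)` tuple layout, the use of the mapping start in the key, the mean-quality threshold `min_mean_qual`, and the choice to set low-quality reads aside rather than collapse them are all hypothetical.

```python
def collapse_identical(reads, min_mean_qual=10):
    """Collapse reads with identical start and sequence.

    reads: iterable of (start, sequence, quals) tuples (hypothetical layout).
    Returns (collapsed, set_aside), where collapsed maps (start, sequence)
    to the number of reads merged into that entry.
    """
    collapsed = {}
    set_aside = []
    for start, seq, quals in reads:
        # Assumption: very low-quality reads are excluded from collapsing.
        if sum(quals) / len(quals) < min_mean_qual:
            set_aside.append((start, seq, quals))
            continue
        key = (start, seq)
        collapsed[key] = collapsed.get(key, 0) + 1
    return collapsed, set_aside


reads = [
    (100, "ACGT", (30, 30, 30, 30)),
    (100, "ACGT", (32, 31, 30, 33)),  # exact duplicate: collapsed
    (100, "ACGA", (30, 30, 30, 30)),  # one mismatch: kept as a separate read
    (100, "ACGT", (2, 2, 2, 2)),      # ultralow quality: set aside
]
collapsed, set_aside = collapse_identical(reads)
print(collapsed)       # {(100, 'ACGT'): 2, (100, 'ACGA'): 1}
print(len(set_aside))  # 1
```

The third read shows the weakness noted above: if its last base is a sequencing error, it is a true duplicate that escapes collapsing.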
Molecular barcoding sequencing, introduced above, is a newer
approach that appears to be effective for removing PCR duplicates.
Because UID ligation is performed before any amplification, reads
derived from the same original DNA molecule share the same UID.
By clustering reads on UID and read sequence, PCR duplicates can
be detected, and the consensus read generation process removes
the duplicated reads. Table 4 compares existing deduplication tools.
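The UID clustering and consensus step can be sketched as follows. This is a simplified illustration, not a specific tool's algorithm: the `(uid, start, sequence)` tuple layout, exact UID matching, equal-length reads within a family, and a per-column majority vote are all assumptions.

```python
from collections import Counter, defaultdict


def consensus_by_uid(reads):
    """Build one consensus read per (UID, start) family.

    reads: iterable of (uid, start, sequence) tuples (hypothetical layout).
    Assumes all reads in a family have the same length.
    """
    families = defaultdict(list)
    for uid, start, seq in reads:
        families[(uid, start)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        # Majority vote at each position; ties go to the base seen first.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


reads = [
    ("AACG", 100, "ACGT"),
    ("AACG", 100, "ACGT"),
    ("AACG", 100, "ACTT"),  # same molecule, one sequencing error
    ("GGTA", 100, "ACGT"),  # different original molecule, same position
]
print(consensus_by_uid(reads))
# {('AACG', 100): 'ACGT', ('GGTA', 100): 'ACGT'}
```

In this example the erroneous base is outvoted within its family, while the read with a different UID is retained as a separate molecule even though it maps to the same position.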
The methods described above detect duplicates before variant
calling. An alternative strategy is to detect duplicates after variant
calling: reads with the same mapping positions (start and end) are
collapsed into a unique read, and the numbers of unique reads
supporting the reference and alternative bases are reported for each
mutation. This unique read counting method gives a more accurate
count of supporting reads. With this strategy, we can apply less
lossy deduplication methods such as the CAPP-seq method to keep
more information for variant calling. We can even skip deduplication
before variant calling if the variant caller can handle data
containing duplicates.
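Unique read counting at a single mutation site might look like the sketch below. It is an illustration only and does not reproduce any particular tool: the `(start, end, base_at_site)` tuple layout is hypothetical, and resolving disagreement within a group by majority vote is an assumption, since the text does not state how such groups are handled.

```python
from collections import Counter, defaultdict


def count_unique_support(reads, ref_base, alt_base):
    """Count unique reads supporting the reference and alternative base.

    reads: iterable of (start, end, base_at_site) tuples for reads
    covering one mutation site (hypothetical layout).
    """
    # Reads with the same start and end are treated as one unique read.
    groups = defaultdict(Counter)
    for start, end, base in reads:
        groups[(start, end)][base] += 1

    support = {"ref": 0, "alt": 0}
    for bases in groups.values():
        # Assumption: the majority base represents the unique read.
        base = bases.most_common(1)[0][0]
        if base == ref_base:
            support["ref"] += 1
        elif base == alt_base:
            support["alt"] += 1
    return support


reads = [
    (100, 180, "A"), (100, 180, "A"), (100, 180, "A"),  # one unique read
    (105, 190, "T"), (105, 190, "T"),                   # one unique read
    (110, 200, "A"),                                    # one unique read
]
print(count_unique_support(reads, ref_base="A", alt_base="T"))
# {'ref': 2, 'alt': 1}
```

Here the raw counts are four reference reads and two alternative reads, whereas the unique read counts are two and one.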
MrBam is a tool designed for this unique read counting task. It
distinguishes unique reads supported by a single read from those
supported by multiple reads sharing the same mapping coordinates.
For paired-end sequencing data, it also distinguishes whether the
mutation lies in the overlapping or non-overlapping region of the
read pair.

Table 4
Feature comparison of existing deduplication tools

                        Information loss   Background noise   Error correction
Picard MarkDuplicates   High               Low                None
CAPP-seq                Low                High               None
Molecular barcodes      Low                Low                Yes

84 Shifu Chen et al.
