Computational Systems Biology Methods and Protocols.7z

$\mathrm{ed}(R1_{o-1}, R2_{o-1}) > \mathrm{ed}(R1_{o}, R2_{o}) < \mathrm{ed}(R1_{o+1}, R2_{o+1})$.
Figure 4 shows an example of how AfterQC’s overlapping analysis
works.
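As a rough illustration of the offset search, the following Python sketch slides the reverse complement of read 2 along read 1 and keeps an offset whose distance is a strict local minimum, in the spirit of the inequality above. It is not AfterQC's code. For brevity it counts mismatches (Hamming distance) instead of computing a true edit distance, and the function names, the sign convention of the offset, and the thresholds `min_overlap` and `max_distance` are all assumptions made for this example.

```python
# Illustrative sketch only; names, thresholds and offset convention are assumed.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_distance(r1, r2rc, offset):
    """Return (mismatches, overlap_length) for a given offset.

    Assumed convention: offset >= 0 aligns r1[offset] with r2rc[0];
    offset < 0 aligns r1[0] with r2rc[-offset].
    Hamming distance stands in for edit distance here.
    """
    if offset >= 0:
        a, b = r1[offset:], r2rc
    else:
        a, b = r1, r2rc[-offset:]
    n = min(len(a), len(b))
    mismatches = sum(x != y for x, y in zip(a, b))  # zip stops at n
    return mismatches, n


def find_overlap_offset(r1, r2, min_overlap=6, max_distance=2):
    """Return (offset, distance, overlap_length) or None if no overlap is found."""
    r2rc = reverse_complement(r2)
    dist = {}
    for o in range(-(len(r2rc) - min_overlap), len(r1) - min_overlap + 1):
        d, n = overlap_distance(r1, r2rc, o)
        if n >= min_overlap:
            dist[o] = (d, n)

    inf = float("inf")
    best = None
    for o, (d, n) in dist.items():
        left = dist.get(o - 1, (inf, 0))[0]
        right = dist.get(o + 1, (inf, 0))[0]
        # Strict local minimum: ed(o-1) > ed(o) < ed(o+1)
        if d <= max_distance and left > d < right:
            if best is None or (d, -n) < (best[1], -best[2]):
                best = (o, d, n)
    return best


# Toy example: a 16-base fragment sequenced with 12-base paired reads.
fragment = "ACGTTGCAAGCTCATG"
read1 = fragment[:12]
read2 = reverse_complement(fragment[4:])
print(find_overlap_offset(read1, read2))  # (4, 0, 8)
```

In this toy pair the two reads share the last 8 bases of read 1, so the search settles on offset 4 with distance 0.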
Based on overlapping analysis, AfterQC can detect mismatches.
If the mismatched pair has unbalanced quality scores, meaning that
one base has a high quality score (i.e., > Q30) and the other has a very
low quality score (i.e., < Q15), AfterQC can automatically correct
the low-quality base. If the quality scores are not unbalanced,
AfterQC can mask the mismatched bases by changing them to N or by
assigning them zero quality scores. Based on the mismatches, AfterQC can
estimate the sequencing error rate and profile the distribution of
sequencing error transforms (e.g., how many bases are T but
sequenced as C).
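The decision rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not AfterQC's implementation: the function name `resolve_mismatch`, its return layout, and the choice to leave the quality score of a corrected base unchanged are hypothetical. Only the Q30 and Q15 thresholds and the two masking options come from the text. Both bases are assumed to be expressed on the same strand.

```python
from collections import Counter

HIGH_Q = 30  # "high quality" threshold from the text (> Q30)
LOW_Q = 15   # "very low quality" threshold from the text (< Q15)


def resolve_mismatch(b1, q1, b2, q2, mask_with_n=True):
    """Resolve one overlapped base pair.

    Returns ((base1, qual1), (base2, qual2), transform), where transform is
    (trusted_base, erroneous_base) if a correction was made, otherwise None.
    """
    if b1 == b2:
        return (b1, q1), (b2, q2), None
    # Unbalanced qualities: trust the high-quality base, correct the other.
    # Assumption: the corrected base keeps its original quality score.
    if q1 > HIGH_Q and q2 < LOW_Q:
        return (b1, q1), (b1, q2), (b1, b2)
    if q2 > HIGH_Q and q1 < LOW_Q:
        return (b2, q1), (b2, q2), (b2, b1)
    # Balanced qualities: mask both bases, either as N or with zero quality.
    if mask_with_n:
        return ("N", q1), ("N", q2), None
    return (b1, 0), (b2, 0), None


# Tally error transforms, e.g. how often a trusted A was sequenced as T.
profile = Counter()
_, _, transform = resolve_mismatch("A", 38, "T", 8)
if transform is not None:
    profile[transform] += 1
print(profile)  # Counter({('A', 'T'): 1})
```

Keeping the tally separate from the correction step makes it straightforward to accumulate the error profile over all read pairs.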
Overlapping analysis can also be used for automatic adapter cutting.
During overlapping analysis, we obtain the optimal offset O for
the best local alignment of each pair. The overlapping length of the
pair can be calculated directly from the offset O. If O is
negative, the bases outside the overlapping region are considered
part of the adapter sequences and are cut automatically.
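A minimal Python sketch of this trimming step is given below. It assumes a particular sign convention that the text does not spell out: a negative offset O = -k means that the reverse complement of read 2 starts k bases before read 1, so the insert is shorter than the read and the last k bases of each read run into the adapter. The function name `trim_adapters` is hypothetical, and reads of equal length are assumed.

```python
def trim_adapters(read1, read2, offset):
    """Cut adapter bases from both reads when the offset is negative.

    Assumed convention: offset = -k means the overlap (the insert) covers
    the first len(read2) - k bases of each read; the rest is adapter.
    """
    if offset >= 0:
        # Insert is at least as long as the reads: nothing to cut.
        return read1, read2
    overlap_len = min(len(read1), len(read2) + offset)
    return read1[:overlap_len], read2[:overlap_len]


# Toy example: a 10-base insert sequenced with 14-base reads.
# The trailing 4 bases of each read are made-up adapter bases.
read1 = "ACGTTGCAAG" + "AGAT"
read2 = "CTTGCAACGT" + "CTGT"
print(trim_adapters(read1, read2, -4))  # ('ACGTTGCAAG', 'CTTGCAACGT')
```

Because the cut position follows from the offset alone, no adapter sequence needs to be supplied in this sketch.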
AfterQC is an open source tool: https://github.com/OpenGene/AfterQC.
It is implemented in Python and C++, with PyPy support enabled.
AfterQC generates a standalone HTML report for each input, with
figures plotted by Plotly. A sample report can be found at:
http://opengene.org/AfterQC/report.html.

2.2 Molecular Barcoding Sequencing and Its Data Analysis


The potential of NGS deep sequencing for ctDNA was hampered
by systemic errors introduced by PCR and sequencing methods
[27, 28]. Molecular indexing combined with deep sequencing
holds great promise to break the limit imposed by PCR and
sequencing errors and enables the detection of rare and ultra-rare
mutations [29, 30].
Tagging individual templates with molecular barcodes has been
proposed and reported since 2007 [31]. The molecular barcodes or
molecular indexes have been given various names, such as unique
identifiers (UID) [29], unique molecular identifiers (UMI) [32],
primer ID [30], duplex barcodes [33], etc. They are usually
designed as a string of totally random nucleotides (such as
NNNNNNNN), partially degenerate nucleotides (such as

Fig. 4 How AfterQC’s overlapping analysis works. The edit distance of the overlapped subsequences is 1. A
mismatch pair is found with a high-quality base A and a very low-quality base T. This T will be recognized as
wrongly represented and can be corrected.


78 Shifu Chen et al.
