Computational Systems Biology Methods and Protocols.7z

researchers. We cannot discuss all the aspects of BS-seq in this
chapter. A collection of BS-seq data analysis tools and pipelines
can be found online at OMICtools
(https://omictools.com/bs-seq-category).

2.7 Machine Learning Methods


Machine learning (ML) techniques are widely used to build data
models in many domains, and they can also be applied to ctDNA
data analysis. Most of the applicable methods are supervised
learning methods, which build classifiers by training on
labeled data. In this subsection, we show how ML can be used
to build classifiers from ctDNA sequencing data.
One ML application is to distinguish cfDNA data from non-cfDNA
data. cfDNA has characteristic fragmentation patterns, which produce
nonrandom base content curves in the first cycles of the sequencing
data. The cfDNA fragmentation patterns were first reported at
single-nucleotide resolution by Chandrananda et al. in 2014
[55]. They found high-frequency 10-nucleotide motifs on
either side of cfDNA fragments, and the first two bases of the
cfDNA at the cleavage site could determine most of the other eight
bases. Their follow-up study in 2015 indicated that these fragmentation
patterns are related to nonrandom biological cleavage across
chromosomes. The ten positions on either side of the DNA cleavage
site show consistent patterns, with preferences for specific
nucleotides in nucleosomal cores and linker regions. Figure 7 shows the
fragmentation pattern of plasma cfDNA sequencing data.
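The base content curves described above can be illustrated with a short sketch. It counts the fraction of A, C, G, and T at each of the first ten cycles of a set of reads and flattens the result into a numeric vector. This is only an illustration under stated assumptions: the function names (`base_content_curves`, `read_fastq_sequences`), the toy reads, and the choice of a 40-value feature vector are hypothetical and are not taken from any published tool.

```python
from collections import Counter

def base_content_curves(reads, cycles=10):
    """For each of the first `cycles` positions, return the fraction of A/C/G/T."""
    counts = [Counter() for _ in range(cycles)]
    for read in reads:
        for pos, base in enumerate(read[:cycles].upper()):
            if base in "ACGT":  # skip N and other ambiguous calls
                counts[pos][base] += 1
    curves = []
    for counter in counts:
        total = sum(counter.values())
        curves.append({b: (counter[b] / total if total else 0.0) for b in "ACGT"})
    return curves

def read_fastq_sequences(lines):
    """Yield the sequence line (the 2nd of every 4 lines) from FASTQ text lines."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()

# Toy FASTQ records, made up for illustration (not real cfDNA data).
toy_fastq = """@r1
ACGTACGTACGT
+
IIIIIIIIIIII
@r2
ACGGACGTACGT
+
IIIIIIIIIIII
@r3
TCGTACGTACGA
+
IIIIIIIIIIII
@r4
ACNTACGTACGT
+
IIIIIIIIIIII
""".splitlines()

curves = base_content_curves(read_fastq_sequences(toy_fastq), cycles=10)

# Assumed feature layout: 10 cycles x 4 bases = 40 numbers per FASTQ file.
feature_vector = [curves[pos][b] for pos in range(10) for b in "ACGT"]
```

Plotting `curves` cycle by cycle would give content curves of the kind shown in Fig. 7. With real data, the reads would come from an actual FASTQ file rather than the toy records.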
Since this fragmentation pattern of cfDNA is stable and distinctive,
it can be used to differentiate cfDNA data from data of other kinds
of samples. The authors have developed an open source tool, called
CfdnaPattern, that trains classifiers such as SVM, KNN, or random forest
to predict whether a FASTQ file was sequenced from cfDNA or not.
Cross-validation using 0.632+ bootstrapping [56] with more than
3000 FASTQ files gave an average accuracy of 99.8%, obtained
with random forest, linear SVM, or KNN classifiers. The tool is
written in Python and uses the widely used Python machine learning
package scikit-learn. It is available at
https://github.com/OpenGene/CfdnaPattern.
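To make the validation idea concrete, the following sketch estimates the error of a small k-nearest-neighbour classifier with bootstrap resampling. Several things here are assumptions and simplifications: it is not CfdnaPattern's code, it uses a hand-written KNN in standard-library Python instead of scikit-learn, it computes the plain .632 bootstrap estimate rather than the .632+ variant cited above, and the four-dimensional feature vectors are synthetic values invented for illustration.

```python
import random

def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training points nearest to x (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(tx, x)), ty)
        for tx, ty in zip(train_x, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

def error_rate(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test points that the KNN classifier labels incorrectly."""
    wrong = sum(knn_predict(train_x, train_y, x, k) != y
                for x, y in zip(test_x, test_y))
    return wrong / len(test_y)

def bootstrap_632_error(xs, ys, rounds=20, k=3, seed=0):
    """Plain .632 estimate: 0.368 * resubstitution error + 0.632 * out-of-bag error."""
    rng = random.Random(seed)
    n = len(xs)
    resub = error_rate(xs, ys, xs, ys, k)
    oob_errors = []
    for _ in range(rounds):
        picked = [rng.randrange(n) for _ in range(n)]  # draw n indices with replacement
        picked_set = set(picked)
        left_out = [i for i in range(n) if i not in picked_set]
        if not left_out:  # no out-of-bag samples in this round
            continue
        oob_errors.append(error_rate(
            [xs[i] for i in picked], [ys[i] for i in picked],
            [xs[i] for i in left_out], [ys[i] for i in left_out], k))
    oob = sum(oob_errors) / len(oob_errors)
    return 0.368 * resub + 0.632 * oob

# Synthetic, well-separated toy features (label 1 = "cfDNA-like", 0 = "other").
data_rng = random.Random(42)

def toy_sample(mean):
    return [data_rng.gauss(mean, 0.02) for _ in range(4)]

xs = [toy_sample(0.30) for _ in range(20)] + [toy_sample(0.20) for _ in range(20)]
ys = [1] * 20 + [0] * 20

estimated_error = bootstrap_632_error(xs, ys)
estimated_accuracy = 1.0 - estimated_error
```

In practice the hand-written classifier would be replaced by a library implementation, and each feature vector would be derived from one FASTQ file. The .632+ estimator additionally corrects the weighting for overfitting, which this sketch leaves out.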
Another ML application is to predict whether a mutation is
somatic or germline. Typically, tumor and normal samples are both

Fig. 7 The cfDNA fragmentation pattern. This figure shows base content curves for the first ten cycles of plasma
cfDNA sequencing data. The pattern is stable and reproducible across different plasma cfDNA samples.


88 Shifu Chen et al.
