genomic TU organization, and a machine learning approach is
applied to predict the genomic boundaries of individual TUs
[91]. Then, in addition to statistical model-based algorithms for
in-depth investigation of next-generation sequencing of cancer
genomes, the machine learning approaches (e.g., SNooPer based
on random forest classification models) have been developed to
accurately call somatic variants in low-depth, whole-exome, or
whole-genome sequencing data [39]. With such accurate annota-
tions, large-scale human genetic variation data can be obtained. For
example, the single nucleotide polymorphisms (SNPs) are an
important source of human genome variability and greatly contrib-
ute to human complex diseases, especially the amino acid mutations
resulting from non-synonymous SNPs in coding regions; and the
machine learning approach (e.g., support vector machine) has been
used to predict cancer driver missense variants by training on
cancer-causing variants and neutral polymorphisms with equal sam-
ple number [92]. To further detect the positive selection in those
genomic regions as a natural population genetic study, a machine
learning classification framework has been implemented to com-
bine selection tests to detect the features of polymorphism in hard
sweeping with controls on population-specific demography [45].
On the other hand, the high-throughput sequencing also
allows researchers to examine more details on the transcriptome
or other omics level than ever before (seeNote 2), and a key of
applying machine learning for such omics data is feature selection,
i.e., to reduce the original high-dimensional omics data into a
low-dimensional feature data. The CoRAL (Classification of
RNAs by Analysis of Length) is a computational method for dis-
criminating different classes of RNA, whose selected features are
relevant to small RNA biogenesis pathways [38]. The RGIFE
(Rule-guided Iterative Feature Elimination) is a heuristic method
to select very small set of features by rule-based machine learning
with balance on the objective of minimal features and high predic-
tive power [93]. Based on the widely usedk-top scoring pair (k-
TSP) algorithm, the integration ofk-TSP with other machine
learning methods (e.g., multivariate classifiers such as SVM)
would be a feature selector to tune certain data characteristics,
i.e., correlations among informative genes [94]. More practical,
the clinical application of omics data will ask for marker genes
whose expression patterns will be sufficient to accurately predict
the disease or not, such as the maximum difference
subset algorithm has provided a coherent framework to combine
the classical statistics and elements of machine learning [95]; and a
supervised machine learning approach (radial and linear support
vector machines) is designed to predict disease risk by genotypes
incorporating gene expression data and rare variants [96]; and an
extended computational method based on different machine194 Xiang-tian Yu et al.
