Computational Systems Biology Methods and Protocols.7z

genomic TU organization, and a machine learning approach is applied to predict the genomic boundaries of individual TUs [91]. Then, in addition to statistical model-based algorithms for in-depth investigation of next-generation sequencing of cancer genomes, the machine learning approaches (e.g., SNooPer based on random forest classification models) have been developed to accurately call somatic variants in low-depth, whole-exome, or whole-genome sequencing data [39]. With such accurate annota- tions, large-scale human genetic variation data can be obtained. For example, the single nucleotide polymorphisms (SNPs) are an important source of human genome variability and greatly contrib- ute to human complex diseases, especially the amino acid mutations resulting from non-synonymous SNPs in coding regions; and the machine learning approach (e.g., support vector machine) has been used to predict cancer driver missense variants by training on cancer-causing variants and neutral polymorphisms with equal sam- ple number [92]. To further detect the positive selection in those genomic regions as a natural population genetic study, a machine learning classification framework has been implemented to combine selection tests to detect the features of polymorphism in hard sweeping with controls on population-specific demography [45]. On the other hand, the high-throughput sequencing also allows researchers to examine more details on the transcriptome or other omics level than ever before (seeNote 2), and a key of applying machine learning for such omics data is feature selection, i.e., to reduce the original high-dimensional omics data into a low-dimensional feature data. The CoRAL (Classification of RNAs by Analysis of Length) is a computational method for dis- criminating different classes of RNA, whose selected features are relevant to small RNA biogenesis pathways [38]. The RGIFE (Rule-guided Iterative Feature Elimination) is a heuristic method to select very small set of features by rule-based machine learning with balance on the objective of minimal features and high predic- tive power [93]. Based on the widely usedk-top scoring pair (k- TSP) algorithm, the integration ofk-TSP with other machine learning methods (e.g., multivariate classifiers such as SVM) would be a feature selector to tune certain data characteristics, i.e., correlations among informative genes [94]. More practical, the clinical application of omics data will ask for marker genes whose expression patterns will be sufficient to accurately predict the disease or not, such as the maximum difference subset algorithm has provided a coherent framework to combine the classical statistics and elements of machine learning [95]; and a supervised machine learning approach (radial and linear support vector machines) is designed to predict disease risk by genotypes incorporating gene expression data and rare variants [96]; and an extended computational method based on different machine

194 Xiang-tian Yu et al.

Computational Systems Biology Methods and Protocols.7z

Get our desktop app

Company

Features

Documentation

Resources