Computational Systems Biology Methods and Protocols.7z

and Fukui reactivity indices, were selected to build the prediction models. The GA-SVM model outperformed the linear models in refs. [37, 38]. However, the relationship between the toxicity values and the selected descriptors is not explicit because of the nature of the SVM model. Lei et al. employed seven machine learning methods, including SVM, relevance vector machine (RVM), k-nearest neighbor (kNN), random forest (RF), local approximate Gaussian process [39, 40], multilayer perceptron ensemble [41], and eXtreme gradient boosting [42], to predict acute oral toxicity in rats based on 7314 diverse compounds [43]. RVM, which is a sparse Bayesian learning algorithm developed from the standard SVM [44, 45], showed better prediction ability than the other models. Furthermore, the authors identified the descriptors and fragments important for acute toxicity by using multiple statistical methods. For example, one-dimensional sensitivity analysis indicated that descriptors associated with molecular polarity, molecular reactivity, and intramolecular interactions contributed more to acute toxicity than other descriptors. The change in adjusted R^2 in the stepwise regression and Cramer’s V coefficient demonstrated that nine fragments, such as trifluoromethyl and heterocyclic, made positive contributions to high pLD50, whereas four fragments, such as the count of nitrogen atoms and the carbon-nitrogen double bond, had the opposite effect. The analyses of descriptors and fragments based on such a large and structurally diverse data set can provide guidance for designing drug candidates with lower toxicity.

An effective strategy to improve the prediction accuracy of models for chemically diverse data sets is to divide the data set into subsets based on structural features or mechanisms and build a local model for each subset.
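The local-modeling strategy described above can be illustrated with a minimal sketch. It is not the method of any cited study: the subset labels ("aromatic", "aliphatic"), the single descriptor, the toy toxicity values, and the function names `fit_line`, `fit_local_models`, and `predict` are all hypothetical and chosen only for illustration. Each subset gets its own one-descriptor least-squares line, and a query compound is routed to the model of its subset.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one descriptor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # No spread in the descriptor: fall back to predicting the mean.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def fit_local_models(records):
    """Fit one line per subset.

    records: iterable of (subset_label, descriptor, toxicity) tuples.
    Returns a dict mapping subset_label -> (slope, intercept).
    """
    groups = defaultdict(list)
    for label, x, y in records:
        groups[label].append((x, y))
    return {
        label: fit_line([x for x, _ in pts], [y for _, y in pts])
        for label, pts in groups.items()
    }


def predict(models, label, x):
    """Predict with the local model of the subset the query belongs to."""
    a, b = models[label]
    return a * x + b


# Hypothetical toy data: the descriptor raises toxicity in one subset
# and lowers it in the other, so the two trends cancel in a pooled fit.
training = [
    ("aromatic", 1.0, 2.0), ("aromatic", 2.0, 3.0), ("aromatic", 3.0, 4.0),
    ("aliphatic", 1.0, 5.0), ("aliphatic", 2.0, 4.0), ("aliphatic", 3.0, 3.0),
]

local_models = fit_local_models(training)
global_model = fit_line([x for _, x, _ in training], [y for _, _, y in training])

print(predict(local_models, "aromatic", 4.0))
print(predict(local_models, "aliphatic", 4.0))
print(global_model)
```

In this toy set the pooled (global) fit has zero slope because the opposite trends in the two subsets cancel, while each local model recovers the trend within its own subset. Real data sets would use many descriptors and a subset assignment derived from structural features or mechanisms, which this sketch simply assumes is given.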
For example, the kNN algorithm [46, 47], following the idea that “structurally similar chemicals are likely to have similar properties” [48], extracts the k nearest neighbors of the query compound from the training set and explores local structure-activity relationships using these k neighbors instead of the global data set. Zhu et al. employed multiple machine learning approaches, including kNN, RF [49], hierarchical clustering (HC) [50], nearest neighbor, and FDA MDL QSAR [51], to develop prediction models based on 7385 compounds. To eliminate outliers, distance-based methods [52–54] were used to define the applicability domain of the prediction models. The statistical results indicated that the kNN and RF models yielded good R^2 and low MAE, but at the expense of low coverage of the test set (19%). Moreover, the authors built a consensus model, in which the predicted toxicity of each compound equals the arithmetic mean of the values predicted by the individual models, to reduce

250 Jing Lu et al.
