and Fukui reactivity indices, were selected to build the prediction
models. The GA-SVM model outperformed the linear
models of refs. [37, 38]. However, the relationship between the
toxicity values and the selected descriptors is not explicit because of
the nature of the SVM model.
Lei et al. employed seven machine learning methods, including
SVM, relevance vector machine (RVM), k-nearest neighbor (kNN),
random forest (RF), local approximate Gaussian process [39, 40],
multilayer perceptron ensemble [41], and eXtreme gradient boost-
ing [42], to predict acute oral toxicity in rats based on 7314 diverse
compounds [43]. RVM, which is a sparse Bayesian learning algo-
rithm developed from the standard SVM [44, 45], showed better
prediction ability than the other models. Furthermore, the authors
identified the descriptors and fragments most important for acute
toxicity by using multiple statistical methods. For example,
one-dimensional sensitivity analysis indicated that descriptors
associated with molecular polarity, molecular reactivity, and
intramolecular interactions contributed more to acute toxicity than
other descriptors.
The change in adjusted R^2 in the stepwise regression and Cramer’s V
coefficient demonstrated that nine fragments, such as trifluoromethyl
and heterocyclic groups, made positive contributions to high pLD50,
whereas four fragments, such as the count of nitrogen atoms and the
carbon-nitrogen double bond, had the opposite effect. The analyses of
descriptors and fragments based on such a large and structurally
diverse data set can provide guidance for designing drug
candidates with lower toxicity.
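As a rough illustration of how a Cramer’s V coefficient can relate a fragment to a toxicity class, the sketch below computes it from a contingency table. It is not the authors' procedure: the binary split into "fragment present/absent" versus "high/low pLD50" and the toy counts are assumptions made only for this example, and the function name cramers_v is hypothetical. The chi-square statistic is computed without continuity correction.

```python
import math
from typing import Sequence


def cramers_v(table: Sequence[Sequence[float]]) -> float:
    """Cramer's V for an r x c contingency table of counts.

    Assumes every row and every column has a nonzero total, so that
    all expected counts are positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalize by sample size and by the smaller table dimension minus one
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Toy counts (assumed): rows = fragment present / absent,
# columns = high pLD50 / low pLD50.
toy_table = [[30, 10],
             [10, 30]]
print(cramers_v(toy_table))
```

Cramer’s V ranges from 0 (no association) to 1 (perfect association) and carries no sign, so the direction of a fragment's effect has to be read from the table itself or from a separate analysis such as the regression coefficients.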
An effective strategy to improve the prediction accuracy of the
models for chemically diverse data sets is to divide the data set into
subsets based on structural features or mechanisms and build
a local model for each subset. For example, the kNN algorithm
[46, 47], following the idea that “structurally similar chemicals
are likely to have similar properties” [48], extracts the k nearest
neighbors of the query compound from the training set and explores
local structure-activity relationships using these k neighbors
instead of the global data set.
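The kNN idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: compounds are represented as sets of fingerprint bit indices, similarity is the Tanimoto coefficient, the prediction is a similarity-weighted mean of the neighbors' values, and the toy data and function names are hypothetical.

```python
from typing import List, Set, Tuple

Compound = Tuple[Set[int], float]  # (fingerprint bits, toxicity value such as pLD50)


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_predict(query: Set[int], training: List[Compound], k: int = 3) -> float:
    """Predict a value for the query from its k most similar training compounds."""
    # Rank training compounds by similarity to the query, most similar first
    ranked = sorted(training, key=lambda c: tanimoto(query, c[0]), reverse=True)
    neighbors = ranked[:k]

    weights = [tanimoto(query, bits) for bits, _ in neighbors]
    total = sum(weights)
    if total == 0.0:
        # No structural overlap with any neighbor: fall back to the plain mean
        return sum(value for _, value in neighbors) / len(neighbors)

    # Similarity-weighted mean of the neighbors' values
    return sum(w * value for w, (_, value) in zip(weights, neighbors)) / total


# Toy training set (assumed values, for illustration only)
training_set: List[Compound] = [
    ({1, 2, 3}, 2.0),
    ({1, 2, 4}, 2.4),
    ({7, 8, 9}, 4.0),
    ({1, 3, 5}, 1.8),
]
print(knn_predict({1, 2, 3}, training_set, k=3))
```

Only the k neighbors influence the prediction, so the dissimilar compound with bits {7, 8, 9} has no effect on the result for this query; this is the sense in which kNN builds a local rather than a global model.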
Zhu et al. employed multiple machine learning approaches,
including kNN, RF [49], hierarchical clustering (HC) [50], nearest
neighbor, and FDA MDL QSAR [51], to develop prediction models
based on 7385 compounds. To eliminate outliers, distance-based
methods [52–54] were used to define the applicability domain of the
prediction models. The statistical results indicated that the kNN and
RF models yielded good R^2 and low MAE, but at the expense of low
coverage of the test set (19%).
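The trade-off between accuracy and coverage can be made concrete with a small sketch of a distance-based applicability domain. The rule used here is one common choice and is an assumption, not necessarily the method of refs. [52–54]: a query compound is inside the domain if its Euclidean distance to the nearest training compound does not exceed the mean plus z standard deviations of the nearest-neighbor distances within the training set. The descriptor vectors, the value of z, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_distance(x: Vector, others: List[Vector]) -> float:
    """Distance from x to its closest compound in `others`."""
    return min(euclidean(x, o) for o in others)


def domain_threshold(train: List[Vector], z: float = 0.5) -> float:
    """Distance cutoff derived from the training set itself."""
    # Nearest-neighbor distance of each training compound, excluding itself
    dists = [nearest_distance(x, train[:i] + train[i + 1:])
             for i, x in enumerate(train)]
    return mean(dists) + z * pstdev(dists)


def coverage(train: List[Vector], test: List[Vector],
             z: float = 0.5) -> Tuple[float, List[bool]]:
    """Fraction of test compounds inside the domain, plus per-compound flags."""
    cutoff = domain_threshold(train, z)
    inside = [nearest_distance(x, train) <= cutoff for x in test]
    return sum(inside) / len(test), inside


# Toy descriptor vectors (assumed, for illustration only)
train_set: List[Vector] = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
test_set: List[Vector] = [(0.5, 0.5), (5.0, 5.0)]
print(coverage(train_set, test_set))
```

Lowering z tightens the domain: predictions are then reported only for compounds close to the training set, which tends to improve the error statistics while reducing the fraction of test compounds that receive a prediction.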
Moreover, the authors built a consensus model, in which the
predicted toxicity for each compound equals the arithmetic
mean of the values predicted by the individual models, to reduce