Computational Drug Discovery and Design

that do not appear substantially more frequently in active molecules
than in non-actives (are not discriminatory of activity), for example,
18-methyl, 19-methyl, 3-keto, or the presence of either the C4–C5
or C6–C7 double bond (“DB”), also have a low random forest
feature importance. Interestingly, the feature importance of
Sulfate-Ester is much less than the feature importance of Sulfur or
Sulfur-Oxygens, which may be because it is highly correlated with
the sulfur and sulfur oxygen matches in the sulfate group and
thereby, to some extent, redundant. An alternative explanation is
that the ester oxygen is less highly charged than the terminal sulfate
oxygens (causing it to make weaker hydrogen bonds) and is also less
accessible for interaction with the receptor.
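The redundancy explanation can be illustrated on synthetic data. The following is a minimal sketch, not the analysis performed in this chapter: the variable names, the made-up binary feature, and the forest settings (200 trees, max_features="sqrt") are all assumptions chosen for illustration. It fits one random forest on an informative feature plus a noise feature, and a second forest in which the informative feature appears twice, and then compares the feature importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n_samples = 500

# Synthetic (made-up) data: one binary feature that fully determines the
# label, and one continuous feature that is pure noise.
informative = rng.randint(0, 2, size=n_samples)
noise = rng.rand(n_samples)
y = informative


def fit_importances(X, y):
    """Fit a random forest and return its impurity-based feature importances."""
    forest = RandomForestClassifier(
        n_estimators=200, max_features="sqrt", random_state=0
    )
    forest.fit(X, y)
    return forest.feature_importances_


# Case 1: the informative feature appears once.
X_single = np.column_stack([informative, noise])
imp_single = fit_importances(X_single, y)

# Case 2: an exact copy of the informative feature is added, mimicking two
# perfectly correlated (redundant) features.
X_duplicated = np.column_stack([informative, informative, noise])
imp_duplicated = fit_importances(X_duplicated, y)

print("single:    ", dict(zip(["informative", "noise"], imp_single)))
print("duplicated:", dict(zip(["informative", "copy", "noise"], imp_duplicated)))
```

In the second forest the informative feature and its copy share the importance that the single feature received in the first forest, so each looks less important on its own even though the information it carries is unchanged. Real substructure features are rarely perfectly correlated, so the effect in practice is expected to be weaker than in this idealized case.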
The machine learning techniques presented in this chapter can be
used for any kind of data for which a set of feature values across a set of
objects is used to predict activity (or any observable value determined
by an experimental technique, e.g., solubility, selectivity, and
reactivity). We hope this chapter has whetted your appetite for machine
learning, which can be used to fit robust models that relate features of
interest to molecular activity and other observables. The code
provided here and on the corresponding website
(https://github.com/psa-lab/predicting-activity-by-machine-learning)
makes it possible for you to learn and then use these techniques in your
own research. For further information about machine learning, and to
carry out further explorations with prepared datasets or your own
data, we recommend the following tutorials and references: Raschka
and Mirjalili [34], Raschka et al. [35], Friedman et al. [36], Mueller
and Guido [37], and the scikit-learn online tutorials
(http://scikit-learn.org/stable/tutorial/index.html).

4 Notes



  1. For this section, we used a CSV file where the features and
    target variable (signal inhibition) were stored as columns
    separated by commas. Note that the read_csv function does not
    strictly require this input format. For instance, pandas's
    read_csv function supports other column delimiters
    (e.g., tabs and whitespace), which can be specified via the
    delimiter function argument. For more information about the
    read_csv function, please refer to the official documentation
    at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html.
    Furthermore, if you are planning to work with datasets where
    the features are stored as rows as opposed to columns, you can
    use the transpose method (df = df.transpose()) after loading a
    dataset to transpose the data frame index and columns.
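The two options described in this note can be sketched as follows. This is a minimal illustration under stated assumptions: the column names (feature_a, feature_b, signal_inhibition), the molecule labels (m1, m2), and all values are made up, and the data are read from in-memory strings rather than from the file used in this chapter.

```python
import io

import pandas as pd

# Hypothetical tab-separated data (made-up names and values); the column
# delimiter is specified via the delimiter argument of read_csv.
tsv_text = (
    "mol_id\tfeature_a\tfeature_b\tsignal_inhibition\n"
    "m1\t1\t0\t0.82\n"
    "m2\t0\t1\t0.10\n"
)
df = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")
print(df.shape)  # 2 molecules (rows) by 4 columns

# Hypothetical comma-separated data where each row is a feature and each
# column is a molecule.
rows_text = (
    "feature,m1,m2\n"
    "feature_a,1,0\n"
    "feature_b,0,1\n"
    "signal_inhibition,0.82,0.10\n"
)
df_rows = pd.read_csv(io.StringIO(rows_text), index_col=0)

# Transpose so that molecules become rows and features become columns.
df_cols = df_rows.transpose()
print(df_cols)
```

Passing index_col=0 in the second call keeps the feature names as the index, so that they become the column names after transposing.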


Inferring Activity Discriminants 331