Computational Drug Discovery and Design

that do not appear substantially more frequently in active molecules than in non-actives (are not discriminatory of activity), for example, 18-methyl, 19-methyl, 3-keto, or the presence of either the C4–C5 or C6–C7 double bond (“DB”), also have a low random forest feature importance. Interestingly, the feature importance of Sulfate-Ester is much less than the feature importance of Sulfur or Sulfur-Oxygens, which may be because it is highly correlated with the sulfur and sulfur oxygen matches in the sulfate group and thereby, to some extent, redundant. An alternative explanation is that the ester oxygen is less highly charged than the terminal sulfate oxygens (causing it to make weaker hydrogen bonds) and is also less accessible for interaction with the receptor.

The machine learning techniques presented in this chapter can be used for any kind of data for which a set of feature values across a set of objects is used to predict activity (or any observable value determined by an experimental technique, e.g., solubility, selectivity, and reactivity). We hope this chapter has whetted your appetite for machine learning, which can be used to fit robust models that relate features of interest to molecular activity and other observables. The code provided here and on the corresponding website (https://github.com/psa-lab/predicting-activity-by-machine-learning) makes it possible for you to learn and then use these techniques in your own research. For further information about machine learning, and to carry out further explorations with prepared datasets or your own data, we recommend the following tutorials and references: Raschka and Mirjalili [34], Raschka et al. [35], Friedman et al. [36], Mueller and Guido [37], and the scikit-learn online tutorials (http://scikit-learn.org/stable/tutorial/index.html).
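The redundancy explanation offered above for the low Sulfate-Ester importance can be illustrated with a small synthetic experiment. The following is only a sketch: it assumes scikit-learn's RandomForestClassifier and its impurity-based feature_importances_ attribute, and the three binary features (informative, redundant, noise) are hypothetical stand-ins generated at random, not the chapter's substructure data. It suggests how the importance credited to a feature can drop once a highly correlated copy of it is also available to the forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n_samples = 1000

# Hypothetical binary features (illustrative only, not the chapter's dataset).
informative = rng.randint(0, 2, size=n_samples)

# A near-duplicate of the informative feature: 5% of its values are flipped.
redundant = informative.copy()
flip = rng.rand(n_samples) < 0.05
redundant[flip] = 1 - redundant[flip]

# A feature unrelated to the activity label.
noise = rng.randint(0, 2, size=n_samples)

# Synthetic activity label: follows the informative feature with 10% label noise.
y = informative.copy()
label_flip = rng.rand(n_samples) < 0.10
y[label_flip] = 1 - y[label_flip]

# Fit one forest without and one with the correlated feature.
X_alone = np.column_stack([informative, noise])
X_both = np.column_stack([informative, redundant, noise])

forest_alone = RandomForestClassifier(n_estimators=200, random_state=0)
forest_alone.fit(X_alone, y)
forest_both = RandomForestClassifier(n_estimators=200, random_state=0)
forest_both.fit(X_both, y)

# Impurity-based importances; each array is normalized to sum to 1.
imp_alone = forest_alone.feature_importances_
imp_both = forest_both.feature_importances_

print("without correlated copy:", dict(zip(["informative", "noise"], imp_alone)))
print("with correlated copy:   ",
      dict(zip(["informative", "redundant", "noise"], imp_both)))
```

In this setup the importance assigned to the informative feature is shared with its correlated copy, so a low importance value alone does not show that a feature is unrelated to activity.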

4 Notes

For this section, we used a CSV file where the features and
target variable (signal inhibition) were stored as columns
separated by commas. Note that the read_csv function does not
strictly require this input format. For instance, pandas's
read_csv function supports other column delimiters
(e.g., tabs and whitespace), which can be specified via the
delimiter function argument. For more information about the
read_csv function, please refer to the official documentation
at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html.
Furthermore, if you are planning
to work with datasets where the features are stored as rows as
opposed to columns, you can use the transpose method
(df = df.transpose()) after loading a dataset to transpose
the data frame index and columns.
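As a minimal sketch of the two points in this note, the example below reads tab-delimited and whitespace-delimited text with read_csv and then transposes a dataset whose features are stored as rows. The in-memory strings stand in for files on disk, and the column names (feature_a, feature_b, signal_inhibition) and molecule labels (mol_1, mol_2) are hypothetical, chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-delimited content standing in for a file on disk.
tsv_text = (
    "feature_a\tfeature_b\tsignal_inhibition\n"
    "1\t0\t0.82\n"
    "0\t1\t0.15\n"
)

# The delimiter argument (an alias of sep) selects the column separator.
df_tab = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")

# Runs of whitespace can be matched with a regular expression separator.
ws_text = (
    "feature_a feature_b   signal_inhibition\n"
    "1         0           0.82\n"
    "0         1           0.15\n"
)
df_ws = pd.read_csv(io.StringIO(ws_text), sep=r"\s+")

# A dataset stored with features as rows; the first column holds their names.
rows_text = (
    "name,mol_1,mol_2\n"
    "feature_a,1,0\n"
    "feature_b,0,1\n"
    "signal_inhibition,0.82,0.15\n"
)
df_rows = pd.read_csv(io.StringIO(rows_text), index_col=0)

# Transposing swaps index and columns, so each molecule becomes a row.
df = df_rows.transpose()

print(df_tab)
print(df_ws)
print(df)
```

Using index_col=0 when loading the row-oriented data keeps the feature names as the index, so that they become the column labels after the transpose.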

Inferring Activity Discriminants 331
