Computational Drug Discovery and Design


Fig. 11 Binary classification tree separating active from non-active compounds. After importing the tree
submodule from the scikit-learn machine learning library, the first line of code initializes a new
DecisionTreeClassifier object, which then learns the decision rules from the functional group matching
pattern array (X) and the discretized response variable (binary labels of the active and non-active molecules,
y_binary) by calling the fit method. The last three lines of code then export the fitted decision tree as a
PDF image, which is shown here. The first node at the top of the tree, for example, uses a decision rule asking
which molecules in the 56-molecule dataset (44 actives and 12 non-actives) match a sulfur group in DKPES.
Note that this question is posed as a conditional (true/false) statement “Molecules do not contain a sulfur
group match,” due to the implementation of the decision tree in scikit-learn. The molecules for which the
condition is “False”—that is, molecules that do match the sulfur group in DKPES—are then passed to the
child node on the right (here: 4 non-actives and 11 actives), where the next conditional statement is
“Molecules do not contain a ‘Sulfate-Ester’ match.” Each node in the tree contains the impurity measure
after the split (Gini impurity), reflecting the degree of separation between active and non-active compounds; a
Gini impurity value of 0 reflects a set containing purely active or non-active compounds. The number of
samples refers to the compounds at each node that pass the filtering criteria. The first value within brackets in
the bottom row in each terminal node denotes the number of non-active compounds at that node, and the
second number denotes the number of active compounds. Highlighted with an asterisk is the terminal node
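
The workflow described in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' original code: the small X and y_binary arrays are hypothetical toy stand-ins for the real functional group matching data, the column labels "Sulfur" and "Sulfate-Ester" are assumed names, and the helper gini_impurity is introduced here only to show how the impurity values in the nodes arise from the class counts quoted above.

```python
import numpy as np
from sklearn import tree


def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


# Root node from the caption: 12 non-actives and 44 actives.
root_gini = gini_impurity([12, 44])
# Right child node (sulfur match): 4 non-actives and 11 actives.
child_gini = gini_impurity([4, 11])
print(f"root: {root_gini:.3f}, sulfur-match child: {child_gini:.3f}")

# Hypothetical toy data (NOT the DKPES dataset): each row is a molecule,
# each column marks whether a functional group is matched (1) or not (0).
X = np.array([[1, 0],
              [1, 1],
              [0, 0],
              [0, 1],
              [1, 0],
              [0, 0]])
# Binary labels: 1 = active, 0 = non-active.
y_binary = np.array([1, 1, 0, 0, 1, 0])

# Initialize the classifier and learn the decision rules from X and y_binary.
clf = tree.DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y_binary)

# Export the fitted tree as Graphviz DOT source; with out_file=None the
# function returns the DOT text as a string instead of writing a file.
dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=["Sulfur", "Sulfate-Ester"],   # assumed column labels
    class_names=["non-active", "active"],        # ordered by label: 0, then 1
)
```

The DOT string can then be rendered to a PDF with Graphviz, for example by saving it to a file and running the dot command-line tool with the -Tpdf option. Because scikit-learn lists class counts in ascending label order, the bracketed values in each node read as non-actives first and actives second, which matches the ordering described in the caption.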


322 Sebastian Raschka et al.
