Computational Drug Discovery and Design

that it is easy to interpret as a generalized linear model: The output
always depends on the sum of the inputs and model parameters.
However, note that sequential feature selection can be used with
many different machine learning algorithms for supervised learning
(classification and regression).
To introduce the main concept behind logistic regression,
which is a probabilistic model, we need to introduce the so-called
odds ratio first. The odds ratio computes the odds in favor of a
particular event E, which is defined as follows, based on the
probability p of a positive outcome (for instance, the probability
that a molecule is active):

$$\text{odds} = \frac{p}{1 - p}$$
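As a quick sanity check, the odds can be computed directly in plain Python. This is a minimal sketch; the probability value 0.75 is an arbitrary illustrative choice, not taken from the text:

```python
def odds(p):
    """Odds in favor of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

# Example: a molecule predicted active with probability 0.75
# has odds of 0.75 / 0.25 = 3 in favor of being active.
print(odds(0.75))  # 3.0
```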
Next, we define the logit function, which is the logarithm of the
odds ratio:

$$\text{logit}(p) = \log \frac{p}{1 - p}$$
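The logit then follows by taking the natural logarithm of the odds; a small sketch using only the standard library (the probability values below are illustrative assumptions):

```python
import math

def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))

print(logit(0.5))       # 0.0 -- even odds map to a log-odds of zero
print(logit(0.9) > 0)   # True -- probabilities above 0.5 give positive log-odds
```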
The logit function takes values in the range 0–1 (the probability
p) and transforms them into real numbers that describe the relation-
ship between the functional group matching patterns, multiplied
by weight coefficients (that need to be learned), and the odds that
a given molecule is active:

$$\text{logit}\big(p(y = 1 \mid \mathbf{x})\big) = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + b = \sum_{i=1}^{m} w_i x_i + b$$

Here, m is the number of input features (functional group
matches, x), i is an index over those features, w refers to the
weight parameters of the parametric logistic regression model, and
b refers to the y-axis intercept (typically referred to as the bias or
bias unit in the literature). The input to the logit function,
p(y = 1 | x), is the conditional probability that a particular
molecule is active, given its functional group matches x.
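The weighted net input on the right-hand side of the equation above can be sketched in a few lines of Python; the feature vector, weights, and bias below are hypothetical values chosen for illustration, not taken from the chapter:

```python
# Hypothetical functional-group match vector (1.0 = pattern present, 0.0 = absent)
x = [1.0, 0.0, 1.0]
# Hypothetical learned weights and bias (y-axis intercept)
w = [0.5, -1.0, 1.5]
b = -1.0

# Net input: w_1*x_1 + ... + w_m*x_m + b
z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
print(z)  # 1.0
```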
However, since we are interested in modeling the probability
that a given molecule is active, we need to compute the function
inverse φ of the logit function, which we can compute as:

$$\phi(z) = \frac{1}{1 + e^{-z}}$$
Here, z is a placeholder variable defined as follows:

$$z = \sum_{i=1}^{m} w_i x_i + b$$
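Putting the pieces together, φ converts the net input z back into a probability, and applying it to the log-odds recovers the original p. A minimal standard-library sketch (the probability 0.75 is an illustrative assumption):

```python
import math

def sigmoid(z):
    """Inverse of the logit: maps a real-valued net input to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))  # 0.5 -- a net input of zero corresponds to even odds

# phi inverts the logit: sigmoid(log(p / (1 - p))) recovers p (up to rounding)
p = 0.75
print(abs(sigmoid(math.log(p / (1.0 - p))) - p) < 1e-12)  # True
```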

326 Sebastian Raschka et al.
