Computational Drug Discovery and Design

that it is easy to interpret as a generalized linear model: The output
always depends on the sum of the inputs and model parameters.
However, note that sequential feature selection can be used with
many different machine learning algorithms for supervised learning
(classification and regression).
To introduce the main concept behind logistic regression,
which is a probabilistic model, we need to introduce the so-called
odds ratio first. The odds ratio computes the odds in favor of a
particular event E, which is defined as follows, based on the
probability p of a positive outcome (for instance, the probability
that a molecule is active):

$$\text{odds} = \frac{p}{1 - p}$$
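As a quick sanity check, the odds can be computed directly in plain Python. This is a minimal sketch; the probability value 0.75 is an arbitrary illustrative choice, not taken from the text:

```python
def odds(p):
    """Odds in favor of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

# Example: a molecule predicted active with probability 0.75
# has odds of 0.75 / 0.25 = 3 in favor of being active.
print(odds(0.75))  # 3.0
```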
Next, we define the logit function, which is the logarithm of the
odds ratio:

$$\text{logit}(p) = \log \frac{p}{1 - p}$$
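The logit then follows by taking the natural logarithm of the odds; a small sketch using only the standard library (the probability values below are illustrative assumptions):

```python
import math

def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))

print(logit(0.5))       # 0.0 -- even odds map to a log-odds of zero
print(logit(0.9) > 0)   # True -- probabilities above 0.5 give positive log-odds
```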
The logit function takes values in the range 0–1 (the probability
p) and transforms them into real numbers that describe the relation-
ship between the functional group matching patterns, multiplied
by weight coefficients (that need to be learned), and the odds that
a given molecule is active:

$$\text{logit}\big(p(y = 1 \mid \mathbf{x})\big) = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + b = \sum_{i=1}^{m} w_i x_i + b$$

Here, m is the number of input features (functional group
matches, x), i is an index over those features, w refers to the
weight parameters of the parametric logistic regression model, and
b refers to the y-axis intercept (typically referred to as the bias or
bias unit in the literature). The input to the logit function,
p(y = 1 | x), is the conditional probability that a particular
molecule is active, given its functional group matches x.
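The weighted net input on the right-hand side of the equation above can be sketched in a few lines of Python; the feature vector, weights, and bias below are hypothetical values chosen for illustration, not taken from the chapter:

```python
# Hypothetical functional-group match vector (1.0 = pattern present, 0.0 = absent)
x = [1.0, 0.0, 1.0]
# Hypothetical learned weights and bias (y-axis intercept)
w = [0.5, -1.0, 1.5]
b = -1.0

# Net input: w_1*x_1 + ... + w_m*x_m + b
z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
print(z)  # 1.0
```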
However, since we are interested in modeling the probability
that a given molecule is active, we need to compute the function
inverse φ of the logit function, which we can compute as:

$$\phi(z) = \frac{1}{1 + e^{-z}}$$
Here, z is a placeholder variable defined as follows:

$$z = \sum_{i=1}^{m} w_i x_i + b$$
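Putting the pieces together, φ converts the net input z back into a probability, and applying it to the log-odds recovers the original p. A minimal standard-library sketch (the probability 0.75 is an illustrative assumption):

```python
import math

def sigmoid(z):
    """Inverse of the logit: maps a real-valued net input to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))  # 0.5 -- a net input of zero corresponds to even odds

# phi inverts the logit: sigmoid(log(p / (1 - p))) recovers p (up to rounding)
p = 0.75
print(abs(sigmoid(math.log(p / (1.0 - p))) - p) < 1e-12)  # True
```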

326 Sebastian Raschka et al.
