

3.5 Sequential Feature Selection with Logistic Regression


As an alternative approach and to probe the robustness of our
conclusions, we will apply a Sequential Backward Selection (SBS)
algorithm combined with logistic regression [32] for the classifica-
tion of active versus non-active compounds. SBS is a model-agnos-
tic feature selection algorithm that evaluates different combinations
of features, shrinking the subset of features to be considered one by
one. Here, model-agnostic refers to the fact that SBS can be com-
bined with any machine learning algorithm for classification or
regression.
In general, sequential feature selection algorithms are greedy
search algorithms that reduce the d-dimensional feature space to a
smaller k-dimensional subspace, where k < d. The sequential feature
selection approach selects the best-performing feature subsets
automatically and can help optimize two objectives: improving the
computational efficiency of a model and reducing its generalization
error by removing irrelevant features.
The SBS algorithm removes features from the initial feature
subset sequentially until a new, reduced feature subspace contains a
specified number of features. To determine which feature to remove
at each iteration of the SBS algorithm, we need to define a criterion
function J, which is to be minimized. For instance, this criterion
function can be defined as the difference between the performance of
the model before and after the feature removal. In other words, at
each iteration of the algorithm, the feature whose removal results in
the least performance loss (or the highest performance gain) is
eliminated. This removal of features is repeated at each iteration
until the desired, pre-specified size of the feature subset is
reached. More formally, we can express the SBS algorithm in the
following pseudo-code notation, adapted from [30]; a minimal Python
sketch follows the list:


  1. Initialize the algorithm with k = d, where d is the dimensionality
     of the full feature space X_d.

  2. Determine the feature x⁻ that maximizes the criterion:
     x⁻ = argmax J(X_k − x), where x ∈ X_k.

  3. Remove the feature x⁻ from the feature set: X_{k−1} = X_k − x⁻;
     k = k − 1.

  4. Terminate if k equals the number of desired features; otherwise,
     go to step 2.
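
To make the pseudo-code concrete, the following is a minimal Python
sketch of SBS wrapped around a scikit-learn logistic regression
classifier. The class name SBS, the use of validation-set accuracy as
the criterion J, and the breast-cancer demonstration dataset are
illustrative assumptions for this sketch, not the chapter's own code:

```python
from itertools import combinations

from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class SBS:
    """Sequential Backward Selection down to k_features features."""

    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 random_state=1):
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.scoring = scoring
        self.random_state = random_state

    def fit(self, X, y):
        # Internal hold-out split: the validation score plays the role
        # of the criterion function J.
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.25, random_state=self.random_state,
            stratify=y)

        dim = X_tr.shape[1]
        self.indices_ = tuple(range(dim))        # step 1: k = d
        self.subsets_ = [self.indices_]
        self.scores_ = [self._score(X_tr, y_tr, X_va, y_va, self.indices_)]

        while dim > self.k_features:             # step 4: stop at k
            # Step 2: evaluate J for every candidate subset of size k - 1.
            results = [(self._score(X_tr, y_tr, X_va, y_va, p), p)
                       for p in combinations(self.indices_, r=dim - 1)]
            best_score, best_subset = max(results, key=lambda r: r[0])
            self.indices_ = best_subset          # step 3: drop one feature
            dim -= 1
            self.subsets_.append(self.indices_)
            self.scores_.append(best_score)
        return self

    def _score(self, X_tr, y_tr, X_va, y_va, indices):
        self.estimator.fit(X_tr[:, indices], y_tr)
        return self.scoring(y_va, self.estimator.predict(X_va[:, indices]))


if __name__ == "__main__":
    # Stand-in dataset; in the chapter's setting, X would hold functional
    # group matching patterns and y the active/non-active labels.
    X, y = load_breast_cancer(return_X_y=True)
    lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    sbs = SBS(lr, k_features=5).fit(X, y)
    print("selected feature indices:", sbs.indices_)
    print("validation accuracy:", round(sbs.scores_[-1], 3))
```

Note that the while-loop evaluates all k candidate subsets of size
k − 1 at each iteration, exactly as step 2 of the pseudo-code
prescribes; the subsets_ and scores_ attributes record the full
elimination trajectory so that the performance of each intermediate
subset size can be inspected afterwards.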
The reason why we chose sequential feature selection to deduce
functional group matching patterns that are predictive of active and
non-active molecules is that it is an intuitive method that has been
shown to produce accurate and robust results (see Note 12). For more
information on sequential feature selection, please read [17].
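
As a practical aside (an illustrative addition, not part of the
original text), ready-made sequential selectors are also available in
common libraries; for instance, scikit-learn's
SequentialFeatureSelector can run backward selection with logistic
regression, although it scores candidate subsets by cross-validation
rather than by the exact criterion J defined above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in compound dataset

# Backward selection: start from all d features and greedily drop one
# feature at a time until only 5 remain, keeping at each step the
# subset with the best 5-fold cross-validated accuracy.
sfs = SequentialFeatureSelector(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    n_features_to_select=5,
    direction="backward",
    scoring="accuracy",
    cv=5,
)
sfs.fit(X, y)
print("selected feature mask:", sfs.get_support())
```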
    Logistic regression is one of the most widely used classification
    algorithms in academia and industry. One of the reasons why
    logistic regression is a popular choice for predictive modeling is

