Learning algorithms have difficulty in learning the concepts of the minority class (drug targets) as compared to the majority class (nondrug targets) (see Note 1), so the generated classifier behaves like a majority class classifier. Two types of approaches can be applied to handle the class imbalance problem: (1) intrinsic approaches and (2) extrinsic approaches. Intrinsic approaches involve adapting the machine learning algorithm itself to handle class imbalance, while extrinsic approaches are algorithm independent and involve changes to the dataset; a popular example is SMOTE [37] and its variants. Known drug targets are considered the positive class, while proteins that are not confirmed as targets are considered nondrug targets (negative class). Likewise, confirmed/known drug–target pairs are considered positive examples and unknown drug–target pairs are considered negative examples.
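As an illustration of the extrinsic route, the following minimal sketch applies SMOTE from the imbalanced-learn package to oversample the minority (drug target) class before training. The toy feature matrix, class ratio, and choice of random forest classifier are illustrative assumptions, standing in for whatever sequence-derived features and learner are actually used.

```python
# A minimal sketch of extrinsic class-imbalance handling with SMOTE
# (imbalanced-learn). The data and classifier here are toy placeholders.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data: 90 negative (nondrug target) vs. 15 positive (drug target) examples
rng = np.random.default_rng(0)
X = rng.normal(size=(105, 20))
y = np.array([0] * 90 + [1] * 15)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Oversample only the training set, so the test set keeps the true distribution
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("balanced training labels:", np.bincount(y_res))
print("test accuracy:", clf.score(X_test, y_test))
```

Note that resampling is applied after the train/test split: oversampling before splitting would leak synthetic copies of test-set neighbors into training.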

3.2 Feature Engineering


Machine learning algorithms require a fixed-length representation of the protein sequences. Many packages and web servers are now available that can calculate various measurable properties from protein sequences. In [9], the authors calculated amino acid composition, dipeptide composition, and amino acid property group composition to discriminate between human drug targets and nontargets. Apart from simple sequence features, more sophisticated features such as pseudo amino acid composition can also be calculated with the various programs mentioned in Subheading 2.
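As a concrete example, amino acid composition is simply the fraction of each of the 20 standard residues in a sequence, yielding a fixed-length vector of 20 features regardless of sequence length. The short sketch below is an illustrative implementation, not the exact code used in [9]; the example peptide is hypothetical.

```python
# A minimal sketch: amino acid composition as a fixed-length
# (20-dimensional) feature vector for a protein sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    seq = sequence.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

# Example on a short, hypothetical peptide
features = aa_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(features))   # 20 features, independent of sequence length
print(features[:5])    # fractions for A, C, D, E, F
```

Dipeptide composition follows the same idea over all 400 ordered residue pairs, giving a 400-dimensional fixed-length vector.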

3.3 Feature Selection

Certain learning algorithms may perform worse when learning concepts to discriminate between classes from high-dimensional data; this situation is known as the "curse of dimensionality." The aim of feature selection algorithms is to select a minimum set of features while achieving maximum classification accuracy (see Note 2). Using a reduced feature set provides many advantages, such as reduced training time, model simplification, and reduction in overfitting. Feature selection can be stated as follows: given a set of predicted/calculated features (attributes) F and a target variable (class) C, find a minimum subset f of F that achieves maximum classification accuracy for C. In the majority of cases, feature selection improves the performance of the classification algorithm.

There are two main approaches for feature selection: (1) the wrapper approach and (2) the filter approach. In the wrapper approach, one keeps adding features using a certain classifier until no further improvement can be achieved (forward search), or one starts with the full feature set and keeps removing one feature at a time until no further improvement is recorded (backward elimination). Alternatively, one can combine both adding and removing features in two phases. Filter methods are independent of the classifier and instead use the association of each feature with the target class for feature selection [38].
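The sketch below illustrates both routes with scikit-learn: a filter method (univariate ANOVA F-test via SelectKBest, classifier independent) and a wrapper method (greedy forward search around a k-nearest-neighbor classifier via SequentialFeatureSelector). The synthetic dataset, feature counts, and choice of classifier are illustrative assumptions only.

```python
# A minimal sketch of filter vs. wrapper feature selection with
# scikit-learn; dataset, classifier, and feature counts are toy choices.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, f_classif,
                                       SequentialFeatureSelector)
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# Filter approach: rank features by their association with the class
# (ANOVA F-test), independent of any classifier.
filt = SelectKBest(f_classif, k=5).fit(X, y)
print("filter picked:", filt.get_support(indices=True))

# Wrapper approach: forward search, greedily adding the feature that
# most improves the cross-validated accuracy of the chosen classifier.
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward").fit(X, y)
print("wrapper picked:", sfs.get_support(indices=True))
```

Because the wrapper retrains the classifier for every candidate feature, it is far more expensive than the filter but can capture feature interactions that univariate scoring misses.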

