Learning algorithms have difficulty in learning the concepts of the minority class (drug targets) as compared to the majority class (nondrug targets) (see Note 1), so the generated classifier behaves like a majority class classifier. Two types of approaches can be applied to handle the class imbalance problem: (1) intrinsic approaches and (2) extrinsic approaches. Intrinsic approaches involve adapting the machine learning algorithm itself to handle class imbalance, while extrinsic approaches are algorithm independent and involve changes to the dataset; a popular example is SMOTE [37] and its variants. Known drug targets are considered the positive class, while proteins that are not confirmed as targets are considered nondrug targets (negative class). Likewise, confirmed/known drug–target pairs are considered positive examples and unknown drug–target pairs are considered negative examples.
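As an illustration of the extrinsic route, the following minimal sketch applies SMOTE from the imbalanced-learn package to oversample the minority (drug target) class before training. The toy feature matrix, class ratio, and choice of random forest classifier are illustrative assumptions, standing in for whatever sequence-derived features and learner are actually used.

```python
# A minimal sketch of extrinsic class-imbalance handling with SMOTE
# (imbalanced-learn). The data and classifier here are toy placeholders.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data: 90 negative (nondrug target) vs. 15 positive (drug target) examples
rng = np.random.default_rng(0)
X = rng.normal(size=(105, 20))
y = np.array([0] * 90 + [1] * 15)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Oversample only the training set, so the test set keeps the true distribution
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("balanced training labels:", np.bincount(y_res))
print("test accuracy:", clf.score(X_test, y_test))
```

Note that resampling is applied after the train/test split: oversampling before splitting would leak synthetic copies of test-set neighbors into training.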

3.2 Feature Engineering


Machine learning algorithms require a fixed-length representation of the protein sequences. Many packages and web servers are now available that can calculate various measurable properties from protein sequences. In [9], the authors calculated amino acid composition, dipeptide composition, and amino acid property group composition to discriminate between human drug targets and nontargets. Apart from simple sequence features, more sophisticated features such as pseudo amino acid composition can also be calculated with the various programs mentioned in Subheading 2.
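As a concrete example, amino acid composition is simply the fraction of each of the 20 standard residues in a sequence, yielding a fixed-length vector of 20 features regardless of sequence length. The short sketch below is an illustrative implementation, not the exact code used in [9]; the example peptide is hypothetical.

```python
# A minimal sketch: amino acid composition as a fixed-length
# (20-dimensional) feature vector for a protein sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    seq = sequence.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

# Example on a short, hypothetical peptide
features = aa_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(features))   # 20 features, independent of sequence length
print(features[:5])    # fractions for A, C, D, E, F
```

Dipeptide composition follows the same idea over all 400 ordered residue pairs, giving a 400-dimensional fixed-length vector.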

3.3 Feature Selection

Certain learning algorithms may perform worse when learning concepts to discriminate between classes from high-dimensional data; this situation is known as the "curse of dimensionality." The aim of feature selection algorithms is to select a minimum set of features while achieving maximum classification accuracy (see Note 2). Using a reduced feature set provides many advantages, such as reduced training time, model simplification, and reduction in overfitting. Feature selection can be stated as follows: given a set of predicted/calculated features (attributes) F and a target variable (class) C, find a minimum subset f of F that achieves maximum classification accuracy for C. In the majority of cases, feature selection improves the performance of the classification algorithm.

There are two main approaches for feature selection: (1) the wrapper approach and (2) the filter approach. In the wrapper approach, one keeps adding features using a certain classifier until no further improvement can be achieved (forward search), or one starts with the full feature set and keeps removing one feature at a time until no further improvement is recorded (backward elimination). Alternatively, one can combine both adding and removing features in two phases. Filter methods are independent of the classifier and instead use the association of each feature with the target class for feature selection [38].
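The sketch below illustrates both routes with scikit-learn: a filter method (univariate ANOVA F-test via SelectKBest, classifier independent) and a wrapper method (greedy forward search around a k-nearest-neighbor classifier via SequentialFeatureSelector). The synthetic dataset, feature counts, and choice of classifier are illustrative assumptions only.

```python
# A minimal sketch of filter vs. wrapper feature selection with
# scikit-learn; dataset, classifier, and feature counts are toy choices.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, f_classif,
                                       SequentialFeatureSelector)
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# Filter approach: rank features by their association with the class
# (ANOVA F-test), independent of any classifier.
filt = SelectKBest(f_classif, k=5).fit(X, y)
print("filter picked:", filt.get_support(indices=True))

# Wrapper approach: forward search, greedily adding the feature that
# most improves the cross-validated accuracy of the chosen classifier.
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward").fit(X, y)
print("wrapper picked:", sfs.get_support(indices=True))
```

Because the wrapper retrains the classifier for every candidate feature, it is far more expensive than the filter but can capture feature interactions that univariate scoring misses.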

