Computational Drug Discovery and Design

2.2 Databases for Creating the Dataset

A number of databases now provide specialized information about
drugs and their targets. For example, DrugBank [19] contains
information about drug–target relationships; likewise, Matador [20]
contains direct and indirect drug–target relationships. Similarly,
SuperTarget [20] and the Therapeutic Target Database (TTD) [21]
provide high-quality information about drug–target relationships.
Along with drug–target relationship data, Integrity [22] also
provides associated disease information. The Potential Drug Target
Database (PDTD) [23] augments drug–target relationship information
with structural data on the target, while BindingDB [24] is one of
the major databases of experimentally derived protein–ligand
binding affinities.

2.3 Tools and Servers for the Calculation of Features

A number of stand-alone programs and web servers are available
for generating a variety of features of protein targets and
non-targets. PROFEAT [25, 26] is one of the oldest web servers
capable of calculating structural and physicochemical properties
from protein sequences. PseAAC-builder [27], a stand-alone program,
and PseAAC [28], a web server, are dedicated to generating various
modes of pseudo amino acid composition [29]. Pse-in-One [30] is a
web server for calculating pseudo components of proteins as well as
nucleic acids. Complementary to these programs, ProtDcal [31] can
be used to generate a number of numerical descriptors from protein
sequences as well as from 3D structures. Apart from the
above-mentioned stand-alone programs and web servers, propy [32]
and protr [33], a Python and an R package, respectively, may be
used to calculate a large number of attributes as the problem
requires. Various types of molecular descriptors of compounds can
also be easily computed using tools like PaDEL [34], MODEL [35],
and Mold2 [36]. These calculated descriptors can be used as
features in developing prediction models for drug–target
interactions.
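
As an illustration of what a sequence-derived feature looks like, the
following minimal sketch computes the plain amino acid composition
(the fraction of each of the 20 standard residues) of a protein
sequence. It uses only the Python standard library and does not
reproduce the API of any tool named above; the function names
amino_acid_composition and feature_vector are hypothetical, and
ignoring non-standard residue codes is an assumption of this sketch.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence.

    Assumption of this sketch: characters outside the 20 standard
    one-letter codes (e.g. X, B, Z, gaps) are ignored.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def feature_vector(sequence):
    """Return the composition as a fixed-order list of 20 floats."""
    composition = amino_acid_composition(sequence)
    return [composition[aa] for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Toy sequence, not taken from any database.
    vector = feature_vector("MKTAYIAKQR")
    print(len(vector), round(sum(vector), 6))
```

Each protein, target or non-target, is thus mapped to a vector of the
same length regardless of sequence length, which is the form most
learning algorithms expect. The dedicated packages listed above
provide many richer descriptors than this simple composition.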

3 Methods

3.1 Dataset Creation

Any supervised learning algorithm requires a labeled dataset, i.e.,
data instances along with their classes. The foremost requirement
for training such an algorithm is the availability of a benchmark
dataset with proper representation of the various classes (for
binary classification, the positive and negative classes), but this
is seldom the case. A dataset is said to be imbalanced when the
number of data points (instances/examples) belonging to one class
overwhelms the number of data points of the other class. In human
drug target prediction, the number of instances belonging to the
drug target class is smaller than the number of non-targets. In
such cases the machine learning

