Leave-one-out cross-validation (LOOCV) involves using a single instance
(or sample) from the sample set as the validation data and the
remaining instances as the training data. This is repeated until
each instance in the sample set has been used exactly once as the
validation data. LOOCV is therefore equivalent to k-fold cross-validation
with k equal to the number of instances in the original sample set,
and it becomes computationally expensive when the sample set is large.
In the task of predicting DNA-binding residues, an instance in
leave-one-out cross-validation can be either a single residue or a
whole protein chain. To test a model on unseen samples, an independent
test is usually conducted on a separate set that does not overlap
the training set. Such a test resembles a true prediction and
reflects the generalization ability of a prediction model.
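As an illustration, LOOCV can be written in a few lines; the following is a minimal sketch, assuming scikit-learn and NumPy are available, with a random forest classifier and a synthetic feature matrix X and labels y standing in for real residue-level features.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score
import numpy as np

# Synthetic stand-in data: 40 instances (e.g., residues), 10 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = rng.integers(0, 2, size=40)

# LOOCV is k-fold cross-validation with k equal to the number of instances:
# each instance serves exactly once as the single-item validation set.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOOCV accuracy: %.3f" % scores.mean())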
To assess classification performance, several threshold-dependent
metrics are commonly used: accuracy (ACC), sensitivity (SN, also
called recall), specificity (SP), precision (PR), the Matthews
correlation coefficient (MCC), and the F-measure (F1). These metrics
are calculated from the numbers of true positives (TP), false
positives (FP), true negatives (TN), and false negatives (FN)
produced by each classifier; their equations are given in Table 1.
TP is the number of correctly predicted DNA-binding residues,
TN is the number of correctly predicted nonbinding residues, FP is
the number of nonbinding residues predicted as binding residues,
and FN is the number of binding residues wrongly predicted as
nonbinding.
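Because all six metrics follow directly from these four confusion-matrix counts, they can be computed in a few lines. The sketch below implements the equations of Table 1 in Python; the example counts are purely illustrative, and guards against zero denominators (e.g., a class that is never predicted) are omitted for brevity.

import math

def classification_metrics(tp, fp, tn, fn):
    """Threshold-dependent metrics of Table 1 from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)   # sensitivity (recall)
    sp = tn / (tn + fp)   # specificity
    pr = tp / (tp + fp)   # precision
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    f1 = 2 * sn * pr / (sn + pr)
    return dict(ACC=acc, SN=sn, SP=sp, PR=pr, MCC=mcc, F1=f1)

# Illustrative counts: 80 binding residues found (TP), 20 missed (FN),
# 900 nonbinding correctly rejected (TN), 50 falsely called binding (FP).
print(classification_metrics(tp=80, fp=50, tn=900, fn=20))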
The receiver operating characteristic (ROC) curve is a plot of
sensitivity versus (1 − specificity) for a binary classifier at varying
thresholds. The area under the curve (AUC) can be used as a
threshold-independent measure of classification performance.
Assessing prediction quality on heavily unbalanced data sets is a
nontrivial task: on such sets, both the accuracy and the AUC of the
ROC curve can present overly optimistic assessments of performance.
Table 1
A list of common metrics and their equations

Metric  Equation
ACC     (TP + TN)/(TP + TN + FP + FN)
SN      TP/(TP + FN)
SP      TN/(TN + FP)
PR      TP/(TP + FP)
MCC     (TP × TN − FP × FN)/√[(TP + FN)(TP + FP)(TN + FP)(TN + FN)]
F1      2 × SN × PR/(SN + PR)
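To make the threshold-independent assessment concrete, the sketch below (again assuming scikit-learn and NumPy, with synthetic labels and scores) computes the ROC AUC alongside average precision, the area under the precision-recall curve; the latter is a commonly used complement on heavily unbalanced sets, added here for illustration rather than taken from Table 1.

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic, heavily unbalanced labels (roughly 1 positive per 19
# negatives) and prediction scores that are informative but noisy.
rng = np.random.default_rng(1)
y_true = (rng.random(2000) < 0.05).astype(int)
y_score = 0.6 * y_true * rng.random(2000) + 0.5 * rng.random(2000)

# ROC AUC is threshold-independent but can look optimistic when negatives
# dominate; average precision summarizes the precision-recall curve and
# drops sharply when the minority (binding) class is predicted poorly.
print("ROC AUC:           %.3f" % roc_auc_score(y_true, y_score))
print("Average precision: %.3f" % average_precision_score(y_true, y_score))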