Leave-one-out cross-validation (LOOCV) involves using a single instance
(or sample) from the sample set as the validation data and the
remaining instances as the training data. This is repeated until
each instance in the sample set has been used exactly once as the
validation data. LOOCV is therefore equivalent to k-fold cross-validation
with k equal to the number of instances in the original sample set,
and it becomes computationally expensive when the sample set is large.
In the task of predicting DNA-binding residues, an instance in
leave-one-out cross-validation can be either a single residue or a
whole protein chain. To test a model on unseen samples, an independent
test is usually conducted on a separate set that does not overlap
the training set. Such a test resembles a true prediction and
reflects the generalization ability of a prediction model.
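As an illustration, LOOCV can be written in a few lines; the following is a minimal sketch, assuming scikit-learn and NumPy are available, with a random forest classifier and a synthetic feature matrix X and labels y standing in for real residue-level features.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score
import numpy as np

# Synthetic stand-in data: 40 instances (e.g., residues), 10 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = rng.integers(0, 2, size=40)

# LOOCV is k-fold cross-validation with k equal to the number of instances:
# each instance serves exactly once as the single-item validation set.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOOCV accuracy: %.3f" % scores.mean())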
To assess classification performance, several threshold-dependent
metrics are commonly used: accuracy (ACC), sensitivity (SN, also
called recall), specificity (SP), precision (PR), the Matthews
correlation coefficient (MCC), and the F-measure (F1). These metrics
are calculated from the numbers of true positives (TP), false
positives (FP), true negatives (TN), and false negatives (FN)
produced by each classifier; their equations are given in Table 1.
TP is the number of correctly predicted DNA-binding residues,
TN is the number of correctly predicted nonbinding residues, FP is
the number of nonbinding residues predicted as binding residues,
and FN is the number of binding residues wrongly predicted as
nonbinding.
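Because all six metrics follow directly from these four confusion-matrix counts, they can be computed in a few lines. The sketch below implements the equations of Table 1 in Python; the example counts are purely illustrative, and guards against zero denominators (e.g., a class that is never predicted) are omitted for brevity.

import math

def classification_metrics(tp, fp, tn, fn):
    """Threshold-dependent metrics of Table 1 from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)   # sensitivity (recall)
    sp = tn / (tn + fp)   # specificity
    pr = tp / (tp + fp)   # precision
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    f1 = 2 * sn * pr / (sn + pr)
    return dict(ACC=acc, SN=sn, SP=sp, PR=pr, MCC=mcc, F1=f1)

# Illustrative counts: 80 binding residues found (TP), 20 missed (FN),
# 900 nonbinding correctly rejected (TN), 50 falsely called binding (FP).
print(classification_metrics(tp=80, fp=50, tn=900, fn=20))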
The receiver operating characteristic (ROC) curve is a plot of
sensitivity versus (1 − specificity) for a binary classifier at varying
thresholds. The area under the curve (AUC) can be used as a
threshold-independent measure of classification performance.
Assessing prediction quality on heavily unbalanced data sets is a
nontrivial task: on such sets, both the accuracy and the AUC of the
ROC curve can present overly optimistic assessments of performance.
Table 1
A list of common metrics and their equations

Metric  Equation
ACC     (TP + TN)/(TP + TN + FP + FN)
SN      TP/(TP + FN)
SP      TN/(TN + FP)
PR      TP/(TP + FP)
MCC     (TP × TN − FP × FN)/√[(TP + FN)(TP + FP)(TN + FP)(TN + FN)]
F1      2 × SN × PR/(SN + PR)
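To make the threshold-independent assessment concrete, the sketch below (again assuming scikit-learn and NumPy, with synthetic labels and scores) computes the ROC AUC alongside average precision, the area under the precision-recall curve; the latter is a commonly used complement on heavily unbalanced sets, added here for illustration rather than taken from Table 1.

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic, heavily unbalanced labels (roughly 1 positive per 19
# negatives) and prediction scores that are informative but noisy.
rng = np.random.default_rng(1)
y_true = (rng.random(2000) < 0.05).astype(int)
y_score = 0.6 * y_true * rng.random(2000) + 0.5 * rng.random(2000)

# ROC AUC is threshold-independent but can look optimistic when negatives
# dominate; average precision summarizes the precision-recall curve and
# drops sharply when the minority (binding) class is predicted poorly.
print("ROC AUC:           %.3f" % roc_auc_score(y_true, y_score))
print("Average precision: %.3f" % average_precision_score(y_true, y_score))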