Computational Systems Biology Methods and Protocols.7z

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]

where TP, TN, FP, and FN are the numbers of true positives
(correctly predicted O-GlcNAcylated sites), true negatives (correctly
predicted non-O-GlcNAcylated sites), false positives
(non-O-GlcNAcylated sites falsely predicted as O-GlcNAcylated), and
false negatives (O-GlcNAcylated sites falsely predicted as
non-O-GlcNAcylated), respectively.
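
As a minimal sketch, the MCC can be computed directly from the four confusion-matrix counts. The function name `mcc` and the choice to return 0.0 when the denominator vanishes are illustrative assumptions, not part of the original protocol.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: when a row or column of the confusion matrix is empty,
    # the MCC is undefined; 0.0 is returned here by convention.
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(f"MCC = {mcc(tp=40, tn=900, fp=100, fn=27):.3f}")
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction correct).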

2.4 Classification Algorithms


All the prediction algorithms shown in Table 1 for identifying
protein O-GlcNAcylated sites use machine learning methods,
including support vector machines (SVMs), artificial neural
networks, and hidden Markov models (HMMs). SVMs are a set of related
supervised learning methods for classification and regression
based on statistical learning theory, and they have proved to be a
powerful tool in many fields of bioinformatics [8, 49–52]. Five of
the six predictors (Table 1) were designed using an SVM classifier.
The predictor designed by Kao et al. [10] is a two-layered machine
learning method that combines HMM profiles with an SVM.

3 Performance of Online Tools for O-GlcNAcylation Site Prediction


Each of the computational predictors for O-GlcNAcylation site
identification shown in Table 1 comes with self-reported sensitivity
and specificity values. However, these values can deviate
significantly from the actual sensitivity and specificity of the
predictor for two reasons. First, developers can mistakenly include
testing data when refining the predictor, which leads to overfitting.
Second, when new O-GlcNAcylation modification data are published, the
overall characteristics of the available data may change [53]. Three
of the six predictors do not have an available web server. Therefore,
to assess the performance of the six predictors, we selected the
independent dataset used in O-GlcNAcPRED and PGlcS
[8, 10]. The dataset is composed of 67 O-GlcNAcylation sites
and 7244 non-O-GlcNAcylation sites extracted from 38
experimentally identified O-GlcNAcylated proteins that were not
included in the original dbOGAP. The sequence of each of the
38 proteins was uploaded to each predictor, and the prediction
results are shown in Table 2.
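
Because this dataset is heavily imbalanced (67 positive versus 7244 negative sites), it helps to see how sensitivity and specificity translate into raw counts. The sketch below is illustrative only: the helper names are assumptions, and the 60%/90% rates are hypothetical values, not results for any predictor in Table 2.

```python
N_POS = 67     # O-GlcNAcylation sites in the independent dataset
N_NEG = 7244   # non-O-GlcNAcylation sites in the independent dataset


def confusion_from_rates(sens, spec, n_pos=N_POS, n_neg=N_NEG):
    """Approximate (TP, TN, FP, FN) implied by a sensitivity and specificity."""
    tp = round(sens * n_pos)
    tn = round(spec * n_neg)
    return tp, tn, n_neg - tn, n_pos - tp


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Assumption: report 0.0 when nothing is predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


if __name__ == "__main__":
    # A trivial predictor that labels every site as negative.
    tp, tn, fp, fn = confusion_from_rates(sens=0.0, spec=1.0)
    print(f"all-negative baseline: accuracy = {accuracy(tp, tn, fp, fn):.4f}")

    # Hypothetical predictor with 60% sensitivity and 90% specificity.
    tp, tn, fp, fn = confusion_from_rates(sens=0.60, spec=0.90)
    print(f"TP={tp} FP={fp} precision = {precision(tp, fp):.4f}")
```

The all-negative baseline reaches about 99% accuracy while finding no sites at all, which is why sensitivity, specificity, and MCC are more informative than accuracy on data like these. Even a specificity of 90% leaves several hundred false positives for a few dozen true positives.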
Among the six predictors, PGlcS achieved the best sensitivity,
64.62%, which still falls short of what experimental biologists
require. The predictor O-GlcNAcscan gave the best
specificity, 92.45%, but also the worst sensitivity,
31.34%. The gap between sensitivity and specificity in
O-GlcNAcscan was mainly due to the predictor being trained on

Computational Prediction of Protein O-GlcNAc Modification 243