Computational Systems Biology Methods and Protocols.7z

$$
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$

where TP, TN, FP, and FN are the number of true positives (correctly predicted O-GlcNAcylated sites), true negatives (correctly predicted non-O-GlcNAcylated sites), false positives (falsely predicted O-GlcNAcylated sites), and false negatives (falsely predicted non-O-GlcNAcylated sites), respectively.
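As a minimal sketch, the MCC can be computed directly from the four confusion-matrix counts, alongside the usual sensitivity and specificity. The function name `confusion_metrics`, the zero-denominator convention, and the example counts are assumptions chosen for illustration; they are not taken from any of the predictors discussed here.

```python
import math


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (a convention, assumed here)."""
    return num / den if den else 0.0


def confusion_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = safe_div(tp, tp + fn)  # fraction of true sites recovered
    specificity = safe_div(tn, tn + fp)  # fraction of non-sites correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, denom)
    return sensitivity, specificity, mcc


# Hypothetical counts, for illustration only.
sens, spec, mcc = confusion_metrics(tp=40, tn=900, fp=50, fn=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

Because the MCC uses all four counts, it is less easily inflated than sensitivity or specificity alone when negative sites greatly outnumber positive sites.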

2.4 Classify Algorithms

All the prediction algorithms shown in Table 1 for identifying protein O-GlcNAcylated sites use machine learning methods, including support vector machines (SVMs), artificial neural networks, and hidden Markov models (HMMs). SVMs are a set of related supervised learning methods used for classification and regression based on statistical learning theory, and they have been shown to be a powerful tool in many fields of bioinformatics [8, 49–52]. Five of the six predictors (Table 1) were designed using an SVM classifier. The predictor designed by Kao et al. [10] is a two-layered machine learning method that combines HMM profiles with an SVM.
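To illustrate the idea behind an SVM classifier, the sketch below trains a linear soft-margin SVM by batch sub-gradient descent on the regularized hinge loss. This is a simplified stand-in, not the implementation used by any predictor in Table 1: the function names, hyperparameters, and two-dimensional toy points are all hypothetical, and the toy points merely take the place of sequence-derived feature vectors.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=1000):
    """Minimize lam/2 * ||w||^2 + mean hinge loss; labels y must be +1 or -1."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularization term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge sub-gradient
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (site) or -1 (non-site) from the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical toy data: +1 stands for a modified site, -1 for an unmodified site.
X = [[2, 2], [3, 3], [2, 3], [3, 2], [-2, -2], [-3, -3], [-2, -3], [-3, -2]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

In a two-layered design, the input features of such a classifier would be scores produced by a first-stage model rather than raw coordinates; that wiring is not shown here.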

3 Performance of Online Tools for O-GlcNAcylation Sites Prediction

Each of the computational predictors for O-GlcNAcylation site identification shown in Table 1 has self-reported sensitivity and specificity values. However, these values can deviate significantly from the actual sensitivity and specificity of the predictor for two reasons. First, developers can mistakenly include testing data when refining the predictor, leading to an “over-fit.” Second, when new O-GlcNAcylation modification data are published, the overall characteristics of the available data may change [53]. Three of the six predictors do not have an available web server. Therefore, to assess the performance of the six predictors, we selected the independent dataset used in O-GlcNAcPRED and PGlcS [8, 10]. The dataset is composed of 67 O-GlcNAcylation sites and 7244 non-O-GlcNAcylation sites extracted from 38 experimentally identified O-GlcNAcylated proteins that were not included in the original dbOGAP. The sequences of each of the 38 proteins were uploaded into each predictor, and the prediction results are shown in Table 2. Among the six predictors, PGlcS achieved the best sensitivity of 64.62%, which does not meet the requirements of experimental biologists. The predictor O-GlcNAcscan gave the best specificity of 92.45%, but also gave the worst sensitivity of 31.34%. The differences between sensitivity and specificity in O-GlcNAcscan were mainly due to the predictor being trained on

Computational Prediction of Protein O-GlcNAc Modification
