protein O-GlcNAcylation sites with respect to their datasets,
feature extraction methods, and classifier algorithms. We also discuss
the future challenges and outstanding questions.
2 Material and Methods
Computational prediction of protein O-GlcNAcylation sites can be
formulated as a two-class classification problem. The systematic
flowchart of the prediction method is summarized in Fig. 2. The
method consists mainly of three components: dataset construction
and preprocessing, sequence feature representation and selection,
and prediction algorithms. We discuss the six computational
predictors for O-GlcNAcylation site identification reported in
previous studies [6–11] with respect to these three aspects.
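
To make the two-class formulation concrete, the following minimal Python sketch turns one protein sequence into labeled instances, one per candidate Ser/Thr residue. It is an illustration under stated assumptions rather than the procedure of any of the six predictors: the function name `extract_windows`, the flank length, the padding character `X`, and the choice to label every non-verified Ser/Thr as negative are all hypothetical.

```python
from typing import List, Set, Tuple


def extract_windows(
    sequence: str,
    positive_sites: Set[int],
    flank: int = 7,   # assumed flank length; real predictors choose their own
    pad: str = "X",   # assumed padding symbol for positions past the termini
) -> List[Tuple[str, int]]:
    """Return (window, label) pairs for every Ser/Thr residue in `sequence`.

    `positive_sites` holds 1-based positions of experimentally verified
    O-GlcNAcylation sites. Label 1 marks a verified site; label 0 marks any
    other Ser/Thr (an assumption made here for illustration only).
    """
    # Pad both ends so that residues near the termini still get full windows.
    padded = pad * flank + sequence + pad * flank
    samples = []
    for i, residue in enumerate(sequence):
        if residue not in "ST":
            continue  # only Ser and Thr are candidate sites
        # Residue i of the sequence sits at index i + flank in `padded`,
        # so the window centered on it starts at index i.
        window = padded[i : i + 2 * flank + 1]
        label = 1 if (i + 1) in positive_sites else 0
        samples.append((window, label))
    return samples


# Toy sequence with one verified site at position 3 (hypothetical data).
samples = extract_windows("MASTPKS", {3}, flank=2)
for window, label in samples:
    print(window, label)
```

Each resulting pair corresponds to one instance of the classification problem: the window is the input from which features are later derived, and the label is the class to be predicted.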
2.1 Datasets Construction and Preprocessing
It is crucial to construct a high-quality benchmark dataset for
unbiased performance evaluation. The datasets used to predict
protein O-GlcNAcylation sites are generally constructed from the
UniProtKB/Swiss-Prot Database [12], dbPTM [13], dbOGAP
[6], O-GlycBase [14], PhosphoSitePlus [13], and the PubMed
literature. For O-GlcNAcylation site prediction, experimentally
verified O-GlcNAcylation sites are defined as the positive dataset.
Fig. 2 The systematic flowchart of the O-GlcNAcylation prediction method