Because only neighboring residues can influence the status of a
central serine or threonine residue, a sliding window strategy has
been used to extract positive data from protein sequences as a
training dataset, which contains peptide sequences with
O-GlcNAcylation sites symmetrically surrounded by flanking
residues [5, 6, 16]. The number of residues considered in a window is
important because too few may omit information useful for making
predictions, while too many may introduce redundancy
and decrease the signal-to-noise ratio. The most appropriate window
size is still unclear, and most researchers have tested several
fragment sizes in preliminary experiments and then chosen the window
size that gives the best predictive performance [5, 6, 8–11, 16,
18]. dbOGAP, which was created by Wang et al. [4], is a database
of O-GlcNAcylated proteins and sites, primarily based on literature
published before April 2010. The database currently contains
approximately 800 proteins with experimental O-GlcNAcylation
information; about 61% are human proteins, and 172 of the
proteins have a total of approximately 400 identified
O-GlcNAcylation sites [6]. Most of the existing predictors have
been trained on datasets constructed from dbOGAP.
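The sliding window extraction described at the start of this section can be sketched in Python as follows. This is a minimal illustration, not the procedure of any cited predictor: the function name `extract_windows`, the 1-based position convention, and the placeholder residue `X` used to pad windows that run past a terminus are all assumptions, because the text does not state how sites near the sequence ends are handled.

```python
def extract_windows(sequence, window_size=11, targets="ST", pad="X"):
    """Return (position, peptide) pairs for every S/T residue in a sequence.

    Each peptide has `window_size` residues with the candidate site at its
    center. Positions are 1-based. Padding with `pad` is an assumption made
    for this sketch so that sites near the termini still yield full windows.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the site sits at the center")
    flank = window_size // 2
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Residue i of the original sequence sits at index i + flank in
            # the padded one, so the window starts at index i.
            windows.append((i + 1, padded[i:i + window_size]))
    return windows
```

For example, `extract_windows("MKSAT", window_size=5)` yields the peptide `MKSAT` for the serine at position 3 and `SATXX` for the threonine at position 5. Calling the function with several odd values of `window_size` gives the candidate datasets among which a window size would then be chosen by predictive performance.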
The first predictor was constructed by Gupta and Brunak
[7]. It is an artificial neural network system trained on sequence
fragments of about 40 O-GlcNAcylation sites that were available at the
time. However, as far as we know, the original proteins that were
used to construct the dataset can no longer be found. The second
prediction system, OGlcNAcScan, was developed based on annotated
O-GlcNAcylated proteins collected in dbOGAP [6]. The
training dataset consisted of 373 positive instances, which were
experimentally verified O-GlcNAcylation sites in 167 protein sequences
from dbOGAP, and 29,897 negative instances, which were the remaining
unannotated serine/threonine sites in the same protein
sequences. Based on prediction performance, Wang et al. selected
a peptide length of 11 residues for their training
dataset [6].
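The way positive and negative instances are drawn from the same annotated proteins can be illustrated with a short Python sketch. It is not the OGlcNAcScan implementation: the function name `label_sites`, the 1-based `annotated_positions` argument, and the `X` padding at the termini are assumptions introduced here for illustration.

```python
def label_sites(sequence, annotated_positions, window_size=11, pad="X"):
    """Split the S/T-centered windows of one protein into positives and negatives.

    `annotated_positions` holds the 1-based positions of experimentally
    verified sites (hypothetical input format). Every other serine or
    threonine in the same sequence is treated as a negative instance.
    """
    flank = window_size // 2
    padded = pad * flank + sequence + pad * flank
    annotated = set(annotated_positions)
    positives, negatives = [], []
    for i, residue in enumerate(sequence):
        if residue not in "ST":
            continue
        peptide = padded[i:i + window_size]
        if (i + 1) in annotated:
            positives.append(peptide)
        else:
            negatives.append(peptide)
    return positives, negatives
```

Because every unannotated serine or threonine becomes a negative, the resulting dataset is strongly imbalanced: with the counts reported above, 29,897 negatives against 373 positives is roughly 80 negatives per positive.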
The predictor O-GlcNAcPRED was developed by Jia et al.
[8]. It was trained on a balanced dataset containing 339 positive
O-GlcNAcylation sites and 339 non-O-GlcNAcylation sites
derived from dbOGAP. The authors also provided an independent
test dataset containing 67 O-GlcNAcylation sites in 38 experimentally
identified proteins that were found by searching the published
literature and were not included in dbOGAP. Based on
cross-validation, a peptide length of 23 residues was selected
for the training and test datasets of Jia et al. [8]. The predictor
PGlcS was developed by Zhao et al. [10] in 2015. It was based on
the same positive training dataset and test dataset as
O-GlcNAcPRED [8].
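A balanced dataset such as the 339 positive and 339 negative sites described above for O-GlcNAcPRED can be obtained in several ways. The sketch below uses random undersampling of the negatives, which is an assumption made for illustration; the text does not say how the negative sites were chosen. The function name `balance_by_undersampling` and the `seed` parameter are likewise hypothetical.

```python
import random


def balance_by_undersampling(positives, negatives, seed=0):
    """Return positives plus an equally sized random subset of negatives.

    Random undersampling is assumed here; it is one common way to balance
    classes, not necessarily the one used by the cited authors.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    if len(negatives) <= len(positives):
        # Nothing to discard: the negative class is already no larger.
        return list(positives), list(negatives)
    return list(positives), rng.sample(negatives, len(positives))
```

Fixing the seed makes the selected negatives reproducible, which matters when different window sizes or classifiers are compared on the same balanced set.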
Lee’s research group successively constructed two predictors in
2014 and 2015 [9, 11]. The most recent training dataset for these
238 Cangzhi Jia and Yun Zuo