Computational Systems Biology Methods and Protocols.7z

Because only neighboring residues can influence the status of a centered serine or threonine residue, a sliding-window strategy has been used to extract positive training data from protein sequences: each training peptide contains an O-GlcNAcylation site at its center, flanked symmetrically by neighboring residues [5, 6, 16]. The number of residues in a window is important because too few may omit information useful for prediction, while too many may introduce unavoidable redundancy and decrease the signal-to-noise ratio. The most appropriate window size is still unclear, and most researchers have tested several fragment sizes in preliminary experiments and then chosen the window size that gives the best predictive performance [5, 6, 8–11, 16, 18].

dbOGAP, which was created by Wang et al. [4], is a database of O-GlcNAcylated proteins and sites, based primarily on literature published before April 2010. The database currently contains approximately 800 proteins with experimental O-GlcNAcylation information; about 61% are human proteins, and 172 of the proteins have a total of approximately 400 identified O-GlcNAcylation sites [6]. Most of the existing predictors have been trained on datasets constructed from dbOGAP.

The first predictor was constructed by Gupta and Brunak [7]. It is an artificial neural network system trained on sequence fragments of about 40 GlcNAcylation sites that were available at the time. However, as far as we know, the original proteins that were used to construct the dataset can no longer be found. The second prediction system, OGlcNAcScan, was developed from annotated O-GlcNAcylation proteins collected in dbOGAP [6]. Its training dataset consisted of 373 positive instances, which were experimentally verified O-GlcNAcylation sites in 167 protein sequences from dbOGAP, and 29,897 negative instances, which were the remaining unannotated serine/threonine sites in the same protein sequences.
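The window extraction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any cited predictor: the function name `extract_windows`, the 0-based site indexing, and the padding of windows near the sequence termini with the placeholder residue `X` are all assumptions made for the example. Annotated serine/threonine positions yield positive peptides, and the remaining serine/threonine positions in the same sequence yield negative peptides.

```python
from typing import List, Set, Tuple

PAD = "X"  # assumed placeholder for positions beyond the sequence termini


def extract_windows(
    sequence: str, positive_sites: Set[int], flank: int = 5
) -> Tuple[List[str], List[str]]:
    """Collect S/T-centered peptides of length 2 * flank + 1.

    positive_sites holds 0-based indices of annotated sites (an assumed
    convention). Returns (positive_peptides, negative_peptides).
    """
    # Pad both ends so that every window has the same length.
    padded = PAD * flank + sequence + PAD * flank
    positives: List[str] = []
    negatives: List[str] = []
    for i, residue in enumerate(sequence):
        if residue not in "ST":
            continue  # only serine and threonine can be centered
        # Position i in the sequence sits at i + flank in the padded string,
        # so this slice places the residue exactly in the middle.
        window = padded[i : i + 2 * flank + 1]
        if i in positive_sites:
            positives.append(window)
        else:
            negatives.append(window)
    return positives, negatives
```

With `flank=5` each peptide has 11 residues and with `flank=11` it has 23, so the same function can be reused when comparing candidate window sizes. Because a protein usually has far more unannotated than annotated serine/threonine residues, the negative list is typically much longer than the positive list.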
On the basis of predictive performance, Wang et al. selected a peptide length of 11 residues for their training dataset [6]. The predictor O-GlcNAcPRED was developed by Jia et al. [8]. It was trained on a balanced dataset containing 339 positive O-GlcNAcylation sites and 339 non-O-GlcNAcylation sites derived from dbOGAP. The authors also provided an independent test dataset containing 67 O-GlcNAcylation sites in 38 experimentally identified proteins, which were found by searching the published literature and were not included in dbOGAP. After cross-validation, Jia et al. selected a peptide length of 23 residues for both the training and test datasets [8]. The predictor PGlcS was developed by Zhao et al. [10] in 2015. It was based on the same positive training dataset and test dataset as O-GlcNAcPRED [8]. Lee’s research group successively constructed two predictors in 2014 and 2015 [9, 11]. The most current training dataset for these

238 Cangzhi Jia and Yun Zuo
