Computational Systems Biology Methods and Protocols.7z

predictors contains 410 O-GlcNAcylation sites and 410 non-O-GlcNAcylation sites from dbOGAP, OGlycBase, and UniProtKB, and the most current test dataset contains 956 O-GlcNAcylation sites and 60,976 non-O-GlcNAcylation sites from PhosphoSitePlus. Detailed information about the number of samples in the training datasets and the length of each sample for the six predictors is listed in Table 1. It should be noted that a total of 1181 O-GlcNAc sites on 520 proteins have been collected from publications since 2012 [19–30], but these data have not been used in the recent predictors. Therefore, an up-to-date, comprehensive, and reliable dataset is expected to be constructed in the near future.

2.2 Feature Representation and Selection

To establish a useful statistical predictor, the features of the protein or peptide samples need to be represented by an effective mathematical expression that truly reflects their intrinsic correlation with the object to be predicted [9, 10, 31–41]. Here, ten different features, including orthogonal binary coding and amino acid composition, were used to represent peptides, as described below.

2.2.1 Orthogonal Binary Coding

For this feature, each amino acid type is coded with 21 binary values; e.g., “100...0” (one followed by 20 zeros) for A (alanine), “010...0” for C (cysteine),..., “000...010” for Y (tyrosine), and “000...001” for X (pseudo amino acid) [6].
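The coding scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any of the cited predictors. It assumes that the residues are ordered alphabetically by one-letter code (A, C, ..., Y) with X in the 21st position, which matches the examples given in the text. The function name `orthogonal_binary_encode` and the rule that any non-standard symbol is mapped to X are hypothetical choices made for this example.

```python
# Assumed residue order: the 20 standard amino acids sorted by one-letter
# code, followed by X (pseudo amino acid) as the 21st symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def orthogonal_binary_encode(peptide):
    """Return a flat list of 21 * len(peptide) zeros and ones."""
    encoded = []
    for residue in peptide.upper():
        index = ALPHABET.find(residue)
        if index == -1:
            # Assumption for this sketch: unknown symbols are treated as X.
            index = ALPHABET.index("X")
        vector = [0] * len(ALPHABET)
        vector[index] = 1  # exactly one position is set per residue
        encoded.extend(vector)
    return encoded


if __name__ == "__main__":
    # "A" gives a one followed by 20 zeros, as in the text.
    print(orthogonal_binary_encode("A"))
```

A peptide of length 2n+1 therefore yields a vector of 21(2n+1) binary values under this scheme.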

2.2.2 Amino Acid Composition and Amino Acid Pair Composition

Twenty elements ($f_{20}$) are used to specify the numbers of occurrences of the 20 amino acids normalized with the total number of residues in a sequence fragment, and 400 elements ($f_{400}$) specify the numbers of occurrences of the 400 amino acid pairs normalized with the total number of amino acid pairs in a sequence fragment [9]. These two types of characteristics are calculated as follows:

$$f_{20}(i) = \frac{x_{20}(i)}{L}, \qquad f_{400}(j) = \frac{y_{400}(j)}{L-1}$$

where $x_{20}(i)$ and $y_{400}(j)$ represent the number of occurrences of residue $i$ and the number of occurrences of amino acid pair $j$ in a peptide sequence, respectively, and $L$ is the length of the peptide sequence.
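As a worked illustration of the two compositions, the following Python sketch computes the 20 single-residue frequencies (counts divided by L) and the 400 pair frequencies (counts divided by L − 1). It assumes that "amino acid pairs" means adjacent, overlapping residue pairs, which is consistent with a fragment of length L containing L − 1 pairs. The function names and the decision to ignore pairs containing non-standard symbols are hypothetical choices for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, e.g. "AA", "AC", ..., "YY".
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(fragment):
    """20 values: occurrences of each residue divided by the length L."""
    length = len(fragment)
    if length == 0:
        raise ValueError("fragment must not be empty")
    return [fragment.count(aa) / length for aa in AMINO_ACIDS]


def pair_composition(fragment):
    """400 values: occurrences of each adjacent pair divided by L - 1."""
    n_pairs = len(fragment) - 1
    if n_pairs < 1:
        raise ValueError("fragment must contain at least two residues")
    counts = dict.fromkeys(PAIRS, 0)
    # Slide a window of width 2 so that overlapping pairs are counted.
    for k in range(n_pairs):
        pair = fragment[k:k + 2]
        if pair in counts:  # pairs with non-standard symbols are skipped
            counts[pair] += 1
    return [counts[p] / n_pairs for p in PAIRS]


if __name__ == "__main__":
    f20 = amino_acid_composition("ACCA")
    f400 = pair_composition("ACCA")
    print(f20[AMINO_ACIDS.index("A")], f400[PAIRS.index("CC")])
```

For the fragment "ACCA" (L = 4), residues A and C each occur twice, and the three adjacent pairs are AC, CC, and CA, so each of these pairs contributes one count out of L − 1 = 3.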

2.2.3 Positional Weighted Matrix (PWM)

The relative frequency of the amino acids that surround O-GlcNAcylation sites within the sequence fragments is encoded by a PWM. A PWM containing (2n+1) × ω elements can profile the distribution of amino acids in the training dataset. Here, 2n+1 denotes the window size and ω represents the 21 symbols considered (20 amino acids and one terminal signal) [9].
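One way to build such a matrix is sketched below in Python. This is an illustration under stated assumptions rather than the procedure of reference [9]: the terminal signal is represented by the hypothetical symbol "-", all training fragments are assumed to have the same length 2n+1, and each fragment is encoded by looking up the frequency of its residue at every window position. The function names `build_pwm` and `encode_with_pwm` are introduced only for this example.

```python
# 20 amino acids plus one terminal signal; "-" is a hypothetical placeholder.
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY-"


def build_pwm(fragments):
    """Return a (2n+1) x 21 matrix of per-position relative frequencies."""
    if not fragments:
        raise ValueError("at least one fragment is required")
    window = len(fragments[0])
    if any(len(f) != window for f in fragments):
        raise ValueError("all fragments must share the same window size")
    counts = [[0] * len(SYMBOLS) for _ in range(window)]
    for fragment in fragments:
        for pos, symbol in enumerate(fragment):
            # str.index raises ValueError for symbols outside SYMBOLS.
            counts[pos][SYMBOLS.index(symbol)] += 1
    total = len(fragments)
    return [[c / total for c in row] for row in counts]


def encode_with_pwm(fragment, pwm):
    """Encode a fragment as 2n+1 values, one frequency per position."""
    return [pwm[pos][SYMBOLS.index(symbol)]
            for pos, symbol in enumerate(fragment)]


if __name__ == "__main__":
    training = ["ASC", "GSC", "-SA"]  # toy fragments with n = 1
    pwm = build_pwm(training)
    print(encode_with_pwm("ASC", pwm))
```

In this toy example the central position is always S, so its frequency is 1, while the flanking positions reflect how often each symbol occurs there across the three fragments.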

Computational Prediction of Protein O-GlcNAc Modification 239
