Computational Systems Biology Methods and Protocols.7z

predictors contains 410 O-GlcNAcylation sites and 410 non-O-GlcNAcylation
sites from dbOGAP, OGlycBase, and UniProtKB, and the most current test
dataset contains 956 O-GlcNAcylation sites and 60,976 non-O-GlcNAcylation
sites from PhosphoSitePlus. Detailed information about the number of samples
in the training datasets and the length of each sample for the six predictors
is listed in Table 1.
It should be noted that in [19–30], a total of 1181 O-GlcNAc sites on
520 proteins were collected from publications since 2012, but these data
have not been used in the recent predictors. Therefore, the construction
of an up-to-date, comprehensive, and reliable dataset is expected in the
near future.

2.2 Feature Representation and Selection


To establish a useful statistical predictor, the protein or peptide
samples need to be represented by an effective mathematical expression
that truly reflects their intrinsic correlation with the target to be
predicted [9, 10, 31–41]. Here, ten different features, including
orthogonal binary coding and amino acid composition, were used to
represent peptides, as described below.

2.2.1 Orthogonal Binary Coding


For this feature, each amino acid type is coded with 21 binary
values, e.g., “100...0” (one followed by 20 zeros) for A (alanine),
“010...0” for C (cysteine), ..., “000...010” for Y (tyrosine), and
“000...001” for X (pseudo amino acid) [6].
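
The coding scheme above can be illustrated with a minimal Python sketch. The alphabet order used here (the 20 standard residues in alphabetical one-letter order, followed by X) is an assumption chosen to match the examples in the text (A first, C second, Y twentieth, X last). The function name `orthogonal_binary_encode` and the rule that unrecognized symbols map to X are hypothetical choices for illustration, not taken from the cited predictors.

```python
# Assumed alphabet order: 20 standard residues (alphabetical by one-letter
# code), then X as the pseudo amino acid in the 21st position.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "X"


def orthogonal_binary_encode(peptide):
    """Return a flat list of 21 * len(peptide) binary values."""
    encoded = []
    for residue in peptide.upper():
        index = ALPHABET.find(residue)
        if index == -1:
            # Assumption: any unrecognized symbol is treated as X.
            index = ALPHABET.index("X")
        vector = [0] * len(ALPHABET)
        vector[index] = 1
        encoded.extend(vector)
    return encoded


if __name__ == "__main__":
    # A three-residue fragment yields 3 * 21 = 63 values.
    print(len(orthogonal_binary_encode("AST")))
```

Under this scheme, a fragment of length 2n + 1 is always represented by 21 × (2n + 1) binary values, with exactly one value set to 1 per residue.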

2.2.2 Amino Acid Composition and Amino Acid Pair Composition


Twenty elements ($f_{20}$) specify the numbers of occurrences of the
20 amino acids, normalized by the total number of residues in a sequence
fragment, and 400 elements ($f_{400}$) specify the numbers of occurrences
of the 400 amino acid pairs, normalized by the total number of amino acid
pairs in a sequence fragment [9]. These two types of features are
calculated as follows:

\[
f_{20}(i) = \frac{x_{20}(i)}{L}, \qquad
f_{400}(j) = \frac{y_{400}(j)}{L - 1}
\]

where $x_{20}(i)$ and $y_{400}(j)$ represent the number of occurrences
of residue $i$ and the number of occurrences of amino acid pair $j$ in a
peptide sequence, respectively, and $L$ is the length of each peptide
sequence.
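
As a hedged illustration of the two compositions, the following Python sketch computes both feature vectors for a single peptide. It assumes that amino acid pairs are adjacent, overlapping dipeptides (so a peptide of length L contains L − 1 pairs, matching the denominator above) and that only the 20 standard residues are counted. The function names and the ordering of the 400 pairs are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, e.g. "AA", "AC", ..., "YY" (ordering is an assumption).
PAIRS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(peptide):
    """f20(i) = x20(i) / L for each of the 20 amino acids."""
    length = len(peptide)
    if length == 0:
        raise ValueError("peptide must not be empty")
    return [peptide.count(aa) / length for aa in AMINO_ACIDS]


def pair_composition(peptide):
    """f400(j) = y400(j) / (L - 1) for each of the 400 amino acid pairs."""
    length = len(peptide)
    if length < 2:
        raise ValueError("peptide must contain at least two residues")
    counts = dict.fromkeys(PAIRS, 0)
    # Slide a window of width 2 so that overlapping pairs are all counted.
    for k in range(length - 1):
        pair = peptide[k:k + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[pair] / (length - 1) for pair in PAIRS]


if __name__ == "__main__":
    f20 = amino_acid_composition("AAST")
    f400 = pair_composition("AAST")
    print(f20[AMINO_ACIDS.index("A")], f400[PAIRS.index("AA")])
```

For the peptide AAST (L = 4), alanine occurs twice, so its composition is 2/4, and each of the three pairs AA, AS, and ST occurs once, so each has a pair composition of 1/3.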

2.2.3 Positional Weighted Matrix (PWM)


The relative frequencies of the amino acids that surround
O-GlcNAcylation sites are captured in a PWM, which is used to encode
each sequence fragment. A PWM containing (2n + 1) × ω elements can
profile the distribution of amino acids in the training dataset.
Here, 2n + 1 denotes the window size, and ω is the number of symbols:
the 20 amino acids and one terminal signal [9].
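
A minimal Python sketch of this idea follows. It assumes that all training fragments are already aligned windows of length 2n + 1 centered on the site, that the terminal signal is written as "-", and that a fragment is encoded by looking up, at each position, the relative frequency of the residue found there. These choices, and the function names `build_pwm` and `encode_with_pwm`, are assumptions for illustration; the cited predictors may build and apply the matrix differently.

```python
# 20 amino acids plus one terminal signal; "-" as the terminal symbol is an
# assumption made for this sketch.
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY" + "-"


def build_pwm(fragments):
    """Return a list of (2n + 1) columns, each mapping symbol -> frequency."""
    if not fragments:
        raise ValueError("at least one fragment is required")
    window = len(fragments[0])
    if any(len(fragment) != window for fragment in fragments):
        raise ValueError("all fragments must share the same window size")
    pwm = []
    for position in range(window):
        counts = dict.fromkeys(SYMBOLS, 0)
        for fragment in fragments:
            # A symbol outside SYMBOLS raises KeyError here.
            counts[fragment[position]] += 1
        total = len(fragments)
        pwm.append({symbol: count / total for symbol, count in counts.items()})
    return pwm


def encode_with_pwm(fragment, pwm):
    """Encode a fragment as the per-position frequencies of its residues."""
    if len(fragment) != len(pwm):
        raise ValueError("fragment length must equal the PWM window size")
    return [column[symbol] for symbol, column in zip(fragment, pwm)]


if __name__ == "__main__":
    # Toy training fragments with n = 1 (window size 3); not real data.
    training = ["-AS", "TAS", "TSS"]
    pwm = build_pwm(training)
    print(encode_with_pwm("TAS", pwm))
```

In this toy example the matrix has 3 × 21 entries, and each column sums to 1 because every training fragment contributes exactly one symbol per position.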

Computational Prediction of Protein O-GlcNAc Modification 239