the formula $1/(1 + e^{-x})$ in Wu et al. [9]. Zhao et al. [10] combined the
output of the PSSM with amino acid physicochemical properties, as
discussed in the following section.
2.2.6 Physicochemical Properties
The specificity and diversity of protein structure and function are
largely attributed to the various properties of each of the 20 amino
acids in the sequence [37]. The physicochemical properties of amino
acids surrounding serine/threonine residues have been reported to
influence their O-GlcNAcylation. Therefore, some predictors have
applied the physicochemical and biochemical properties of amino
acids to predict protein O-GlcNAcylation sites [8, 10]. The AAindex
database (http://www.genome.jp/aaindex/) [43] includes amino
acid mutation matrices and amino acid indexes. Version 9.0 contains
544 physicochemical properties. An amino acid index is a set of
20 numerical values that denote various physicochemical properties
of amino acids. Physicochemical properties have been used successfully
to predict several protein modifications [44–48]. Zhao et al.
[10] selected 17 informative physicochemical properties to encode
protein peptides in the PGlcS method. For each residue, they multiplied
the 20 PSSM values by the corresponding physicochemical property values
of the 20 amino acids and summed the products
according to the following equation:
$$F_{ki} = \sum_{j=1}^{20} P_{kj} M_{ij}$$
where $P_{kj}$ represents the value of the $k$th physicochemical property
for the $j$th amino acid type ($k = 1, 2, \ldots, L$) and $M_{ij}$ represents the
raw value of the $i$th position for the $j$th amino acid type in the PSSM
($i = 1, 2, \ldots, 15$).
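The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PGlcS implementation: the function name `combine_pssm_with_properties`, the PSSM column order, and the toy PSSM values are hypothetical, and the Kyte-Doolittle hydropathy scale is used only as one familiar property; it is not necessarily among the 17 properties selected by Zhao et al. [10].

```python
import random

# Assumed PSSM column order (hypothetical for this sketch).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Kyte-Doolittle hydropathy values, used here only as an example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def combine_pssm_with_properties(pssm, properties):
    """Return F with F[k][i] = sum over j of properties[k][j] * pssm[i][j].

    pssm:       list of window positions, each a list of 20 raw PSSM values
    properties: list of properties, each a list of 20 values (one per amino acid)
    """
    return [
        [sum(p_kj * m_ij for p_kj, m_ij in zip(prop_row, pssm_row))
         for pssm_row in pssm]
        for prop_row in properties
    ]


# Toy PSSM for a 15-residue window (random integers, not real data).
rng = random.Random(0)
toy_pssm = [[rng.randint(-5, 10) for _ in AMINO_ACIDS] for _ in range(15)]

hydropathy_row = [KYTE_DOOLITTLE[aa] for aa in AMINO_ACIDS]
features = combine_pssm_with_properties(toy_pssm, [hydropathy_row])
```

With one property and a 15-residue window, `features` holds one row of 15 values; using several properties would yield one such row per property.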
2.2.7 Secondary Structure
To predict the secondary structure of all the amino acid residues,
the full-length protein sequences with O-GlcNAcylated sites were
submitted to PSIPRED [38]. PSIPRED outputs sequences composed
of the characters “C,” “H,” and “E,” where “C” represents
coil, “H” represents helix, and “E” represents strand. The orthogonal
binary coding scheme was used to transform these secondary
structure terms into numeric vectors; for instance, a helix (H) is
encoded as “001,” a strand (E) is encoded as “010,” and a coil
(C) is encoded as “100” in PGlcS [10].
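The orthogonal binary coding described above might be sketched as follows. The three codes are taken from the text; the function name `encode_secondary_structure` and the example string are assumptions made for illustration, and parsing of actual PSIPRED output files is not shown.

```python
# Codes as given in the text for PGlcS: coil, helix, strand.
SS_CODES = {
    "C": (1, 0, 0),
    "H": (0, 0, 1),
    "E": (0, 1, 0),
}


def encode_secondary_structure(ss_string):
    """Concatenate the 3-bit code of each secondary structure state."""
    vector = []
    for state in ss_string:
        vector.extend(SS_CODES[state])
    return vector


# Hypothetical three-residue prediction: coil, helix, strand.
encoded_ss = encode_secondary_structure("CHE")
```

A window of n residues therefore contributes 3n binary features.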
2.2.8 Disorder
DISOPRED2, which is one of the leading servers for predicting
natively disordered regions in proteins [39], was applied to obtain
disorder information for the amino acid sequences. A disordered
structure is encoded as “0 1,” while an ordered structure is encoded
as “1 0” in PGlcS [10].
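The disorder encoding can be sketched in the same way. The two codes follow the text; the function name `encode_disorder` and the use of a list of booleans as input are assumptions for this example, since the step that reads per-residue calls from the DISOPRED2 output is not described here.

```python
# Codes as given in the text for PGlcS.
DISORDER_CODES = {
    True: (0, 1),   # disordered residue
    False: (1, 0),  # ordered residue
}


def encode_disorder(is_disordered):
    """Concatenate the 2-bit code for each per-residue disorder call."""
    vector = []
    for flag in is_disordered:
        vector.extend(DISORDER_CODES[bool(flag)])
    return vector


# Hypothetical calls for five residues: two disordered, then three ordered.
window_flags = [True, True, False, False, False]
encoded_disorder = encode_disorder(window_flags)
```

Each residue contributes two binary features, so a 15-residue window yields a 30-element vector.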