2.2.4 Adapted Normal Distribution Bi-profile Bayes
The bi-profile Bayes feature extraction approach was first proposed
by Xu et al. [31], in which each peptide in a training dataset can be
encoded as $(p_1, p_2, \ldots, p_n, p_{n+1}, \ldots, p_{2n})$, where $(p_1, p_2, \ldots, p_n)$
represents the posterior probability of each amino acid at each
position in the positive dataset and $(p_{n+1}, \ldots, p_{2n})$ represents the
posterior probability of each amino acid at each position in the
negative dataset. The posterior probability is calculated as the
occurrence of each amino acid at each position in the training
dataset. This feature extraction approach was improved by encoding
the frequency of each amino acid at each position as random
variables $X_{ij}$, where $i$ ($i = 1, 2, \ldots, 20$) represents the $i$th amino acid
in {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} and
$j$ ($j = 1, 2, \ldots, L$) represents the $j$th position. The random variables
$X_{ij}$ ($i = 1, 2, \ldots, 20$; $j = 1, 2, \ldots, L$) are independent and obey the
same binomial distribution $b(n, p)$, where $n = 339$ is the number of
peptide sequences in the positive/negative dataset and $p = 1/20$ is
the probability of each amino acid occurring in each position.
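
The counting step behind the variables X_ij, together with the original frequency-based bi-profile encoding, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `position_counts` and `bpb_encode` are hypothetical, all peptides are assumed to have the same length, and residues outside the 20 standard amino acids are assumed to contribute nothing.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def position_counts(peptides):
    """Return counts[i][j]: occurrences X_ij of amino acid i at position j."""
    length = len(peptides[0])
    counts = [[0] * length for _ in AMINO_ACIDS]
    for pep in peptides:
        if len(pep) != length:
            raise ValueError("all peptides must have the same length")
        for j, aa in enumerate(pep):
            # Assumption: gaps and non-standard residues are simply skipped.
            if aa in AA_INDEX:
                counts[AA_INDEX[aa]][j] += 1
    return counts


def profile_values(peptide, counts, n_seqs):
    """Relative frequency of the peptide's own residue at each position."""
    return [
        counts[AA_INDEX[aa]][j] / n_seqs if aa in AA_INDEX else 0.0
        for j, aa in enumerate(peptide)
    ]


def bpb_encode(peptide, pos_counts, n_pos, neg_counts, n_neg):
    """Concatenate the positive-profile and negative-profile values."""
    return (profile_values(peptide, pos_counts, n_pos)
            + profile_values(peptide, neg_counts, n_neg))
```

For a peptide of length L, `bpb_encode` returns a vector of 2L values: the first L come from the positive profile and the last L from the negative profile.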
According to the De Moivre-Laplace theorem [42], the normal
form variable $\frac{X_{ij} - np}{\sqrt{np(1-p)}}$ has a limiting cumulative distribution
function which approximates a normal distribution $N(0, 1)$. In
O-GlcNAcPRED [8], the standard variable normalization method
was modified to highlight and emphasize the distinct distribution
of each amino acid at the same position. If $V_j$ denotes the standard
variance of $X_{ij}$ ($i = 1, 2, \ldots, 20$), i.e., the deviation of frequencies of
each amino acid at the same $j$th position, then
$$X'_{ij} = \frac{X_{ij} - np}{\sqrt{V_j}}$$
denotes the new normalization of $X_{ij}$ and ensures it obeys the standard
normal distribution. Thus, the posterior probability $p_j$
($j = 1, 2, \ldots, 2n$) is coded by the adapted normal distribution as
follows:
$$p_j = P(X \le X_{ij}) = \varphi(X'_{ij}),$$
where $\varphi(x)$ is the standard normal distribution function given by
$$\varphi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.$$
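
As a rough illustration of the adapted normalization, the sketch below maps a table of counts X_ij to values φ(X'_ij). It is not taken from O-GlcNAcPRED: the function names are hypothetical, V_j is assumed to be the population variance of the 20 counts at position j (the text does not say whether the population or sample variance is meant), and a position with zero variance is assumed to map to the neutral value 0.5.

```python
import math
from statistics import pvariance


def standard_normal_cdf(x):
    """Standard normal distribution function, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def adapted_normal_scores(counts, n, p=1 / 20):
    """Map counts[i][j] = X_ij to phi((X_ij - n*p) / sqrt(V_j)).

    counts : one row per amino acid, one column per position
    n      : number of peptide sequences in the dataset
    p      : assumed probability of each amino acid at each position
    """
    n_positions = len(counts[0])
    scores = [[0.0] * n_positions for _ in counts]
    for j in range(n_positions):
        column = [row[j] for row in counts]
        # Assumption: V_j is the population variance across amino acids.
        v_j = pvariance(column)
        for i, x_ij in enumerate(column):
            if v_j == 0:
                # Assumption: no spread at this position, use a neutral value.
                scores[i][j] = 0.5
            else:
                scores[i][j] = standard_normal_cdf((x_ij - n * p) / math.sqrt(v_j))
    return scores
```

Applying this separately to the count tables of the positive and negative datasets and looking up each residue of a peptide would give the 2n-dimensional feature vector described above.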
2.2.5 Position-Specific Scoring Matrix
Evolutionary information is an important characteristic of proteins,
because the conserved residues at specific sequence sites are under
strong selective pressure and therefore are always functionally relevant
[32–35]. In [9, 10], a position-specific scoring matrix (PSSM)
was used to encode the evolutionary information of a protein
sequence. The PSSM profiles were obtained by PSI-BLAST searches
[36] against nonredundant sequences at O-GlcNAcylated sites.
The matrix of $(2n+1) \times 20$ elements had rows centered on the
substrate sites (serine/threonine) extracted from the PSSM profile,
where $2n+1$ represents the window size and 20 represents the
position-specific scores for each type of amino acid. Then, the
$(2n+1) \times 20$ matrix was transformed into a $20 \times 20$ matrix by
summing up the rows that were associated with the same type of
amino acid. Finally, every element in the $20 \times 20$ matrix was
divided by the window length $2n+1$ and then normalized using
240 Cangzhi Jia and Yun Zuo