Computational Systems Biology Methods and Protocols.7z

2.2.4 Adapted Normal Distribution Bi-profile Bayes

The bi-profile Bayes feature extraction approach was first proposed by Xu et al. [31], in which each peptide in a training dataset can be encoded as $(p_1, p_2, \ldots, p_n, p_{n+1}, \ldots, p_{2n})$, where $(p_1, p_2, \ldots, p_n)$ represents the posterior probability of each amino acid at each position in the positive dataset and $(p_{n+1}, \ldots, p_{2n})$ represents the posterior probability of each amino acid at each position in the negative dataset. The posterior probability is calculated as the occurrence of each amino acid at each position in the training dataset. This feature extraction approach was improved by encoding the frequency of each amino acid at each position as random variables $X_{ij}$, where $i$ ($i = 1, 2, \ldots, 20$) represents the $i$th amino acid in {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} and $j$ ($j = 1, 2, \ldots, L$) represents the $j$th position. The random variables $X_{ij}$ ($i = 1, 2, \ldots, 20$; $j = 1, 2, \ldots, L$) are independent and obey the same binomial distribution $b(n, p)$, where $n = 339$ is the number of peptide sequences in the positive/negative dataset and $p = 1/20$ is the probability of each amino acid occurring in each position. According to the De Moivre-Laplace theorem [42], the normal form variable $\frac{X_{ij} - np}{\sqrt{np(1-p)}}$ has a limiting cumulative distribution function which approximates a normal distribution $N(0, 1)$. In O-GlcNAcPRED [8], the standard variable normalization method was modified to highlight and emphasize the distinct distribution of each amino acid at the same position. If $V_j$ denotes the standard variance of $X_{ij}$ ($i = 1, 2, \ldots, 20$), i.e., the deviation of frequencies of each amino acid at the same $j$th position, then $X'_{ij} = \frac{X_{ij} - np}{\sqrt{V_j}}$ denotes the new normalization of $X_{ij}$ and ensures it obeys the standard normal distribution. Thus, the posterior probability $p_j$ ($j = 1, 2, \ldots, 2n$) is coded by the adapted normal distribution as follows: $p_j = P(X \le X_{ij}) = \varphi(X'_{ij})$, where $\varphi(x)$ is the standard normal distribution function given by $\varphi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$.
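The encoding described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the O-GlcNAcPRED implementation: the function names (`std_normal_cdf`, `adapted_profile`, `encode`) are hypothetical, $V_j$ is taken to be the population variance of the 20 counts at position $j$, and the handling of $V_j = 0$ and of non-standard residues is an assumption made only so that the sketch runs on any input.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 residues, in the order listed in the text


def std_normal_cdf(x):
    """Standard normal distribution function phi(x), computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def adapted_profile(peptides):
    """Return one dict per position mapping residue -> phi(X'_ij).

    X_ij is the count of residue i at position j, n = len(peptides), p = 1/20.
    V_j is taken here as the population variance of the 20 counts at position j
    (an assumption about what the text calls the "standard variance").
    """
    n = len(peptides)
    length = len(peptides[0])
    if any(len(pep) != length for pep in peptides):
        raise ValueError("all peptides must have the same length")
    expected = n / 20.0  # n * p with p = 1/20
    profile = []
    for j in range(length):
        counts = {aa: 0 for aa in AMINO_ACIDS}
        for pep in peptides:
            if pep[j] in counts:  # non-standard residues are skipped (assumption)
                counts[pep[j]] += 1
        mean = sum(counts.values()) / 20.0
        v_j = sum((c - mean) ** 2 for c in counts.values()) / 20.0
        column = {}
        for aa, c in counts.items():
            # If all 20 counts are equal, V_j = 0; fall back to z = 0 (assumption).
            z = (c - expected) / math.sqrt(v_j) if v_j > 0 else 0.0
            column[aa] = std_normal_cdf(z)
        profile.append(column)
    return profile


def encode(peptide, pos_profile, neg_profile):
    """Concatenate the per-position values from the positive and negative profiles."""
    def lookup(profile):
        # Residues outside the 20-letter alphabet get 0.5 (assumption).
        return [profile[j].get(aa, 0.5) for j, aa in enumerate(peptide)]
    return lookup(pos_profile) + lookup(neg_profile)


# Toy data for illustration only; real datasets would hold n = 339 peptides each.
positives = ["ASTK", "AGTK", "ASSR"]
negatives = ["GLVE", "KLVD", "GPVE"]
vector = encode("ASTK", adapted_profile(positives), adapted_profile(negatives))
```

For a peptide of length 4 the resulting `vector` has 8 entries: the first four come from the positive profile and the last four from the negative profile, mirroring the two halves of the feature vector described in the text.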

2.2.5 Position-Specific Scoring Matrix

Evolutionary information is an important characteristic of proteins, because the conserved residues at specific sequence sites are under strong selective pressure and therefore are always functionally relevant [32–35]. In [9, 10], a position-specific scoring matrix (PSSM) was used to encode the evolutionary information of a protein sequence. The PSSM profiles were obtained by PSI-BLAST searches [36] against nonredundant sequences at O-GlcNAcylated sites. The matrix of $(2n+1) \times 20$ elements had rows centered on the substrate sites (serine/threonine) extracted from the PSSM profile, where $2n+1$ represents the window size and 20 represents the position-specific scores for each type of amino acid. Then, the $(2n+1) \times 20$ matrix was transformed into a $20 \times 20$ matrix by summing up the rows that were associated with the same type of amino acid. Finally, every element in the $20 \times 20$ matrix was divided by the window length $2n+1$ and then normalized using

240 Cangzhi Jia and Yun Zuo
