134 A.E. Kister et al.
Table 7.1. Identification of four families of sandwich-like proteins in 11 distinct
genomes. The four protein families are classified as in the SCOP database: (1) PL,
the lipoxygenase N-terminal domain family; (2) AT, the alpha-toxin C-terminal
domain family; (3) AD, the 30-kD adipocyte complement-related protein family;
(4) TR, the TRANCE/RANKL cytokine protein domain family. The first column
lists the organisms from which the genomes are derived. The second column gives
the number of proteins sequenced from each genome. The “MSD” columns give
the number of sequences of each family (PL, AT, AD, or TR) found in the genome
using our method of sequence determinants (MSD). The “HMM” columns give
the number of sequences of the respective family found using hidden Markov
models.
                                         PL          AT          AD          TR
Genomes                     Proteins  HMM  MSD    HMM  MSD    HMM  MSD    HMM  MSD
Arabidopsis thaliana           25617    8   11      4    5      0    0      0    0
Clostridium acetobutylicum      3672    0    1      1    2      0    0      0    3
Clostridium perfringens         2660    0    2      1    1      0    0      0    0
Mesorhizobium loti              6752    1    1      0    2      0    0      0    0
Pseudomonas aeruginosa          5567    0    0      1    0      0    0      0    0
Caenorhabditis elegans         20448    5    9      0    0      0    0      0    0
Drosophila melanogaster        14335    2    5      0    0      0    0      1    1
Escherichia coli K12            4289    0    0      0    1      0    0      0    0
Escherichia coli O157:H7        5361    0    1      0    1      0    0      0    0
Bacillus halodurans             4066    0    0      0    0      0    0      0    0
Lactococcus lactis              2266    0    1      0    0      1    1      0    0
The results of applying the search algorithm, which uses the sequence
determinants of 4 protein families, to 11 different genomes are presented in
Table 7.1. The “MSD” column of the table shows how many proteins of a given
family were found in the respective genome by our algorithm. For comparison,
the “HMM” column gives the number of proteins of the family found using the
HMM search procedure, considered to be the most powerful of the currently
used methods [34].
Overall, both methods found approximately the same number of SPs in
the 11 genomes. All sequences found by HMM, except one, were also detected
by our approach. However, our method revealed a number of additional
sequences that can be putatively assigned to the four families. For the most
part, these “additional” proteins are labeled as “unrecognized proteins” in
the genome. This suggests that our approach can identify even those SPs
that are “hidden” from the HMM search procedure. Further investigation is
necessary to determine whether these “candidate” sequences indeed qualify to
join the respective SP families. Our approach also provides an independent
check on the accuracy of the HMM-based algorithm.
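
As a reading aid for Table 7.1, the per-family comparison can be tallied with a short script. This is a minimal sketch, not part of the original analysis. The counts are transcribed from the table, and the assignment of column pairs to families in the order PL, AT, AD, TR is assumed from the caption. The names `TABLE_7_1`, `totals_by_family`, and `hmm_exceeds_msd` are hypothetical. The script compares counts only, so it cannot show whether the two methods found the same individual sequences.

```python
# Counts transcribed from Table 7.1. Each genome maps to four (HMM, MSD)
# pairs. The family order PL, AT, AD, TR is an assumption taken from the
# order in which the caption lists the families.
FAMILIES = ("PL", "AT", "AD", "TR")

TABLE_7_1 = {
    "Arabidopsis thaliana":       ((8, 11), (4, 5), (0, 0), (0, 0)),
    "Clostridium acetobutylicum": ((0, 1), (1, 2), (0, 0), (0, 3)),
    "Clostridium perfringens":    ((0, 2), (1, 1), (0, 0), (0, 0)),
    "Mesorhizobium loti":         ((1, 1), (0, 2), (0, 0), (0, 0)),
    "Pseudomonas aeruginosa":     ((0, 0), (1, 0), (0, 0), (0, 0)),
    "Caenorhabditis elegans":     ((5, 9), (0, 0), (0, 0), (0, 0)),
    "Drosophila melanogaster":    ((2, 5), (0, 0), (0, 0), (1, 1)),
    "Escherichia coli K12":       ((0, 0), (0, 1), (0, 0), (0, 0)),
    "Escherichia coli O157:H7":   ((0, 1), (0, 1), (0, 0), (0, 0)),
    "Bacillus halodurans":        ((0, 0), (0, 0), (0, 0), (0, 0)),
    "Lactococcus lactis":         ((0, 1), (0, 0), (1, 1), (0, 0)),
}


def totals_by_family(table):
    """Return {family: (hmm_total, msd_total)} summed over all genomes."""
    totals = {}
    for i, family in enumerate(FAMILIES):
        hmm_total = sum(row[i][0] for row in table.values())
        msd_total = sum(row[i][1] for row in table.values())
        totals[family] = (hmm_total, msd_total)
    return totals


def hmm_exceeds_msd(table):
    """List (genome, family) cells where HMM reports more hits than MSD."""
    return [
        (genome, FAMILIES[i])
        for genome, row in table.items()
        for i, (hmm, msd) in enumerate(row)
        if hmm > msd
    ]


if __name__ == "__main__":
    for family, (hmm, msd) in totals_by_family(TABLE_7_1).items():
        print(f"{family}: HMM={hmm} MSD={msd}")
    print("Cells with HMM > MSD:", hmm_exceeds_msd(TABLE_7_1))
```

Listing the cells in which the HMM count exceeds the MSD count is one simple way to locate where a sequence detected by HMM may have been missed by the sequence-determinant search.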
