COMPUTATIONAL TOOLS 101
4.4.8.2 A Contemporary Example: Protein Family Classification and
Data Integration for Functional Analysis of Proteins
New bioinformatics methods allow inference of protein function using associative analysis (“guilt
by association”) of functional properties to complement the traditional sequence homology-based
methods.^131 Associative properties that have been used to infer function not evident from sequence
homology include co-occurrence of proteins in operons or genome context; proteins sharing common domains
in fusion proteins; proteins in the same pathway, subcellular network, or complex; proteins with
correlated gene or protein expression patterns; and protein families with correlated taxonomic distribution
(common phylogenetic or phyletic patterns).
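One of the associative signals listed above, correlated taxonomic distribution, can be sketched as a comparison of presence/absence profiles across genomes. The example below is a minimal illustration only: the protein names, genome labels, profiles, the choice of Jaccard similarity, and the 0.7 threshold are all assumptions made for the sketch, not values taken from the cited work.

```python
from itertools import combinations

# Hypothetical column labels: one position per genome in each profile below.
GENOMES = ["genome_A", "genome_B", "genome_C", "genome_D", "genome_E", "genome_F"]

# Hypothetical phyletic profiles: 1 = a homolog is present in that genome.
PROFILES = {
    "prot_1": (1, 0, 1, 1, 0, 1),
    "prot_2": (1, 0, 1, 1, 0, 1),  # identical to prot_1
    "prot_3": (1, 0, 1, 0, 0, 1),  # differs from prot_1 in one genome
    "prot_4": (0, 1, 0, 0, 1, 0),  # present only where prot_1 is absent
}


def jaccard(a, b):
    """Jaccard similarity of two binary profiles: shared presences / any presence."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def linked_pairs(profiles, threshold=0.7):
    """Return (protein, protein, score) for pairs whose similarity meets the threshold."""
    hits = []
    for p, q in combinations(sorted(profiles), 2):
        score = jaccard(profiles[p], profiles[q])
        if score >= threshold:
            hits.append((p, q, round(score, 2)))
    return hits


pairs = linked_pairs(PROFILES)
for p, q, score in pairs:
    print(f"{p} ~ {q}: {score}")
```

Pairs that pass the threshold are only candidates for a functional link; in practice such a signal would be weighed together with the other associative properties rather than used alone.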
Coupling protein classification and data integration allows associative studies of protein family,
function, and structure.^132 An example is provided in Figure 4.4, which illustrates how the collective
use of protein family, pathway, and genome context in bacteria helped researchers to identify a
long-sought human gene associated with the disorder methylmalonic aciduria.
Domain-based or structural classification-based searches allow identification of protein families
sharing domains or structural fold classes. Functional convergence (unrelated proteins with the same
activity) and functional divergence (related proteins with different activities) are revealed by the
relationships between the enzyme classification and protein family classification. With the underlying
taxonomic information, protein families that occur in given lineages can be identified. Combining
phylogenetic pattern and biochemical pathway information for protein families allows identification of
alternative pathways to the same end product in different taxonomic groups, which may offer attractive
drug targets. The systematic approach to protein family curation using integrative data leads to novel
predictions and functional inferences for uncharacterized “hypothetical” proteins, and to detection and
correction of genome annotation errors (a few examples are listed in Table 4.2). Such studies may serve
as a basis for further analysis of protein functional evolution and its relationship to the coevolution
of metabolic pathways, cellular networks, and organisms.
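The cross-check between enzyme classification and family classification described in this paragraph can be sketched as two groupings over the same annotation table. This is a toy illustration under stated assumptions: the protein, family, and activity identifiers are hypothetical placeholders, divergence is read here as one family spanning several activities, and membership in different families is treated as a stand-in for "unrelated," which a real analysis would need to verify.

```python
from collections import defaultdict

# Hypothetical (protein, family, enzyme activity) annotations; all identifiers are placeholders.
ANNOTATIONS = [
    ("protA", "FAM_001", "activity_1"),
    ("protB", "FAM_002", "activity_1"),
    ("protC", "FAM_003", "activity_2"),
    ("protD", "FAM_003", "activity_3"),
    ("protE", "FAM_004", "activity_4"),
]


def convergence_and_divergence(annotations):
    """Group annotations both ways and keep only the one-to-many cases.

    convergent: activity -> families, where several families share one activity
    divergent:  family -> activities, where one family covers several activities
    """
    families_by_activity = defaultdict(set)
    activities_by_family = defaultdict(set)
    for _protein, family, activity in annotations:
        families_by_activity[activity].add(family)
        activities_by_family[family].add(activity)

    convergent = {a: sorted(f) for a, f in families_by_activity.items() if len(f) > 1}
    divergent = {f: sorted(a) for f, a in activities_by_family.items() if len(a) > 1}
    return convergent, divergent


convergent, divergent = convergence_and_divergence(ANNOTATIONS)
print("candidate convergence:", convergent)
print("candidate divergence:", divergent)
```

Adding a taxonomic lineage column to the same table would let the groupings be restricted to given lineages, in the spirit of the lineage-specific family identification mentioned above.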
Underlying this approach is the availability of resources that provide analytical tools and data. For
example, the Protein Information Resource (PIR) is a public bioinformatics resource that provides an
advanced framework for comparative analysis and functional annotation of proteins. PIR recently
joined the European Bioinformatics Institute and Swiss Institute of Bioinformatics to establish
UniProt,^133 an international resource of protein knowledge that unifies the PIR, Swiss-Prot, and TrEMBL
databases. Central to the PIR-UniProt functional annotation of proteins is the PIRSF (SuperFamily)
classification system,^134 which classifies whole proteins into a network structure that reflects
their evolutionary relationships. This framework is supported by the iProClass integrated database of
protein family, function, and structure,^135 which provides value-added descriptions of all UniProt
proteins with rich links to more than 50 other databases of protein family, function, pathway,
interaction, modification, structure, genome, ontology, literature, and taxonomy. As a core resource,
the PIR environment is widely used by researchers to develop other bioinformatics infrastructures and
algorithms and to enable basic and applied scientific research, as shown by examples in Table 4.3.
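The general idea of an integrated database that links each protein to several kinds of annotation can be sketched as a merge of source tables on a shared accession. The sketch below is not the iProClass schema or any PIR or UniProt interface; the accessions, table names, and field names are hypothetical and chosen only to illustrate the pattern.

```python
from collections import defaultdict

# Hypothetical source tables, each keyed by a shared protein accession.
FAMILY = {"ACC0001": "FAM_001", "ACC0002": "FAM_002"}
PATHWAY = {"ACC0001": ["pathway_X"], "ACC0003": ["pathway_Y"]}
STRUCTURE = {"ACC0002": ["struct_1"]}


def integrate(**sources):
    """Merge named source tables into one record per accession."""
    records = defaultdict(dict)
    for name, table in sources.items():
        for accession, value in table.items():
            records[accession][name] = value
    return dict(records)


def lacking(records, field):
    """List accessions whose integrated record has no entry for the given field."""
    return sorted(acc for acc, rec in records.items() if field not in rec)


records = integrate(family=FAMILY, pathway=PATHWAY, structure=STRUCTURE)
unclassified = lacking(records, "family")
print(records["ACC0001"])
print("no family assignment:", unclassified)
```

Once records are merged this way, gaps such as a protein with a pathway annotation but no family assignment become easy to list, which is one way integrated data can point curators toward proteins that need attention.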
(^131) E.M. Marcotte, M. Pellegrini, M.J. Thompson, T.O. Yeates, and D. Eisenberg, “Combined Algorithm for Genome-wide
Prediction of Protein Function,” Nature 402(6757):83-86, 1999.
(^132) C.H. Wu, H. Huang, A. Nikolskaya, Z. Hu, and W.C. Barker, “The iProClass Integrated Database for Protein Functional
Analysis,” Computational Biology and Chemistry 28(1):87-96, 2004.
(^133) R. Apweiler, A. Bairoch, C.H. Wu, W.C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, et al., “UniProt: Universal Protein
Knowledgebase,” Nucleic Acids Research 32(Database issue):D115-D119, 2004.
(^134) C.H. Wu, A. Nikolskaya, H. Huang, L.S. Yeh, D.A. Natale, C.R. Vinayaka, Z.Z. Hu, et al., “PIRSF Family Classification
System at the Protein Information Resource,” Nucleic Acids Research 32(Database issue):D112-D114, 2004.
(^135) C.H. Wu, H. Huang, A. Nikolskaya, Z. Hu, and W.C. Barker, “The iProClass Integrated Database for Protein Functional
Analysis,” Computational Biology and Chemistry 28(1):87-96, 2004.