more flexible and less stable than those in a crystal. Indeed, solution struc-
tures determined by the NMR data are slightly different from crystal struc-
tures. Therefore, NMR is often used to study small and peculiar proteins.
Protein glycosylation is probably the most common and complex type
of co- and post-translational modification encountered in proteins (Lutteke
et al. 2004). Inspection of the protein databases reveals that 70% of all pro-
teins have potential N-glycosylation sites - Asn-X-Ser/Thr, where X is not
Pro (Mellquist et al. 1998). O-glycosylation is even more ubiquitous (Berman
et al. 2000). Consequently, PDB entries contain not only protein structures
but also pure carbohydrate structures. However, to date, there is no standard
nomenclature for carbohydrate residues within the PDB files (Westbrook and
Bourne 2000). For example, although many monosaccharide residues are
defined in the PDB Het Group Dictionary (pdb.rutgers.edu/het_dictionary.txt),
there is no distinction between the α- and the β-forms. Thus, it
is difficult for glycobiologists to find relevant carbohydrate structures from
PDB.
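The N-glycosylation sequon mentioned above (Asn-X-Ser/Thr, where X is not Pro) is simple enough to locate with a pattern search. The following is a minimal sketch in Python, assuming the protein is given as a string of one-letter amino acid codes; the function name find_n_glyc_sites and the example sequence are made up for illustration. A match only marks a potential site, not one that is known to be glycosylated.

```python
import re

# Sequon: Asn (N), then any residue except Pro (P), then Ser (S) or Thr (T).
# The pattern sits inside a lookahead so that overlapping sites are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_n_glyc_sites(sequence):
    """Return the 1-based positions of the Asn in each potential sequon.

    Hypothetical helper for illustration; `sequence` is assumed to be a
    string of one-letter amino acid codes.
    """
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Made-up sequence: N-G-S matches, N-P-T is excluded because X is Pro,
# and N-N-S / N-S-S overlap but are both reported.
print(find_n_glyc_sites("MNGSANPTNNSS"))  # [2, 9, 10]
```

The lookahead is a design choice: a plain pattern would consume the matched residues and could miss a second sequon that starts inside the first.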
The PDB database has two non-XML formats, PDB and mmCIF, that are in
use by many other molecular structure databases. Recently an XML format,
PDBML, defined by an XSD schema, has been introduced at the PDB, and the
automated generation of XML files is driven by the data dictionary
infrastructure in use at the PDB. The current
XML schema file is located at deposit.pdb.org/pdbML/pdbx-v1.000.xsd,
and on the PDB mmCIF resource page at deposit.pdb.org/mmcif/.
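Because PDBML is ordinary XML, it can be read with any standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a small hand-written fragment. The fragment is not a real PDB entry: its namespace URI is a placeholder, and the element names (atom_siteCategory, atom_site, type_symbol, Cartn_x, Cartn_y, Cartn_z) are assumed to follow the mmCIF category and item names, which should be checked against the schema file itself. The helper names local_name and read_atoms are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only. The namespace URI is a
# placeholder and the element names are assumptions, not taken from the schema.
SAMPLE = """\
<PDBx:datablock xmlns:PDBx="http://example.org/pdbx-placeholder" datablockName="DEMO">
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1">
      <PDBx:type_symbol>N</PDBx:type_symbol>
      <PDBx:Cartn_x>1.0</PDBx:Cartn_x>
      <PDBx:Cartn_y>2.0</PDBx:Cartn_y>
      <PDBx:Cartn_z>3.0</PDBx:Cartn_z>
    </PDBx:atom_site>
    <PDBx:atom_site id="2">
      <PDBx:type_symbol>C</PDBx:type_symbol>
      <PDBx:Cartn_x>1.5</PDBx:Cartn_x>
      <PDBx:Cartn_y>2.5</PDBx:Cartn_y>
      <PDBx:Cartn_z>3.5</PDBx:Cartn_z>
    </PDBx:atom_site>
  </PDBx:atom_siteCategory>
</PDBx:datablock>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on each tag."""
    return tag.rsplit("}", 1)[-1]


def read_atoms(xml_text):
    """Return a list of (id, element symbol, x, y, z) tuples, one per atom_site."""
    root = ET.fromstring(xml_text)
    atoms = []
    for elem in root.iter():
        if local_name(elem.tag) != "atom_site":
            continue
        # Collect the child items of this atom record by their local names.
        fields = {local_name(child.tag): (child.text or "").strip() for child in elem}
        atoms.append((
            elem.get("id"),
            fields["type_symbol"],
            float(fields["Cartn_x"]),
            float(fields["Cartn_y"]),
            float(fields["Cartn_z"]),
        ))
    return atoms


for atom in read_atoms(SAMPLE):
    print(atom)
```

Matching on local names rather than on a fixed namespace keeps the sketch independent of the schema version, at the cost of ignoring namespaces altogether.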
SCOP scop.mrc-lmb.cam.ac.uk/scop
The Structural Classification of Proteins database classifies proteins by do-
mains that have a common ancestor based on sequence, structural, and func-
tional evidence (Murzin et al. 1995; Andreeva et al. 2004). In order to under-
stand how multidomain proteins function, it is important to know how they
are created during evolution. Duplication is one of the main sources for cre-
ating new genes and new domains (Lynch and Conery 2000). For examples
of this, see section 1.5. In fact, 98% of human protein domains are duplicates
(Gough et al. 2001; Madera et al. 2004; Muller et al. 2002). Once a domain or
protein has duplicated, it can evolve a new or modified function.
Access to SCOP requires a license. It is available in a non-XML text format.
CATH http://www.biochem.ucl.ac.uk/bsm/cath_new
This database contains domain structures classified into superfamilies and
sequence families (Orengo et al. 1997, 2003). Its name stands for
Class/Architecture/Topology/Homology. Each structural family is expanded with
domain sequence relatives recruited from GenBank using a variety of ef-