110 5 Survey of Ontologies in Bioinformatics
tools in sequence analysis. PROSITE was developed in 1988 to systematically
collect biologically significant patterns (Bairoch 1991). PROSITE is based on
multiple sequence alignments (MSAs) and uses two kinds of descriptors:
patterns and generalized profiles (Hulo et al. 2004). Each PROSITE signature
is linked to an annotation document where the user can obtain information
about the signature. To make the three-dimensional (3D) structure easier to
examine, signatures also link to representative entries in the PDB database.
PROSITE is closely related to the SWISS-PROT protein sequence data bank.
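To make the notion of a pattern descriptor concrete, the following is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression. It assumes the commonly documented pattern syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repetition, "<" and ">" for sequence ends). The function name prosite_to_regex and the test sequence are hypothetical, and rarer syntax variants are not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex string.

    Sketch only: handles x, [..], {..}, (n), (n,m), and the < / > anchors
    when they appear at the start or end of an element.
    """
    pattern = pattern.strip().rstrip(".")   # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):         # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):           # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core == "x":                     # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        # [..] groups and single residues are already valid regex.
        repeat = "{" + count + "}" if count else ""
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# Usage: scan a made-up sequence for an N-glycosylation-like pattern.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(regex, "AANGSAA")
print(regex, hit.span() if hit else None)
```

A regular expression captures only the pattern descriptor; generalized profiles are position-specific scoring tables and need a different representation.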
The PROSITE descriptors and documentation can also be accessed through
InterPro, which uses the detailed family annotation provided by PRINTS
(Attwood et al. 2003). InterPro (Mulder et al. 2003) provides an integrated
view of several domain databases and offers a large choice of methods to
identify conserved regions. ClustalW (Thompson et al. 1994) and T-Coffee
(Notredame et al. 2000) are the tools most commonly used to construct the
MSAs. However, when the primary sequences are too divergent, it is useful to
integrate structural information into the MSAs. In addition, about 3% of profiles
in PROSITE are built by using the HMMER hidden Markov model package
(Eddy 1998).
The PROSITE database is available as a text file. The format is defined
in a separate file and uses a variety of characters (forward slashes, commas,
semicolons, etc.) as delimiters.
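As an illustration of working with such a delimited text file, here is a minimal parsing sketch. It assumes a line-oriented layout in which each line starts with a two-letter code (for example ID, AC, DE, PA) and each entry ends with a line containing "//". The sample text, the second accession number, and the function name parse_entries are made up for illustration; the authoritative layout is the one given in the separate format definition.

```python
from collections import defaultdict

# Illustrative sample only; the second entry is hypothetical.
SAMPLE = """\
ID   ASN_GLYCOSYLATION; PATTERN.
AC   PS00001;
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.
//
ID   EXAMPLE_SITE; PATTERN.
AC   PS99999;
DE   Hypothetical entry for illustration.
PA   C-x(2,4)-C.
//
"""

def parse_entries(text):
    """Split flat-file text into a list of {line_code: data} dictionaries."""
    entries = []
    current = defaultdict(list)

    def flush():
        # Join repeated line codes (continuation lines) with a space.
        if current:
            entries.append({k: " ".join(v) for k, v in current.items()})
            current.clear()

    for line in text.splitlines():
        if line.startswith("//"):           # entry terminator
            flush()
            continue
        code, data = line[:2], line[2:].strip()
        if len(code) == 2 and data:
            current[code].append(data)
    flush()                                 # tolerate a missing final "//"
    return entries

entries = parse_entries(SAMPLE)
for entry in entries:
    # The ID line holds "name; type." separated by a semicolon.
    name, kind = [part.strip(" .") for part in entry["ID"].split(";")]
    print(name, kind, entry["PA"])
```

Keeping the raw field text and splitting on the inner delimiters only where needed avoids baking assumptions about every field into the parser.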
BLOCKS blocks.fhcrc.org
Blocks are defined as ungapped multiple alignments corresponding to the
most conserved regions of proteins. Blocks contain “multiple alignment” in-
formation, and the use of the BLOCKS database can improve the detection of
sequence similarities in searches of sequence databases. The BLOCKS data-
base was introduced to aid in the family classification of proteins (Henikoff
and Henikoff 1991). The database has proved valuable because hits to
BLOCKS entries pinpoint the location of conserved motifs, which
are important for further functional characterization (Henikoff
et al. 2000). Furthermore, the BLOCKS database can be used for detecting
distant relationships (Henikoff et al. 1998). The BLOCKS database is the ba-
sis for the BLOSUM substitution tables that are used in amino acid sequence
similarity searching, as explained in section 7.1.
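The link between ungapped blocks and substitution tables can be sketched in a few lines. The example below counts residue pairs column by column in a toy block and converts the observed pair frequencies into log-odds scores in half-bit units, following the general BLOSUM recipe. It is a simplified sketch: the toy block and the function names are hypothetical, and the clustering of similar sequences at an identity threshold (the "62" in BLOSUM62) and the final rounding to integers are omitted.

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(block):
    """Count unordered residue pairs within each column of an ungapped block."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("a block must be ungapped and of uniform width")
    counts = Counter()
    for col in range(width):
        column = [seq[col] for seq in block]
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts

def log_odds(counts):
    """Convert pair counts into log-odds scores (half-bit units, unrounded)."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}   # observed frequencies
    p = Counter()                                         # residue frequencies
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2
    scores = {}
    for (a, b), f in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * math.log2(f / expected)
    return scores

toy_block = ["AC", "AC", "AD"]      # three aligned segments, two columns
counts = pair_counts(toy_block)
scores = log_odds(counts)
for pair in sorted(scores):
    print(pair, counts[pair], round(scores[pair], 2))
```

With so few sequences the scores are not meaningful; the point is only that conserved columns in a block are what supply the pair counts.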
The BLOCKS database contains more than 24,294 blocks from nearly 5000
different protein groups (Henikoff et al. 2000). There are a variety of for-
mats for blocks, including the Blocks, FASTA, and Clustal formats. All of the