COMPUTATIONAL TOOLS 93
Alignments of gene and protein sequences from many different organisms are used to find diagnostic
patterns to characterize protein families; to detect or demonstrate homologies between new sequences
and existing families of sequences; to help predict the secondary and tertiary structures of new se-
quences; and to serve as an essential prelude to molecular evolutionary analysis.
To visualize relationships between genomes, evolutionary biologists develop phylogenetic trees
that portray groupings of organisms, characteristics, genes, or proteins based on their common ances-
tries and the set of common characters they have inherited. One type of molecular phylogenetic tree, for
example, might represent the amino acid sequence of a protein found in several different species. The
tree is created by aligning the amino acid sequences of the protein in question from different species,
determining the extent of differences between them (e.g., insertions, deletions, or substitutions of amino
acids), and calculating a measure of relatedness that is ultimately reflected in a drawing of a tree with
nodes and branches of different lengths.
The examination of phylogenetic relationships of sequences from several different species gen-
erally uses a method known as progressive sequence alignment, in which closely related sequences
are aligned first, and more distant ones are added gradually to the alignment. Attempts at tackling
multiple alignments simultaneously have been limited to small numbers of short sequences be-
cause of the computational power needed to resolve them. Therefore, alignments are most often
undertaken in a stepwise fashion. The algorithm of one commonly used program (ClustalW) con-
sists of three main stages. First, all pairs of sequences are aligned separately in order to calculate a
distance matrix giving the divergence of each pair of sequences; second, a guide tree is calculated
from the distance matrix; and third, the sequences are progressively aligned according to the
branching order in the guide tree.
Alignment algorithms that test genetic similarity face several challenges. The basic premise of a
multiple sequence alignment is that, for each column in the alignment, every residue from every se-
quence is homologous (i.e., has evolved from the same position in a common ancestral sequence). In the
process of comparing any two amino acid sequences, the algorithm must place gaps or spaces at points
throughout the sequences to get the sequences to align. Because inserted gaps are carried forward into
subsequent alignments with additional new sequences, the cumulative alignment of multiple sequences
can become riddled with gaps that sometimes result in an overall inaccurate picture of relationships
between the proteins. To address this problem, gap penalties based on a weight matrix of different
factors are incorporated into the algorithm. For example, the penalty for introducing a gap in aligning
two similar sequences is greater that that for aligning two dissimilar sequences. Gap penalties differ
depending on the length of the sequence, the types of sequence, and different regions of sequence.
Based on the weight matrix and rules for applying penalties, the algorithm compromises in the place-
ment of gaps to obtain the lowest penalty score for each alignment.
The placement of a gap in a protein sequence may represent an evolutionary change—if a gap,
reflecting the putative addition or subtraction of an amino acid to a protein’s structure, is introduced,
the function of the protein may change, and the change may have evolutionary benefit. However, the
change may also be insignificant from a functional point of view. Today, it is known that most insertions
and deletions occur in loops on the surface of the protein or between domains of multidomain proteins,
which means that knowledge of the three-dimensional structure or the domain structure of the protein
can be used to help identify functionally important deletions and insertions.
As the structures of different protein domains and families are increasingly determined by other
means, alignment algorithms that incorporate such information should become more accurate. More
recently, stochastic and iterative optimization methods are being used to refine individual alignments.
Also, some algorithms (e.g., Bioedit) allow users to manually edit the alignment when other information
or “eyeballing” suggests logical placement of gaps.
Exploitation of complete genomic knowledge across closely related species can play an important
role in identifying the functional elements encoded in a genome. Kellis et al. undertook a comparative
analysis of the yeast Saccharomyces cerevisiae based on high-quality draft sequences of three related