7.1 Basic Concepts 157
formula. Logarithms are used so that total scores can be computed by adding
the scores for individual residues in the alignment. Vector space retrieval for
text databases uses the same technique. For convenience, $s_{ij}$ is often rounded
to the nearest integer.
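The additive property can be illustrated with a minimal sketch. It assumes the
common log-odds form, in which a score is a scaled logarithm of the ratio
between the observed pair frequency and the frequency expected by chance. The
two-letter alphabet, the frequencies, the scale factor of 2 (half-bit units),
and the function names are all hypothetical and chosen only for illustration.

```python
import math

def log_odds(q_ij, p_i, p_j, scale=2.0):
    """Unrounded log-odds score; scale=2.0 with log2 gives half-bit units (assumed)."""
    return scale * math.log2(q_ij / (p_i * p_j))

# Hypothetical frequencies for a toy two-letter alphabet (not real data).
p = {"A": 0.5, "B": 0.5}                      # background frequencies
q = {("A", "A"): 0.4, ("B", "B"): 0.4,        # aligned-pair frequencies
     ("A", "B"): 0.1, ("B", "A"): 0.1}

# Scoring matrix with each entry rounded to the nearest integer.
s = {(a, b): round(log_odds(f, p[a], p[b])) for (a, b), f in q.items()}

def alignment_score(x, y, matrix):
    """Score an ungapped alignment by adding the scores of the aligned residues."""
    return sum(matrix[(a, b)] for a, b in zip(x, y))

total = alignment_score("AAB", "ABB", s)

# Before rounding, the sum of logs equals the log of the product of the ratios.
pairs = list(zip("AAB", "ABB"))
sum_of_logs = sum(log_odds(q[(a, b)], p[a], p[b]) for a, b in pairs)
log_of_product = 2.0 * math.log2(
    math.prod(q[(a, b)] / (p[a] * p[b]) for a, b in pairs)
)
```

Because the scores are logarithms, adding them corresponds to multiplying the
underlying odds ratios, which is why `sum_of_logs` and `log_of_product` agree
up to floating-point error.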
BLOSUM matrices were based on data derived from the BLOCKS database
(Henikoff and Henikoff 1991, 1994), which is a set of ungapped alignments
of protein families (i.e., structurally and functionally related proteins). In
about 2000 blocks of such aligned sequence segments, the sequences of each
block are sorted into closely related clusters, and the probability of a
meaningful amino acid substitution is calculated from the frequencies of
substitutions among these clusters within a family. The number associated with
a BLOSUM matrix (such as BLOSUM62) indicates the cut-off value for percentage
sequence identity that defines the clusters. In particular, BLOSUM62 is built
from blocks in which sequences that are at least 62% identical are merged into
a single cluster, so the substitutions it counts come from sequences that are
less than 62% identical. Note that a lower cut-off value admits more diverse
sequences into each cluster, and the corresponding matrices are therefore
appropriate for examining more distant relationships.
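The clustering step can be sketched as follows. This is a simplified
illustration, not the full BLOSUM procedure: it assumes single-linkage
clustering on fractional identity, and it assumes that each pair of clusters
contributes a total weight of one per column, so that a large cluster of
near-identical sequences does not dominate the counts. The toy block and the
names `identity`, `cluster`, and `pair_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions between two equal-length segments."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster(block, threshold):
    """Single-linkage clustering: merge sequences with identity >= threshold."""
    clusters = []
    for seq in block:
        linked, rest = [], []
        for c in clusters:
            hit = any(identity(seq, member) >= threshold for member in c)
            (linked if hit else rest).append(c)
        # The new sequence joins (and merges) every cluster it is linked to.
        clusters = rest + [[seq] + [m for c in linked for m in c]]
    return clusters

def pair_counts(clusters):
    """Count aligned residue pairs between clusters, weighting each cluster as one sequence."""
    counts = Counter()
    for c1, c2 in combinations(clusters, 2):
        weight = 1.0 / (len(c1) * len(c2))
        for s1 in c1:
            for s2 in c2:
                for a, b in zip(s1, s2):
                    counts[tuple(sorted((a, b)))] += weight
    return counts

# Toy ungapped block of three aligned segments (not real data).
block = ["AVLK", "AVLR", "SVIK"]

clusters_62 = cluster(block, 0.62)   # AVLK and AVLR (75% identical) are merged
clusters_80 = cluster(block, 0.80)   # every sequence stays in its own cluster
counts_62 = pair_counts(clusters_62)
```

With the 62% cut-off, the two similar segments act as a single sequence, so
the K/R difference between them contributes only half a count when compared
with the third segment, and substitutions inside a cluster are not counted at
all.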
The PAM matrices are derived from sets of high-confidence alignments of many
homologous proteins, in which the frequencies of all substitutions are
assessed. The PAM matrices were calculated based on a certain model of
evolutionary distance from alignments of closely related sequences (about
85% identical) from 34 “superfamilies” grouped into 71 evolutionary trees
and containing 1572 point mutations. Phylogenetic trees were reconstructed
based on these sequences to determine the ancestral sequence for each align-
ment. Substitutions were tallied by type, normalized over usage frequencies,
and then converted to log-odds scores. A PAM1 matrix corresponds to an
evolutionary distance at which, on average, 1 out of 100 amino acids has
undergone substitution. Multiplying PAM1 by itself generates PAM2, and more
generally $(\mathrm{PAM1})^n$ gives the matrix for amino acid sequences that
have undergone $n$ independent steps of mutation. Thus, the PAM250 matrix
corresponds to 130 more steps of mutation than the PAM120 matrix. Hence, for
aligning closely related amino acid sequences, the PAM120 matrix is a good
choice; for aligning more distantly related amino acid sequences, the PAM250
matrix is a more appropriate choice. It should be noted that errors can be amplified
during the multiplication process, and thus higher-order PAM matrices are
more error-prone. By comparison, in a BLOSUM62 matrix, each value is
calculated by dividing the frequency of occurrence of the amino acid pair in
the BLOCKS database, “clustered” at the 62% level, by the probability that the
same amino acid pair aligns purely by chance, and then taking the logarithm of
that ratio. PAM matrices are scaled in