156 7 Sequence Similarity Searching Tools


a way of lining up the residues in the query sequence with part of a sequence
in the corpus. Such a lining up is called an alignment. In an alignment, the
match can fail to be an exact match in two ways: aligned residues can be
different and there may be gaps in one sequence relative to the other. For
each alignment one can compute a similarity measure or score based on the
residues that match or fail to match and the sizes of the gaps. Matches generally
contribute positively to the overall score while mismatches and gaps
contribute negatively. The scoring matrix specifies the contribution to the
overall score of each possible match and mismatch. This contribution can
be dependent on the position of a residue in the query sequence, in which
case the scoring matrix is called a position-specific scoring matrix (PSSM). Such
matrices are also called “profiles” or “motifs.” If the contributions do not
depend on positions, then the scoring matrix specifies the score associated
with a substitution of one type of residue for another. Such a scoring matrix
is called a substitution matrix. The gap penalties specify the effect of gaps on
the score. The objective of a sequence similarity matching tool is to find the
alignments with the best overall score.
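
As a rough illustration of the scoring described above, the following Python sketch computes the score of one given alignment. It is a minimal sketch under stated assumptions, not the method of any particular tool: the numeric values, the affine (open/extend) form of the gap penalty, and the names substitution_score and alignment_score are all hypothetical and chosen only for this example. A real tool would look up substitution scores in a published matrix.

```python
# Hypothetical toy scores, chosen only for illustration.
MATCH = 2        # score for two identical aligned residues
MISMATCH = -1    # score for two different aligned residues
GAP_OPEN = -3    # assumed penalty for the first position of a gap
GAP_EXTEND = -1  # assumed penalty for each further position of the same gap


def substitution_score(a, b):
    """Stand-in for a substitution matrix lookup (position independent)."""
    return MATCH if a == b else MISMATCH


def alignment_score(row1, row2):
    """Score two equal-length aligned rows, where '-' marks a gap position."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_in = None  # which row (1 or 2) holds the gap that is currently open
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            current = 1 if a == "-" else 2
            # Continuing a gap in the same row costs less than opening one.
            score += GAP_EXTEND if gap_in == current else GAP_OPEN
            gap_in = current
        else:
            score += substitution_score(a, b)
            gap_in = None
    return score


if __name__ == "__main__":
    print(alignment_score("ACGT", "ACGT"))      # exact match
    print(alignment_score("ACGT", "AGGT"))      # one mismatch
    print(alignment_score("AC--GT", "ACTTGT"))  # one gap of size two
```

A search tool would evaluate many candidate alignments in this way and report those with the highest scores. With a position-specific scoring matrix, substitution_score would also take the position in the query sequence as an argument.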
There are a number of ways to compute the alignment score. The primary
distinction is between nucleotide sequences and amino acid sequences.
The scoring for amino acid sequence similarity is more complicated because
there are more kinds of amino acids and because amino acid properties are
more complicated than nucleotide properties. For example, chemical structures
and amino acid frequencies can both be taken into consideration. If two
aligned residues have a very low probability of being homologous, a heavy
penalty score is given for such a mismatch. Protein evolution is believed to
be subject to stronger selective forces than DNA evolution, so some amino acid
substitutions (those that result in Mendelian disorders) are much less well
tolerated functionally than others, because natural selection acts against
them.
The two most commonly used substitution matrices for amino acids are
the point accepted mutation (PAM) (Dayhoff et al. 1978) and the blocks
substitution matrix (BLOSUM) (Henikoff and Henikoff 1992). BLOSUM is more
popular than PAM. In both cases, the entries in the matrix have the form
$s_{ij} = C \log_C(r_{ij})$, where $C$ determines the units by which the entries are
scaled (usually 2 for BLOSUM and 10 for PAM) and $r_{ij}$ is the ratio of the
estimated frequency with which the amino acids $i$ and $j$ are substituted due
to evolutionary descent, to the frequency with which they would be substituted
by chance. The numerator of this ratio is computed by using a sample
of known alignments. This formula is known more succinctly as the log-odds