162 7 Sequence Similarity Searching Tools
Before discussing the algorithmic details of BLAST, it is helpful to mention
some key concepts. Asegmentis a consecutive sequence of letters from the
DNA or amino acid alphabet. Dashes are used to denote gaps in a sequence.
Asegment pairis a pair of segments having the same length.
The BLAST algorithm consists of the following steps:
- Preprocessing step. BLAST masks (omits) simple, low-complexity re-
gions in the query sequence because these regions are not biologically
informative. - Word generation step. BLAST generates a list ofwords(i.e., short se-
quences) from the query sequence. The default word lengths are 3 (for
amino acid sequences) and 11 (for nucleotide sequences). The words are
then matched with the database to find the matches whose score exceeds a
given thresholdT. Such matches are calledhits. BLAST uses BLOSUM62
as the default scoring matrix for amino acids to find the hits. No gaps are
allowed during this step of the algorithm. A higher thresholdTincreases
the speed but also increases the probability that biologically significant
segment pairs will be missed. Thus there is a tradeoff between speed
and sensitivity that can be adjusted according to individual needs and the
resources available.
A search across the entire target sequence database for exact matches of
the hits is then performed. The search makes use of database indexes for
efficiency. As a result, this part of the algorithm is very fast. Matches ob-
tained in this part of the algorithm are theseedsfor a potential alignment
between the query and database sequences. - Word extension step. In this step some of the hits obtained in the word
generation step are extended to find full alignment matches. The original
BLAST algorithm (now called theone-hitmethod) uses all of the hits with
a score above the threshold. It then attempts to extend the alignment from
each matching word in both directions as long as the alignment score is
no worse than an amountXbelow the maximum score attained so far.
The resulting alignment of an extended word is called a high-scoring seg-
ment pair (HSP). The extension step is computationally expensive: ifT
andXare chosen to attain a reasonable sensitivity, the extension step will
typically account for more than 90% of the execution time of BLAST. The
original BLAST algorithm did not allow gaps in matching alignments.
More recently the BLAST algorithm has used a different technique for
selecting the hits that will be extended. In thetwo-hitmethod, an HSP