untitled

(ff) #1

162 7 Sequence Similarity Searching Tools


Before discussing the algorithmic details of BLAST, it is helpful to mention
some key concepts. Asegmentis a consecutive sequence of letters from the
DNA or amino acid alphabet. Dashes are used to denote gaps in a sequence.
Asegment pairis a pair of segments having the same length.
The BLAST algorithm consists of the following steps:


  1. Preprocessing step. BLAST masks (omits) simple, low-complexity re-
    gions in the query sequence because these regions are not biologically
    informative.

  2. Word generation step. BLAST generates a list ofwords(i.e., short se-
    quences) from the query sequence. The default word lengths are 3 (for
    amino acid sequences) and 11 (for nucleotide sequences). The words are
    then matched with the database to find the matches whose score exceeds a
    given thresholdT. Such matches are calledhits. BLAST uses BLOSUM62
    as the default scoring matrix for amino acids to find the hits. No gaps are
    allowed during this step of the algorithm. A higher thresholdTincreases
    the speed but also increases the probability that biologically significant
    segment pairs will be missed. Thus there is a tradeoff between speed
    and sensitivity that can be adjusted according to individual needs and the
    resources available.
    A search across the entire target sequence database for exact matches of
    the hits is then performed. The search makes use of database indexes for
    efficiency. As a result, this part of the algorithm is very fast. Matches ob-
    tained in this part of the algorithm are theseedsfor a potential alignment
    between the query and database sequences.

  3. Word extension step. In this step some of the hits obtained in the word
    generation step are extended to find full alignment matches. The original
    BLAST algorithm (now called theone-hitmethod) uses all of the hits with
    a score above the threshold. It then attempts to extend the alignment from
    each matching word in both directions as long as the alignment score is
    no worse than an amountXbelow the maximum score attained so far.
    The resulting alignment of an extended word is called a high-scoring seg-
    ment pair (HSP). The extension step is computationally expensive: ifT
    andXare chosen to attain a reasonable sensitivity, the extension step will
    typically account for more than 90% of the execution time of BLAST. The
    original BLAST algorithm did not allow gaps in matching alignments.
    More recently the BLAST algorithm has used a different technique for
    selecting the hits that will be extended. In thetwo-hitmethod, an HSP

Free download pdf