7 Sequence Similarity Searching Tools
like DNA, a protein must fold into a functionally competent 3D structure.
Summary
- In addition to BLAST searches for nucleotide and amino acid sequences,
  there are search types that take into account the translation from nucleotide
  to amino acid.
- There are publicly available BLAST web services for searches done with
  one sequence at a time.
- Clusters of computers are frequently used for performing large batches of
  BLAST searches.
7.4.3 Scores and Values
The output of a BLAST search consists of a set of HSPs annotated with various
measures of their statistical significance. The score of each HSP is usually
denoted by S and is called the raw score. The raw score depends on the various
customization parameters of the search such as the scoring matrix. The
normalized score adjusts the raw score so that alignment scores from different
searches can be compared (Altschul et al. 1997). The normalized score is
S′ = (λS − ln K) / ln 2, where λ and K are the Karlin-Altschul statistics
(Karlin and Altschul 1990, 1993). The reason why one divides by ln 2 is so that
the units of the normalized score are in bits, a term borrowed from information
theory (Altschul 1991). As a result, S′ is also called the bit score.
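The normalization formula can be sketched in a few lines of Python. This is only an illustration: the function name bit_score and the numeric values of λ and K below are assumptions chosen for the example, not values given in the text. In practice λ and K depend on the scoring matrix and gap penalties and are reported by the search program.

```python
import math

def bit_score(raw_score, lam, k):
    """Return the normalized (bit) score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Hypothetical Karlin-Altschul parameters, for illustration only.
EXAMPLE_LAMBDA = 0.267
EXAMPLE_K = 0.041

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, EXAMPLE_LAMBDA, EXAMPLE_K), 1))
```

Because the raw score is rescaled by λ and shifted by ln K, two searches run with different scoring systems produce bit scores on a common scale, which is what makes them comparable.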
The HSP with the largest score is called the maximal-scoring segment pair
(MSP). Because the MSP is the best match of the query, it is the most
important. One should be careful when using MSP scores from multiple queries.
Since the MSP score is a maximum, its probability distribution is given by
the extreme value distribution, also known as the Fisher-Tippett or log-Weibull
distribution. This distribution is not the same as a normal distribution even
when scores in general are normally distributed. This distribution is shown
in figure 7.3 where it is compared with the normal distribution.
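The difference between the distribution of a maximum and a normal distribution can be seen in a small simulation. The sketch below assumes, purely for illustration, that individual scores are independent standard normal values. This is a simplification and not a model of real BLAST scores, and the function names are hypothetical. It draws many batches of scores, keeps the maximum of each batch, and measures the skewness of those maxima. A normal distribution has skewness zero, while the maxima are skewed toward high values.

```python
import random
import statistics

def simulate_max_scores(n_scores=200, n_trials=2000, seed=0):
    """Return the maximum of n_scores normal draws, repeated n_trials times."""
    rng = random.Random(seed)
    return [
        max(rng.gauss(0.0, 1.0) for _ in range(n_scores))
        for _ in range(n_trials)
    ]

def skewness(values):
    """Sample skewness: the mean cubed deviation in units of standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

maxima = simulate_max_scores()
print("mean of maxima:", round(statistics.fmean(maxima), 2))
print("skewness of maxima:", round(skewness(maxima), 2))
```

The positive skewness reflects the long right tail of the extreme value distribution, which is why significance estimates based on a normal approximation would understate how often high maximal scores occur by chance.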
Sequence similarity searches are commonly used to determine the functionality
of a sequence by comparing it with sequences whose functionality
is known. Inferring functionality is reasonable only when the similarity is
statistically significant. To determine statistical significance one compares
the actual search result with what would be expected for a search using a
random query sequence. The expectation value for a score is the number of