7 Sequence Similarity Searching Tools


Information retrieval can take many forms and does not have to be based
on natural language. In bioinformatics, it is very common to base queries on
biological sequences, the biochemical language of cells. Indeed, most
predictions of biological function are obtained by comparing new sequence
data (about which little is known) with existing data (about which there is
prior knowledge). The comparison is performed by using the new sequence as a
query to retrieve similar sequences from a corpus of such data. Such
comparisons are of fundamental importance in computational biology. Similar
sequences are referred to as being homologous.
In this chapter we present the basic concepts necessary for sequence
similarity and the main approaches and tools for sequence similarity
searching. The most commonly used sequence similarity searching tools in
computational biology are FASTA, Basic Local Alignment Search Tool (BLAST),
and the many variations of BLAST. All of these algorithms search a sequence
database for the closest matches to a query sequence. It should be noted that
all of them are database search heuristics, which may completely miss some
significant matches and may produce nonoptimal matches. Of these tools,
BLAST is the most heavily used sequence analysis tool available in the
public domain.
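
To make the idea of searching a database for the closest matches to a query
concrete, the following is a minimal sketch in Python. It is not how FASTA or
BLAST work: those tools are heuristics that avoid examining every position,
whereas this sketch exhaustively slides the query along each database sequence
without gaps and scores each placement by the fraction of identical positions.
The function names, the toy database, and the choice of ungapped percent
identity as the score are all assumptions made for illustration only.

```python
def best_window_identity(query: str, target: str) -> float:
    """Return the highest fraction of identical positions over all
    ungapped placements of `query` along `target` (illustrative score)."""
    q = len(query)
    if q == 0 or len(target) < q:
        return 0.0
    best = 0.0
    for start in range(len(target) - q + 1):
        window = target[start:start + q]
        # Count positions where query and window carry the same symbol.
        matches = sum(a == b for a, b in zip(query, window))
        best = max(best, matches / q)
    return best


def rank_database(query: str, database: dict[str, str]) -> list[tuple[str, float]]:
    """Score every database sequence against the query and sort best-first."""
    scored = [(name, best_window_identity(query, seq))
              for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy database; real corpora hold millions of sequences.
    toy_db = {
        "seq_a": "TTGACGTACGTTAG",  # contains the query exactly
        "seq_b": "TTGACGTTCGTTAG",  # one substitution away
        "seq_c": "CCCCCCCCCCCCCC",  # unrelated
    }
    for name, score in rank_database("ACGTACGT", toy_db):
        print(f"{name}\t{score:.3f}")
```

An exhaustive scan like this takes time proportional to the query length times
the total database size, which is one reason practical tools rely on
heuristics and accept the risk of missed or nonoptimal matches.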

7.1 Basic Concepts


Like information retrieval, sequence similarity searching is a process whereby
a relatively small “query” sequence is compared with a large genomic
“corpus” of sequence information. In a perfect match the query sequence
occurs as a subsequence of the corpus. In practice such perfect matches
seldom occur, so it is necessary to have a measure of similarity. Each potential match is