160 7 Sequence Similarity Searching Tools
ktup is 2 for amino acid and 6 for nucleotide sequences. The next step is to
extend the matches of length ktup to obtain the highest scoring ungapped
regions. In the third step, these ungapped regions are assessed to determine
whether they could be joined together with gaps, taking into account the
gap penalties. Finally, the highest scoring candidates of the third step are
realigned using the full Smith-Waterman algorithm, but confining the dynamic
programming matrix to a subregion around the candidates. The trade-off
between speed and sensitivity is determined by the value of the ktup
parameter. Higher values of ktup, which represent larger “word” sizes, give
rise to fewer exact hits and hence lower sensitivity, but result in a
faster search. For the purpose of tuning, the ktup parameter will generally
be either 1 or 2 for amino acid sequences and can range from 4 to 6 for
nucleotide sequences.
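The word-matching idea and the effect of ktup can be illustrated with a short Python sketch. It is not the FASTA program's actual implementation: the function names `ktup_hits` and `hits_per_diagonal` and the two toy sequences are assumptions made for illustration. It covers only the exact word lookup and the grouping of hits by diagonal, and it omits scoring, gap joining, and the final Smith-Waterman step.

```python
from collections import defaultdict

def ktup_hits(query, subject, ktup):
    """Return (query_pos, subject_pos) pairs where words of length ktup match exactly."""
    # Index every word of length ktup in the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Scan the subject and look up each of its words in the index.
    hits = []
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits.append((i, j))
    return hits

def hits_per_diagonal(hits):
    """Count hits on each diagonal (subject_pos - query_pos)."""
    # Many hits on one diagonal suggest a candidate ungapped region.
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return dict(counts)

# Hypothetical toy sequences; the subject has a 2-base offset and one mismatch.
query = "GGCGCTGATGGACGAGACC"
subject = "TTGGCGCTGATGCACGAGACC"
for k in (4, 6):
    hits = ktup_hits(query, subject, k)
    print(k, len(hits), hits_per_diagonal(hits))
```

Every exact match of length 6 also contains an exact match of length 4 at the same position, so the larger ktup can never report more hits than the smaller one. This is the reason a larger word size reduces both the work and the sensitivity.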
A sequence file in FASTA format can contain several sequences. Each
sequence in FASTA format begins with a single-line description, followed by
lines of sequence data. The description line must begin with a greater-than
symbol (>) in the first column. An example sequence in FASTA format is
shown in figure 7.1.
>gi|11066424|gb|AF200505.1|AF200503S3 Pongo pygmaeus
GGCGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAAC
TGGAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCC
AAGGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGT
GCGCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCC
AGAGCACCGAGGAGCTGCGGGCGCGCCTCGCCTCCCACCTGCGCAAGCTG
CGCAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGTCTGGCAGT
GTACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCGTCAGCGCCATCC
GCGAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACT
GTGGGCTCCGTGGCCGGCAAGCCGCTGCAGGAGCGGGCCCAGGCCTGGGG
CGAGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACC
GCCTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAG
GAGCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCT
CAAGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCG
GGCTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTG
CCCAGCGACAATCACTGA
Figure 7.1 FASTA format of a 718-bp DNA sequence (GenBank accession number
AF200505.1) encoding exon 4 of Pongo pygmaeus apolipoprotein E (ApoE) gene.
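The format rules described above are simple enough to read with a few lines of Python. The following is a minimal sketch under stated assumptions: the function name `parse_fasta` and the two short example records are hypothetical, and the parser does no validation of the sequence characters.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' in the first column starts a new record.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)

# Hypothetical two-record file contents.
example = """>seq1 first toy record
GGCGCTGATG
GACGAGACC
>seq2 second toy record
CCCAGCGACAATCACTGA
"""
for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

Applied to the record in figure 7.1, such a parser would return the description line without the leading > and the 15 sequence lines joined into one string of 718 bases.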