Catalyzing Inquiry at the Interface of Computing and Biology

Regions of the genome that are not transcribed from DNA into RNA include biological signals (such as
promoters) that flank the coding sequence and regulate the gene’s transcription. Other untranscribed regions of
unknown purpose are found between genes or interspersed within coding sequences.
Genes themselves can occasionally be found nested within one another, and overlapping genes have
been shown to exist on the same or opposite DNA strands.^93 The presence of pseudogenes (nonfunctional
sequences resembling real genes), which are distributed in numerous copies throughout a genome, further
complicates the identification of true protein-coding genes.^94 Finally, it is known that most genes are ultimately
translated into more than one protein through a process that is not completely understood. After
transcription, the exons of a particular gene are spliced together into a single mature mRNA. However,
in a process known as alternative splicing, various splicings omit certain exons, resulting in a family of variants
(“splice variants”) in which the exons remain in sequence, but some are missing. It is estimated that at least
a third of human genes are alternatively spliced,^95 with certain splicing arrangements occurring more
frequently than others. Protein splicing and RNA editing also play an important role. To understand gene
structures completely, all of these sequence features have to be anticipated by gene recognition tools.
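
To make the combinatorics of splice variants concrete, the following Python sketch enumerates every variant in which the exons remain in sequence but some are missing. It is only an illustration under strong assumptions: the function name `splice_variants`, the toy exon strings, and the option to always retain the first and last exon are hypothetical choices, and real splicing machinery produces only a small, regulated subset of these combinations.

```python
from itertools import combinations


def splice_variants(exons, keep_first_last=True):
    """Yield (exon_indices, mRNA) for every order-preserving subset of exons.

    Assumption for illustration only: any subset of exons may be skipped.
    With keep_first_last=True, the first and last exon are always retained.
    """
    n = len(exons)
    for k in range(1, n + 1):
        # combinations() emits index tuples in increasing order, so the
        # exons of each variant stay in their original sequence.
        for idx in combinations(range(n), k):
            if keep_first_last and n > 1 and (idx[0] != 0 or idx[-1] != n - 1):
                continue
            yield idx, "".join(exons[i] for i in idx)


if __name__ == "__main__":
    toy_exons = ["AUG", "GCC", "UUU", "UAA"]  # hypothetical exon sequences
    for indices, mrna in splice_variants(toy_exons):
        print(indices, mrna)
```

With four exons and both end exons fixed, the sketch yields four variants; a gene recognition tool has to anticipate that any of them may appear as a mature mRNA.
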
Two basic approaches have been established for gene recognition: the sequence similarity search, or
lookup method, and the integrated compositional and signal search, or template method (also known as
ab initio gene finding).^96 Sequence similarity search is a well-established computational method for gene
recognition based on the conservation of gene sequences (called homology) in evolutionarily related
organisms. A sequence similarity search program compares a query sequence (an uncharacterized sequence)
of interest with already characterized sequences in a public sequence database (e.g., databases of
The Institute for Genomic Research (TIGR)^97) and then identifies regions of similarity between the sequences.
A query sequence with significant similarity to the sequence of an annotated (characterized) gene
in the database suggests that the two sequences are homologous and have common evolutionary origin.
Information from the annotated DNA sequence or the protein coded by the sequence can potentially be
used to infer gene structure or function of the query sequence, including promoter elements, potential
splice sites, start and stop codons, and repeated segments. Alignment tools, such as BLAST,^98 FASTA, and
Smith-Waterman, have been used to search for the homologous genes in the database.
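
As an illustration of how such alignment tools find regions of similarity, the following Python sketch implements a basic Smith-Waterman local alignment with a linear gap penalty. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen for the example, not parameters taken from any particular tool; production programs use substitution matrices, affine gap penalties, and heavily optimized code.

```python
def smith_waterman(query, ref, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_query, aligned_ref) for the best local alignment.

    Scoring values are illustrative assumptions; a linear gap penalty is used.
    """
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]  # dynamic programming score matrix
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero score is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if query[i - 1] == ref[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            a.append(query[i - 1]); b.append(ref[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(query[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(ref[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    print(smith_waterman("ACGT", "TTACGTTT"))
```

The run time grows with the product of the two sequence lengths, which is one reason heuristic tools such as BLAST and FASTA are preferred for searching large databases.
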
Although sequence similarity search has proven useful in many cases, it has fundamental
limitations. Manning et al. note in their work on the protein kinase complement of the human genome


(^93) I. Dunham, L.H. Matthews, J. Burton, J.L. Ashurst, K.L. Howe, K.J. Ashcroft, D.M. Beare, et al., “The DNA Sequence of
Human Chromosome 22,” Nature 402(6982):489-495, 1999.
(^94) A mitigating factor is that pseudogenes are generally not conserved between species (see, for example, S. Caenepeel, G.
Charydczak, S. Sudarsanam, T. Hunter, and G. Manning, “The Mouse Kinome: Discovery and Comparative Genomics of All
Mouse Protein Kinases,” Proceedings of the National Academy of Sciences 101(32):11707-11712, 2004). This fact provides another clue
in deciding which sequences represent true genes and which represent pseudogenes.
(^95) D. Brett, J. Hanke, G. Lehmann, S. Haase, S. Delbruck, S. Krueger, J. Reich, and P. Bork, “EST Comparison Indicates 38% of
Human mRNAs Contain Possible Alternative Splice Forms,” FEBS Letters 474(1):83-86, 2000.
(^96) J.W. Fickett, “Finding Genes by Computer: The State of the Art,” Trends in Genetics 12(8):316-320, 1996.
(^97) See http://www.tigr.org/tdb/.
(^98) The BLAST 2.0 algorithm, perhaps the most commonly used tool for searching large databases of gene or protein sequences, is
based on the idea that sequences that are truly homologous will contain short segments that will match almost perfectly. BLAST was
designed to be fast while maintaining the sensitivity needed to detect homology in distantly related sequences. Rather than aligning
the full length of a query sequence against all of the sequences in the reference database, BLAST fragments the reference sequences into
sub-sequences or “words” (11 nucleotides long for gene search) constituting a dictionary against which a query sequence is matched.
The program creates a list of all the reference words that show up in the query sequence and then looks for pairs of those words that
occur at adjacent positions on different sequences in the reference database. BLAST uses these “seed” positions to narrow candidate
matches and to serve as the starting point for the local alignment of the query sequence. In local alignment, each nucleotide position in
the query receives a score relative to how well the query and reference sequence match; perfect matches score highest, substitutions of
different nucleotides incur different penalties. Alignment is continued outward from the seed positions until the similarity of query
and reference sequences drops below a predetermined threshold. The program reports the highest scoring alignments, each described by an
E-value, the number of alignments with a score at least this high that would be expected by chance. See, for example, S.F. Altschul, T.L. Madden,
A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein
Database Search Programs,” Nucleic Acids Research 25(17):3389-3402, 1997.
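
The seed-and-extend idea described in the note on BLAST can be sketched in Python as follows. This is a simplified illustration, not the BLAST implementation: it indexes the reference by fixed-length words as the note describes, seeds on single exact word hits rather than pairs of nearby hits, extends without gaps, and omits E-value statistics entirely. The function names and the scoring values (`match=1`, `mismatch=-3`, `x_drop=5`) are assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(reference, w=11):
    """Map every word of length w in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - w + 1):
        index[reference[pos:pos + w]].append(pos)
    return index


def extend_seed(query, reference, q, r, w, match=1, mismatch=-3, x_drop=5):
    """Extend an exact word hit in both directions without gaps.

    Extension stops once the running score falls more than x_drop below the
    best score seen so far. Returns (score, query_start, ref_start, length).
    """
    def walk(qi, ri, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= ri < len(reference):
            score += match if query[qi] == reference[ri] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break
            qi += step
            ri += step
        return best, best_len

    right_score, right_len = walk(q + w, r + w, +1)
    left_score, left_len = walk(q - 1, r - 1, -1)
    total = w * match + left_score + right_score
    return total, q - left_len, r - left_len, w + left_len + right_len


def seed_and_extend(query, reference, w=11):
    """Return distinct ungapped local matches, highest score first."""
    index = build_word_index(reference, w)
    hits = set()  # different seeds inside one match extend to the same hit
    for q in range(len(query) - w + 1):
        for r in index.get(query[q:q + w], ()):
            hits.add(extend_seed(query, reference, q, r, w))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    core = "ACGTACGTTAGCCA"  # hypothetical shared region
    print(seed_and_extend("CC" + core + "AA", "TTTTT" + core + "GGGGG"))
```

Because only positions sharing an exact word are ever extended, most of the reference is never aligned at all, which is the source of the speed advantage over full dynamic programming.
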
