92 CATALYZING INQUIRY
4.4.5 Sequence Alignment and Evolutionary Relationships
A remarkable degree of similarity exists among the genomes of living organisms.^104 Information
about the similarities and dissimilarities of different types of organisms presents a picture of relatedness
between species (i.e., between reproductive groups) and also provides useful clues to the importance,
structure, and function of genes and proteins carried or lost over time in different species.
“Comparative genomics” has become a new discipline within biology to study these relationships.
[Figure 4.3 diagram. States: Intergenic Region, Start Codon (ATG), Coding Region, Stop Codon (TAA). Transition probabilities (TP) shown on the arrows: 1.0, 1.0, .9, .9, .1, .1. Two emission probability tables over A, C, T, G: (.25, .25, .25, .25) and (.9, .03, .03, .04).]
FIGURE 4.3 Hidden Markov model of a compositional signal and search approach for finding a gene in a bacterial
genome.
The model has four features: (1) state of the sequence, of which four states are possible (coding, intergenic, start,
and stop); (2) outputs, defined as the possible nucleotide(s) that can exist at any given state (A, C, T, G at coding
and intergenic states; ATG and TAA at start and stop states, respectively); (3) emission probabilities, that is, the
probability that a given nucleotide will be generated in any particular state; and (4) transition probability (TP), the
probability that the sequence is in transition between two states.
To execute the model, emission and transition probabilities are obtained by training on the characterized genes
in the reference database. The set of all possible combinations of states for the query sequence is then generated,
and an overall probability for each combination of states is calculated. If the combination having the highest
overall probability exceeds a threshold determined using gene sequences in the reference database, the query
sequence is concluded to be a gene.
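
The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by any particular gene finder. In place of generating every combination of states explicitly, it uses the Viterbi dynamic-programming algorithm, which returns the same highest-probability combination without enumerating all of them. The assignment of the two emission tables to the intergenic and coding states, the splitting of the start and stop codons into three single-nucleotide sub-states, the requirement that a sequence begin in the intergenic state, and the names `EMIT`, `TRANS`, and `viterbi` are all assumptions made for this example. In practice the probabilities would be obtained by training, and the returned log probability would be compared against a threshold derived from the reference database.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

# ASSUMPTION: which emission table belongs to which state is chosen for illustration.
# Start (ATG) and stop (TAA) codons are expanded into three one-nucleotide sub-states.
EMIT = {
    "intergenic": {"A": 0.25, "C": 0.25, "T": 0.25, "G": 0.25},
    "coding":     {"A": 0.9, "C": 0.03, "T": 0.03, "G": 0.04},
    "start1": {"A": 1.0}, "start2": {"T": 1.0}, "start3": {"G": 1.0},
    "stop1":  {"T": 1.0}, "stop2":  {"A": 1.0}, "stop3":  {"A": 1.0},
}

# Transition probabilities (TP): self-loops of .9, exits of .1, forced moves of 1.0.
TRANS = {
    "intergenic": {"intergenic": 0.9, "start1": 0.1},
    "start1": {"start2": 1.0}, "start2": {"start3": 1.0}, "start3": {"coding": 1.0},
    "coding": {"coding": 0.9, "stop1": 0.1},
    "stop1": {"stop2": 1.0}, "stop2": {"stop3": 1.0}, "stop3": {"intergenic": 1.0},
}

STATES = list(EMIT)

def viterbi(seq, initial="intergenic"):
    """Return (state labels, log probability) of the most probable state path.

    ASSUMPTION: every sequence starts in the `initial` state with probability 1.
    """
    if not seq:
        return [], NEG_INF
    # score[s] = best log probability of any path that ends in state s
    score = {s: log(EMIT[s].get(seq[0], 0.0)) if s == initial else NEG_INF
             for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev, best = None, NEG_INF
            for p in STATES:
                cand = score[p] + log(TRANS[p].get(s, 0.0))
                if cand > best:
                    best_prev, best = p, cand
            new_score[s] = best + log(EMIT[s].get(ch, 0.0))
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(STATES, key=score.get)
    best_log_prob = score[last]
    path = [last]
    for pointers in reversed(back):
        last = pointers[last]
        path.append(last)
    path.reverse()
    # Collapse sub-states such as "start2" to the label "start".
    return [s.rstrip("123") for s in path], best_log_prob

if __name__ == "__main__":
    query = "CGCGCGCG" + "ATG" + "A" * 12 + "TAA" + "CGCGCGCG"
    labels, log_prob = viterbi(query)
    print(labels)
    print(log_prob)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied along a long sequence.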
(^104) For example, 9 percent of E. coli genes, 9 percent of rice genes, 30 percent of yeast genes, 43 percent of mosquito genes, 75
percent of zebrafish genes, and 94 percent of rat genes have homologs in humans. See
http://iubio.bio.Indiana.edu:8089/all/hgsummary.html (Summary Table August 2005).