Catalyzing Inquiry at the Interface of Computing and Biology

94 CATALYZING INQUIRY

species (S. paradoxus, S. mikatae, and S. bayanus).^105 This analysis resulted in significant revisions of the
yeast gene catalogue, affecting approximately 15 percent of all genes and reducing the total count by
about 500 genes. Seventy-two genome-wide elements were identified, including most known regula-
tory motifs and numerous new motifs, and a putative function was inferred for most of these motifs.
The power of the comparative genomic approach arises from the fact that sequences maintained by
selection (i.e., those that confer some evolutionary benefit or have some useful function) tend to be
conserved as a species evolves, while other sequences are not. When a given genome of interest is
compared with closely related genomes, conserved sequences become much more obvious than they
would be if functional elements had to be identified by examining the genome of interest alone. Thus, it
is possible, at least in principle, to identify functional elements on the basis of conservation
alone, without relying on previously known groups of co-regulated genes or on data from
gene expression or transcription factor binding experiments.
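The idea of finding functional elements from conservation alone can be illustrated with a minimal sketch. It assumes the sequences have already been aligned, scores each alignment column by the share of sequences that agree on the most common character, and reports windows that are fully conserved. The function names, the window width, the threshold, and the toy alignment are all hypothetical and are not taken from the study cited above.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Per column, the fraction of sequences sharing the most common character.

    Gap characters, if present, are counted like any other character.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_windows(scores, width, threshold):
    """Start indices of windows whose mean conservation is at least threshold."""
    hits = []
    for start in range(len(scores) - width + 1):
        window = scores[start:start + width]
        if sum(window) / width >= threshold:
            hits.append(start)
    return hits


# Made-up alignment of one region from four hypothetical related species.
toy_alignment = [
    "ACGTTGACCA",
    "ACGTTGATCG",
    "ACGTTGAGCT",
    "ACGTTGACTA",
]
scores = column_conservation(toy_alignment)
print(conserved_windows(scores, width=5, threshold=1.0))
```

In this toy input the first seven columns are identical in all four sequences, so only windows lying entirely within them are reported. A realistic analysis would also need a model of how much conservation to expect by chance given the evolutionary distance between the species.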
Molecular phylogenetic trees that graphically represent the differences between species are usually
drawn with branch lengths proportional to the amount of evolutionary divergence between the two
nodes they connect. The longer the path along the branches between two sequences, the more divergent
those sequences are. Methods for calculating phylogenetic trees fall into two general categories: (1)
distance-matrix methods, also known as clustering or algorithmic methods, and (2) discrete data
methods. In distance-matrix methods, the percentage of sequence difference (or distance) is calculated for
each pairwise combination of the sequences being compared; then the distances are assembled into a tree. In
contrast, discrete data methods examine each column of the final alignment separately and look for the
tree that best accommodates all of the information, according to optimality criteria—for example, the
tree that requires the fewest character state changes (maximum parsimony), the tree that best fits an
evolutionary model (maximum likelihood), or the tree that is most probable, given the data (Bayesian
inference). Finally, “bootstrapping” analysis tests how well the dataset as a whole supports the proposed
tree structure by drawing random resamples of the dataset (typically alignment columns sampled with
replacement), building a tree from each resample, and calculating the frequency with which each part of
the proposed tree is reproduced across the resamples.
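A minimal sketch of the distance-matrix and bootstrapping ideas follows. It computes the fraction of differing sites for every pair of aligned sequences, takes the closest pair as a stand-in for the first grouping a clustering method would make, and then resamples alignment columns with replacement to see how often that grouping recurs. This is a deliberate simplification: a real analysis would build a complete tree for every replicate. The function names, the species labels, and the toy sequences are hypothetical.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(alignment):
    """Map each unordered pair of names to its pairwise distance."""
    return {
        frozenset((m, n)): p_distance(alignment[m], alignment[n])
        for m, n in combinations(sorted(alignment), 2)
    }


def closest_pair(alignment):
    """The pair with the smallest distance (first such pair on ties)."""
    dists = distance_matrix(alignment)
    return min(dists, key=dists.get)


def bootstrap_support(alignment, replicates=200, seed=0):
    """Share of column-resampled replicates that recover the original closest pair."""
    rng = random.Random(seed)
    names = sorted(alignment)
    n_cols = len(alignment[names[0]])
    reference = closest_pair(alignment)
    hits = 0
    for _ in range(replicates):
        # Sample column indices with replacement, keeping the alignment length.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        resampled = {n: "".join(alignment[n][c] for c in cols) for n in names}
        if closest_pair(resampled) == reference:
            hits += 1
    return hits / replicates


# Made-up aligned sequences for four hypothetical species.
toy = {
    "sp_A": "ACGTACGTAC",
    "sp_B": "ACGTACGTAT",
    "sp_C": "ACGAACTTGT",
    "sp_D": "TCGAACTTGG",
}
print(sorted(closest_pair(toy)))
print(bootstrap_support(toy))
```

A support value near 1 would suggest that the grouping does not depend on a few particular columns, while a low value would suggest that it does. With an alignment this short the estimate is only illustrative.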
Among the difficulties facing computational approaches to molecular phylogeny is the fact that
some sequences (or segments of sequences) mutate more rapidly than others.^106 Multiple mutations at
the same site obscure the true evolutionary difference between sequences. Another problem is the
tendency of highly divergent sequences to group together when being compared regardless of their true
relationships. This occurs because of a background noise problem: with only a limited number of
possible sequence letters (20 in the case of amino acid sequences), even divergent sequences will
fairly often present a false phylogenetic signal due strictly to chance.
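The effect of multiple mutations at the same site can be illustrated with the Jukes-Cantor correction, a standard textbook model that is not named in the text above and is used here only as an example. Under its assumptions (nucleotide sequences, equal base frequencies, and the same substitution rate at every site), an observed fraction p of differing sites corresponds to an estimated d = -(3/4) ln(1 - 4p/3) substitutions per site. The function name below is hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Estimated substitutions per site from the observed fraction p of differing sites.

    Assumes the Jukes-Cantor model for nucleotides: equal base frequencies
    and one substitution rate shared by all sites.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); at 0.75 the signal is saturated")
    return -0.75 * math.log(1 - 4 * p / 3)


# The gap between observed and estimated distance widens as p grows.
for p in (0.05, 0.30, 0.60):
    print(p, round(jukes_cantor_distance(p), 3))
```

As p approaches 0.75, the level of difference expected between unrelated nucleotide sequences, the estimate grows without bound. This is one way of seeing why highly divergent sequences carry little reliable signal. The model does not address the other difficulty noted above, that some sites mutate faster than others, which calls for models with site-specific rates.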

4.4.6 Mapping Genetic Variation Within a Species

The variation that occurs between different species represents the product of reproductive isolation
and population fission over very long time scales during which many mutational changes in genes and
proteins occur. In contrast, variation within a single species is the result of sexual reproduction, genetic

(^105) M. Kellis, N. Patterson, M. Endrizzi, B. Birren, and E.S. Lander, “Sequencing and Comparison of Yeast Species to Identify
Genes and Regulatory Elements,” Nature 423(6937):241-254, 2003.
(^106) A number of interesting references to this problem can be found in the following: M.T. Holder and P.O. Lewis, “Phylogeny
Estimation: Traditional and Bayesian Approaches,” Nature Reviews Genetics 4:275-284, 2003; I. Holmes and W.J. Bruno,
“Evolutionary HMMs: A Bayesian Approach to Multiple Alignment,” Bioinformatics 17(9):803-820, 2001; A. Siepel and D. Haussler,
“Combining Phylogenetic and Hidden Markov Models in Biosequence Analysis,” in Proceedings of the Seventh Annual
International Conference on Computational Molecular Biology, Berlin, Germany, pp. 277-286, 2003; R. Durbin, S. Eddy, A. Krogh,
and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press,
New York, 1998.
