Telling the Evolutionary Time: Molecular Clocks and the Fossil Record

Maximum likelihood

Using these nucleotide, amino acid or codon substitution models, genetic distances can be
estimated and ‘corrected’ from the data using an ML algorithm (Felsenstein 1981). The ML
framework also allows different models of sequence evolution to be compared using
likelihood ratio tests (LRTs) (Felsenstein 1981; Muse and Gaut 1994). An LRT compares two
nested models, where the simpler model is a special case of the more parameter-rich one, by
expressing their maximized likelihoods as a ratio. If the more parameter-rich model provides
a significant improvement in likelihood, it is judged to be a better fit to the dataset; the
test thus ascertains whether incorporating extra parameters into a model is justified. Model
selectors such as that of Posada and Crandall (2001) use LRTs to distinguish between the fit
of different models to datasets.
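
The likelihood ratio test described above can be sketched in a few lines of Python. This is
a minimal illustration, not the procedure of any particular model-selection program: the
function names (chi2_sf, likelihood_ratio_test) are hypothetical, the log-likelihood values
in the example are invented, and the statistic is assumed to follow a chi-square
distribution with degrees of freedom equal to the number of extra parameters.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2k: exp(-x/2) * sum over i < k of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df = 2k + 1: erfc(sqrt(x/2)) plus a finite correction series
    term, total = 1.0, 0.0
    for i in range(df // 2):
        if i > 0:
            term *= x / (2 * i + 1)
        total += term
    correction = math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + correction


def likelihood_ratio_test(lnl_simple, lnl_complex, extra_params, alpha=0.05):
    """Compare two nested models from their maximized log-likelihoods.

    Returns the test statistic, its approximate p-value, and whether the
    parameter-rich model is a significantly better fit at level alpha.
    """
    statistic = 2.0 * (lnl_complex - lnl_simple)
    p_value = chi2_sf(statistic, extra_params)
    return statistic, p_value, p_value < alpha


# Invented log-likelihoods for a simple model and a nested extension
# that adds four free parameters (illustrative values only).
stat, p, better = likelihood_ratio_test(-5230.4, -5215.1, extra_params=4)
print(f"2*delta lnL = {stat:.1f}, p = {p:.2e}, richer model preferred: {better}")
```

The chi-square approximation is itself an assumption; it can be unreliable when the simpler
model fixes a parameter at the boundary of its allowed range.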


Parameters

Additional parameters can be incorporated into substitution matrices in order to relax
certain assumptions about rates across sites. Such parameters often provide a significant
improvement in fit to data.
Rates of substitution can vary greatly between individual sites (nucleotides, amino acids,
or codons) in genes. When this variation is not allowed for, estimated distances
underestimate true distances, and the bias increases with depth of divergence (Adachi and
Hasegawa 1995; Yang 1996). The best models to accommodate the variety of observed
patterns are those that incorporate both rate heterogeneity across sites (Yang 1996) and a
proportion of invariant sites (Steel et al. 2000).
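
The underestimation bias can be illustrated with the simplest nucleotide model. The sketch
below assumes the Jukes-Cantor (JC69) model, which is a choice made here for illustration
and is not named in the text above, and compares its standard distance correction with the
gamma-corrected form for the same observed proportion of differing sites. The function
names and the alpha value of 0.5 are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites, skipping gaps and ambiguity codes."""
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """JC69 distance assuming every site evolves at the same rate."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_gamma_distance(p, alpha):
    """JC69 distance with gamma-distributed rates across sites (shape alpha)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


# The gap between the two corrections widens as the observed distance grows,
# which is the depth-dependent bias described in the text.
for p in (0.1, 0.3, 0.5):
    equal_rates = jc69_distance(p)
    gamma_rates = jc69_gamma_distance(p, alpha=0.5)  # alpha chosen for illustration
    print(f"p = {p:.1f}  equal rates: {equal_rates:.3f}  gamma rates: {gamma_rates:.3f}")
```

As alpha grows large the gamma-corrected distance converges on the equal-rates value, so the
difference between the two columns reflects only the assumed degree of rate variation.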


Non-stationarity

The stationarity hypothesis proposes that base composition is constant over the whole tree.
Non-stationarity is often overlooked because of the computation required to deal with it,
but ignoring it can lead to the construction of erroneous phylogenies with high bootstrap
support (Phillips et al. 2001; Tarrío et al. 2001). Galtier and Gouy (1998) reported a method
that can accommodate non-stationarity. Their method has been shown to resolve the correct
phylogenetic tree in cases where the LogDet method failed to do so.
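
Whether base composition is roughly constant across taxa can be checked before tree building.
The sketch below is one simple way to do so and is not the method of Galtier and Gouy (1998):
it computes a chi-square statistic for homogeneity of base counts across sequences. The
function names are hypothetical, the toy alignment is invented, and the statistic treats
sites as independent and ignores the phylogenetic relatedness of the sequences.

```python
from collections import Counter

BASES = "ACGT"


def base_counts(seq):
    """Counts of A, C, G and T in one sequence, ignoring gaps and ambiguity codes."""
    counts = Counter(ch for ch in seq.upper() if ch in BASES)
    return [counts[b] for b in BASES]


def composition_chi2(alignment):
    """Chi-square statistic for homogeneity of base counts across taxa.

    alignment maps taxon name to sequence. Returns the statistic and the
    nominal degrees of freedom, (taxa - 1) * 3, which assumes that all
    four bases occur somewhere in the alignment.
    """
    table = [base_counts(seq) for seq in alignment.values()]
    col_totals = [sum(row[j] for row in table) for j in range(4)]
    grand_total = sum(col_totals)
    if grand_total == 0:
        raise ValueError("alignment contains no unambiguous bases")
    stat = 0.0
    for row in table:
        row_total = sum(row)
        for j in range(4):
            expected = row_total * col_totals[j] / grand_total
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat, (len(table) - 1) * 3


# Invented toy alignment: taxon_c is much richer in G and C than the others.
toy = {
    "taxon_a": "ATATTAGCATTAAT",
    "taxon_b": "ATTATAGCTATAAT",
    "taxon_c": "GCGCCGGCATGCGC",
}
stat, df = composition_chi2(toy)
print(f"chi-square = {stat:.2f} on {df} degrees of freedom")
```

A large statistic relative to its degrees of freedom suggests that the stationarity
hypothesis is doubtful for the dataset.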


Rate heterogeneity

Rate heterogeneity across sites is usually modelled with a discrete gamma distribution,
whose shape is set by the parameter alpha (Yang 1996). When a gamma distribution is included
in ML models and alpha is estimated from the data, it often provides a significant
improvement in likelihood. Gamma may not provide an improvement, however, when alpha values
are assumed without estimation from the data. Bromham et al. (1998) argued this point with
respect to work by Ayala et al. (1998), who assumed a protein alpha value of two across all
genes rather than estimating it from the dataset. Ayala et al. (1998) generated a
noticeably shallower divergence estimate (Figure 3.3) compared with studies such as Wray
et al. (1996) where gamma was estimated from the data. Hidden Markov models (HMM)


PHYLOGENETIC FUSES AND EVOLUTIONARY ‘EXPLOSIONS’ 57