
Thus we can evaluate the likelihood function by computing this sum, for any convenient choice of n. For instance, if we only want to evaluate the likelihood function, then we can do this by running the α recursion from the start to the end of the chain, and then using this result for n = N, making use of the fact that β(z_N) is a vector of 1s. In this case no β recursion is required, and we simply have


$$p(\mathbf{X}) = \sum_{\mathbf{z}_N} \alpha(\mathbf{z}_N). \tag{13.42}$$

Let us take a moment to interpret this result for p(X). Recall that to compute the likelihood we should take the joint distribution p(X, Z) and sum over all possible values of Z. Each such value represents a particular choice of hidden state for every time step; in other words, every term in the summation is a path through the lattice diagram, and recall that there are exponentially many such paths. By expressing the likelihood function in the form (13.42), we have reduced the computational cost from being exponential in the length of the chain to being linear, by swapping the order of the summations and multiplications so that at each time step n we sum the contributions from all paths passing through each of the states z_{nk} to give the intermediate quantities α(z_n).
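To make this concrete, the following is a minimal sketch of the α recursion and the likelihood evaluation (13.42) for a model with K discrete states. The names `pi`, `A`, and `B` (a precomputed table of emission likelihoods p(x_n | z_n)) are illustrative conventions, not notation from the text.

```python
import numpy as np

def forward(pi, A, B):
    """Alpha recursion: alpha[n, k] = p(x_1, ..., x_n, z_n = k).

    pi : (K,)   initial-state distribution p(z_1)
    A  : (K, K) transitions, A[j, k] = p(z_n = k | z_{n-1} = j)
    B  : (N, K) emission likelihoods, B[n, k] = p(x_n | z_n = k)
    """
    N, K = B.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[0]  # alpha(z_1) = p(z_1) p(x_1 | z_1)
    for n in range(1, N):
        # Sum over z_{n-1}: each step costs O(K^2), so the whole
        # recursion is linear in the chain length N.
        alpha[n] = B[n] * (alpha[n - 1] @ A)
    return alpha

# Likelihood via (13.42): sum the final alphas over the K states of z_N.
# p_X = forward(pi, A, B)[-1].sum()
```

Note that the unscaled α values shrink exponentially with n; practical implementations rescale them at each step to avoid numerical underflow, which this sketch omits for clarity.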
Next we consider the evaluation of the quantities ξ(z_{n−1}, z_n), which correspond to the values of the conditional probabilities p(z_{n−1}, z_n | X) for each of the K × K settings for (z_{n−1}, z_n). Using the definition of ξ(z_{n−1}, z_n), and applying Bayes' theorem, we have


$$
\begin{aligned}
\xi(\mathbf{z}_{n-1},\mathbf{z}_n) &= p(\mathbf{z}_{n-1},\mathbf{z}_n\,|\,\mathbf{X}) \\
&= \frac{p(\mathbf{X}\,|\,\mathbf{z}_{n-1},\mathbf{z}_n)\,p(\mathbf{z}_{n-1},\mathbf{z}_n)}{p(\mathbf{X})} \\
&= \frac{p(\mathbf{x}_1,\ldots,\mathbf{x}_{n-1}\,|\,\mathbf{z}_{n-1})\,p(\mathbf{x}_n\,|\,\mathbf{z}_n)\,p(\mathbf{x}_{n+1},\ldots,\mathbf{x}_N\,|\,\mathbf{z}_n)\,p(\mathbf{z}_n\,|\,\mathbf{z}_{n-1})\,p(\mathbf{z}_{n-1})}{p(\mathbf{X})} \\
&= \frac{\alpha(\mathbf{z}_{n-1})\,p(\mathbf{x}_n\,|\,\mathbf{z}_n)\,p(\mathbf{z}_n\,|\,\mathbf{z}_{n-1})\,\beta(\mathbf{z}_n)}{p(\mathbf{X})}
\end{aligned}
\tag{13.43}
$$

where we have made use of the conditional independence property (13.29) together with the definitions of α(z_n) and β(z_n) given by (13.34) and (13.35). Thus we can calculate the ξ(z_{n−1}, z_n) directly by using the results of the α and β recursions.
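As a companion sketch under the same assumed conventions (`A` and `B` as above), the β recursion and the evaluation of (13.43) for all time steps and all K × K state pairs might look like this:

```python
def backward(A, B):
    """Beta recursion: beta[n, k] = p(x_{n+1}, ..., x_N | z_n = k)."""
    N, K = B.shape
    beta = np.ones((N, K))  # beta(z_N) is a vector of 1s
    for n in range(N - 2, -1, -1):
        # Sum over z_{n+1} of beta(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
        beta[n] = A @ (B[n + 1] * beta[n + 1])
    return beta

def xi(alpha, beta, A, B, p_X):
    """Eq. (13.43) for every adjacent pair of time steps.

    Returns an (N-1, K, K) array whose slice t gives, over (j, k),
    p(z_{t+1} = j, z_{t+2} = k | X), i.e. the xi(z_{n-1}, z_n) values
    stored with 0-based time indices.
    """
    # alpha(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) beta(z_n) / p(X), with
    # axis 1 indexing z_{n-1} and axis 2 indexing z_n via broadcasting.
    return (alpha[:-1, :, None] * A[None, :, :]
            * (B[1:] * beta[1:])[:, None, :]) / p_X
```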
Let us summarize the steps required to train a hidden Markov model using the EM algorithm. We first make an initial selection of the parameters θ^old, where θ ≡ (π, A, φ). The A and π parameters are often initialized either uniformly or randomly from a uniform distribution (respecting their non-negativity and summation constraints). Initialization of the parameters φ will depend on the form of the distribution. For instance, in the case of Gaussians, the parameters μ_k might be initialized by applying the K-means algorithm to the data, and Σ_k might be initialized to the covariance matrix of the corresponding K-means cluster. Then we run both the forward α recursion and the backward β recursion and use the results to evaluate γ(z_n) and ξ(z_{n−1}, z_n). At this stage, we can also evaluate the likelihood function; a sketch of this E-step appears below.
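Putting the pieces together, here is a hedged sketch of the initialization and E-step just described, reusing `forward`, `backward`, and `xi` from the sketches above. The helper `init_gaussian_params` and the use of scikit-learn's `KMeans` are illustrative assumptions, not the book's prescription; building the emission table `B` from the Gaussian parameters is left as a comment.

```python
from sklearn.cluster import KMeans  # assumed dependency, used only for the init

def init_gaussian_params(X, K):
    """Illustrative K-means initialization of Gaussian emission parameters.

    X : (N, D) observed sequence. Returns per-state means mu_k and
    covariances Sigma_k taken from the corresponding K-means clusters.
    """
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    mu = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
    Sigma = np.stack([np.cov(X[labels == k].T) for k in range(K)])
    return mu, Sigma

def e_step(pi, A, B):
    """Run both recursions, then evaluate the likelihood, gamma, and xi."""
    alpha = forward(pi, A, B)
    beta = backward(A, B)
    p_X = alpha[-1].sum()          # likelihood, eq. (13.42)
    gamma = alpha * beta / p_X     # gamma(z_n) = alpha(z_n) beta(z_n) / p(X)
    return gamma, xi(alpha, beta, A, B, p_X), p_X

K = 3                              # illustrative number of hidden states
pi0 = np.full(K, 1.0 / K)          # uniform initial-state distribution
A0 = np.full((K, K), 1.0 / K)      # uniform transitions (rows sum to 1)
# For Gaussian emissions, B[n, k] would be N(x_n | mu_k, Sigma_k).
```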
