
Intuitively, we can understand the Viterbi algorithm as follows. Naively, we
could consider explicitly all of the exponentially many paths through the lattice,
evaluate the probability for each, and then select the path having the highest proba-
bility. However, we notice that we can make a dramatic saving in computational cost
as follows. Suppose that for each path we evaluate its probability by summing up
products of transition and emission probabilities as we work our way forward along
each path through the lattice. Consider a particular time step n and a particular state
k at that time step. There will be many possible paths converging on the correspond-
ing node in the lattice diagram. However, we need only retain that particular path
that so far has the highest probability. Because there are K states at time step n, we
need to keep track of K such paths. At time step n + 1, there will be K^2 possible
paths to consider, comprising K possible paths leading out of each of the K current
states, but again we need only retain K of these, corresponding to the best path for
each state at time n + 1. When we reach the final time step N we will discover which
state corresponds to the overall most probable path. Because there is a unique path
coming into that state we can trace the path back to step N − 1 to see what state it
occupied at that time, and so on back through the lattice to the state at n = 1.
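
To make the description above concrete, the following minimal sketch implements this forward-maximization-and-backtracking procedure in the log domain for an HMM with discrete emissions. It is not code from the text; the names pi, A, B, and obs are assumptions about how the initial-state, transition, and emission probabilities are stored.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable hidden-state path for a discrete-emission HMM (a minimal sketch).

    pi  : (K,)   initial state probabilities p(z_1)
    A   : (K, K) transition probabilities, A[j, k] = p(z_n = k | z_{n-1} = j)
    B   : (K, V) emission probabilities,   B[k, v] = p(x_n = v | z_n = k)
    obs : length-N sequence of observed symbol indices
    """
    K, N = len(pi), len(obs)
    omega = np.full((N, K), -np.inf)       # log-probability of the best path ending in state k at step n
    backptr = np.zeros((N, K), dtype=int)  # best predecessor of state k at step n

    omega[0] = np.log(pi) + np.log(B[:, obs[0]])
    for n in range(1, N):
        for k in range(K):
            # Of the K paths entering state k, keep only the most probable one.
            scores = omega[n - 1] + np.log(A[:, k]) + np.log(B[k, obs[n]])
            backptr[n, k] = np.argmax(scores)
            omega[n, k] = scores[backptr[n, k]]

    # Backtrack from the most probable final state to recover the full path.
    path = [int(np.argmax(omega[-1]))]
    for n in range(N - 1, 0, -1):
        path.append(int(backptr[n, path[-1]]))
    return path[::-1]
```

Working in the log domain avoids numerical underflow for long sequences, and storing only the K back-pointers per step is precisely what reduces the cost from exponential to O(NK^2).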


13.2.6 Extensions of the hidden Markov model


The basic hidden Markov model, along with the standard training algorithm
based on maximum likelihood, has been extended in numerous ways to meet the
requirements of particular applications. Here we discuss a few of the more important
examples.
We see from the digits example in Figure 13.11 that hidden Markov models can
be quite poor generative models for the data, because many of the synthetic digits
look quite unrepresentative of the training data. If the goal is sequence classifica-
tion, there can be significant benefit in determining the parameters of hidden Markov
models using discriminative rather than maximum likelihood techniques. Suppose
we have a training set of R observation sequences X_r, where r = 1, ..., R, each of
which is labelled according to its class m, where m = 1, ..., M. For each class, we
have a separate hidden Markov model with its own parameters θ_m, and we treat the
problem of determining the parameter values as a standard classification problem in
which we optimize the cross-entropy


\sum_{r=1}^{R} \ln p(m_r | \mathbf{X}_r). \qquad (13.72)
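
As a small illustration (not from the text), the sketch below evaluates this objective given the per-class sequence log likelihoods ln p(X_r | θ_m), as would be produced by a forward-algorithm pass of each class model over each sequence, together with the log class priors; the posteriors p(m_r | X_r) are obtained by Bayes' theorem, written out explicitly in (13.73) below. The function name and array layout are assumptions made for illustration.

```python
import numpy as np
from scipy.special import logsumexp

def cross_entropy_objective(log_lik, log_prior, labels):
    """Evaluate (13.72): sum over r of ln p(m_r | X_r).

    log_lik   : (R, M) array, log_lik[r, m] = ln p(X_r | theta_m)
                (e.g. from the forward algorithm run for each class model)
    log_prior : (M,)   array of log class priors ln p(m)
    labels    : (R,)   integer array of class labels m_r
    """
    log_joint = log_lik + log_prior                                      # ln p(X_r|theta_m) + ln p(m)
    log_post = log_joint - logsumexp(log_joint, axis=1, keepdims=True)   # Bayes' theorem
    return float(log_post[np.arange(len(labels)), labels].sum())         # sum_r ln p(m_r | X_r)
```

For example, with dummy values `log_lik = np.zeros((5, 3))`, uniform priors `log_prior = np.log(np.full(3, 1/3))`, and any labels, the objective equals 5 ln(1/3), since every posterior is then uniform.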

Using Bayes’ theorem this can be expressed in terms of the sequence probabilities
associated with the hidden Markov models


\sum_{r=1}^{R} \ln \left\{ \frac{p(\mathbf{X}_r | \boldsymbol{\theta}_r)\, p(m_r)}{\sum_{l=1}^{M} p(\mathbf{X}_r | \boldsymbol{\theta}_l)\, p(l_r)} \right\} \qquad (13.73)

where p(m) is the prior probability of class m. Optimization of this cost function
is more complex than for maximum likelihood (Kapadia, 1998), and in particular
