Figure 13.18 Example of an input-output hidden Markov model. In this case, both the emission probabilities and the transition probabilities depend on the values of a sequence of observations u_1, ..., u_N. [Figure: graphical model with input nodes u_{n−1}, u_n, u_{n+1}, latent nodes z_{n−1}, z_n, z_{n+1}, and output nodes x_{n−1}, x_n, x_{n+1}.]
the two preceding observed variables as well as on the hidden state. Although this
graph looks messy, we can again appeal to d-separation to see that in fact it still has
a simple probabilistic structure. In particular, if we imagine conditioning on z_n we
see that, as with the standard HMM, the values of z_{n−1} and z_{n+1} are independent,
corresponding to the conditional independence property (13.5). This is easily verified
by noting that every path from node z_{n−1} to node z_{n+1} passes through at least
one observed node that is head-to-tail with respect to that path. As a consequence,
we can again use a forward-backward recursion in the E step of the EM algorithm to
determine the posterior distributions of the latent variables in a computational time
that is linear in the length of the chain. Similarly, the M step involves only a minor
modification of the standard M-step equations. In the case of Gaussian emission
densities this involves estimating the parameters using the standard linear regression
equations, discussed in Chapter 3.
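As a concrete illustration of this linear-time E step, the following is a minimal Python/NumPy sketch (not taken from the text) of the scaled alpha-beta recursions that implement the forward-backward pass; the function and variable names are illustrative, and allowing a different transition matrix at each step is one simple way to accommodate variants, such as the autoregressive HMM, in which the transition distribution also depends on earlier observations.

```python
import numpy as np

def forward_backward(pi, A, B):
    """Scaled alpha-beta recursions for a chain of length N with K states.

    pi : (K,)        initial state distribution p(z_1)
    A  : (N-1, K, K) transition matrices, A[n, i, j] = p(z_{n+2}=j | z_{n+1}=i);
                     for a standard HMM every slice is identical
    B  : (N, K)      emission likelihoods, B[n, k] = p(x_n | z_n = k, ...)
    Returns gamma[n, k] = p(z_n = k | x_1, ..., x_N).
    """
    N, K = B.shape
    alpha = np.zeros((N, K))
    beta = np.zeros((N, K))
    c = np.zeros(N)                       # scaling factors to avoid underflow

    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for n in range(1, N):                 # forward pass: O(N K^2)
        alpha[n] = B[n] * (alpha[n - 1] @ A[n - 1])
        c[n] = alpha[n].sum()
        alpha[n] /= c[n]

    beta[-1] = 1.0
    for n in range(N - 2, -1, -1):        # backward pass
        beta[n] = A[n] @ (B[n + 1] * beta[n + 1]) / c[n + 1]

    return alpha * beta                   # posterior marginals gamma
```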
We have seen that the autoregressive HMM appears as a natural extension of the
standard HMM when viewed as a graphical model. In fact the probabilistic graphical
modelling viewpoint motivates a plethora of different graphical structures based on
the HMM. Another example is the input-output hidden Markov model (Bengio and
Frasconi, 1995), in which we have a sequence of observed variables u_1, ..., u_N, in
addition to the output variables x_1, ..., x_N, whose values influence either the distribution
of latent variables or output variables, or both. An example is shown in
Figure 13.18. This extends the HMM framework to the domain of supervised learn-
ing for sequential data. It is again easy to show, through the use of the d-separation
criterion, that the Markov property (13.5) for the chain of latent variables still holds.
To verify this, simply note that there is only one path from node z_{n−1} to node z_{n+1},
and this is head-to-tail with respect to the observed node z_n. This conditional independence
property again allows the formulation of a computationally efficient learning
algorithm. In particular, we can determine the parameters θ of the model by
maximizing the likelihood function L(θ) = p(X|U, θ), where U is a matrix whose
rows are given by u_n^T. As a consequence of the conditional independence property
(13.5), this likelihood function can be maximized efficiently using an EM algorithm
in which the E step involves forward and backward recursions (Exercise 13.18).
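To make the input dependence concrete, here is a hedged sketch (again not from the text) of how the E step of such a model might be assembled and passed to the forward_backward routine above. The softmax transition parameterization, the linear-Gaussian emission model, and all names (io_hmm_e_step, W_trans, W_emit) are illustrative assumptions, not the particular construction of Bengio and Frasconi (1995).

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def io_hmm_e_step(U, X, pi, W_trans, W_emit, sigma2):
    """E step for a toy input-output HMM with K states (illustrative only).

    U : (N, D) input vectors u_n,   X : (N,) scalar outputs x_n
    W_trans : (K, K, D) logits for p(z_{n+1} = j | z_n = i, u_{n+1})
    W_emit  : (K, D)    per-state regression weights for p(x_n | z_n, u_n)
    """
    N, D = U.shape
    # Input-dependent transition matrices A[n, i, j] = p(z_{n+2}=j | z_{n+1}=i, u_{n+2})
    A = softmax(np.einsum('kjd,nd->nkj', W_trans, U[1:]))
    # Input-dependent emission likelihoods B[n, k] = N(x_n | w_k^T u_n, sigma^2)
    mean = U @ W_emit.T                                    # (N, K)
    B = np.exp(-0.5 * (X[:, None] - mean) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    # Reuse the forward-backward sketch above to obtain the posteriors gamma
    return forward_backward(pi, A, B)
```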
Another variant of the HMM worthy of mention is the factorial hidden Markov
model (Ghahramani and Jordan, 1997), in which there are multiple independent