
and make use of the definitions of $\gamma$ and $\xi$, we obtain

$$
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}) =
\sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
+ \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
+ \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(\mathbf{x}_n \mid \boldsymbol{\phi}_k).
\tag{13.17}
$$
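As a concrete illustration, here is a minimal NumPy sketch (not from the book) of evaluating $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}})$ in (13.17), assuming the E-step quantities $\gamma(z_{nk})$ and $\xi(z_{n-1,j}, z_{nk})$ and the log emission densities have already been computed and stored as arrays; the function name `q_function` and the array layout are assumptions made for illustration only.

```python
import numpy as np

def q_function(gamma, xi, log_pi, log_A, log_emission):
    """Evaluate Q(theta, theta_old) as in (13.17).

    gamma        : (N, K) array, gamma[n, k] = gamma(z_nk)
    xi           : (N-1, K, K) array of xi(z_{n-1,j}, z_nk) over n = 2, ..., N
    log_pi       : (K,) array of ln(pi_k)
    log_A        : (K, K) array of ln(A_jk)
    log_emission : (N, K) array, log_emission[n, k] = ln p(x_n | phi_k)
    """
    initial_term = np.sum(gamma[0] * log_pi)        # first sum in (13.17)
    transition_term = np.sum(xi * log_A)            # triple sum over n, j, k
    emission_term = np.sum(gamma * log_emission)    # double sum over n, k
    return initial_term + transition_term + emission_term
```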

The goal of the E step will be to evaluate the quantities $\gamma(\mathbf{z}_n)$ and $\xi(\mathbf{z}_{n-1}, \mathbf{z}_n)$ efficiently, and we shall discuss this in detail shortly.

In the M step, we maximize $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}})$ with respect to the parameters $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\phi}\}$, in which we treat $\gamma(\mathbf{z}_n)$ and $\xi(\mathbf{z}_{n-1}, \mathbf{z}_n)$ as constant. Maximization with respect to $\boldsymbol{\pi}$ and $\mathbf{A}$ is easily achieved using appropriate Lagrange multipliers (Exercise 13.5), with the results


$$
\pi_k = \frac{\gamma(z_{1k})}{\displaystyle\sum_{j=1}^{K} \gamma(z_{1j})}
\tag{13.18}
$$

$$
A_{jk} = \frac{\displaystyle\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\displaystyle\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}.
\tag{13.19}
$$
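These M-step updates are a direct transcription of (13.18) and (13.19); here is a minimal sketch, assuming $\gamma$ and $\xi$ are stored as in the previous snippet (the function name `m_step_pi_A` is again hypothetical).

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """Update pi and A via (13.18) and (13.19).

    gamma : (N, K) array of gamma(z_nk)
    xi    : (N-1, K, K) array of xi(z_{n-1,j}, z_nk) over n = 2, ..., N
    """
    pi = gamma[0] / np.sum(gamma[0])                   # (13.18)
    numer = np.sum(xi, axis=0)                         # sum over n = 2..N
    A = numer / np.sum(numer, axis=1, keepdims=True)   # row-normalize: (13.19)
    return pi, A
```

Each row of $\mathbf{A}$ is normalized by the total expected number of transitions out of state $j$, which is exactly the denominator in (13.19).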

The EM algorithm must be initialized by choosing starting values for $\boldsymbol{\pi}$ and $\mathbf{A}$, which should of course respect the summation constraints associated with their probabilistic interpretation. Note that any elements of $\boldsymbol{\pi}$ or $\mathbf{A}$ that are set to zero initially will remain zero in subsequent EM updates (Exercise 13.6). A typical initialization procedure would involve selecting random starting values for these parameters subject to the summation and non-negativity constraints. Note that no particular modification to the EM results is required for the case of left-to-right models, beyond choosing initial values for the elements $A_{jk}$ in which the appropriate elements are set to zero, because these will remain zero throughout.
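A minimal sketch of such an initialization, assuming forbidden transitions (as in a left-to-right model) are described by a 0/1 mask; the function name and the mask convention are illustrative assumptions.

```python
import numpy as np

def init_pi_A(K, A_mask=None, seed=None):
    """Random pi and A subject to non-negativity and summation constraints.

    A_mask : optional (K, K) 0/1 array; zeros mark transitions fixed at zero
             (e.g. a left-to-right structure), which EM will then preserve.
    """
    rng = np.random.default_rng(seed)
    pi = rng.random(K)
    pi /= pi.sum()                          # sum_k pi_k = 1
    A = rng.random((K, K))
    if A_mask is not None:
        A = A * A_mask                      # zero out forbidden transitions
    A /= A.sum(axis=1, keepdims=True)       # each row of A sums to 1
    return pi, A
```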
To maximize $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}})$ with respect to $\boldsymbol{\phi}_k$, we notice that only the final term in (13.17) depends on $\boldsymbol{\phi}_k$, and furthermore this term has exactly the same form as the data-dependent term in the corresponding function for a standard mixture distribution for i.i.d. data, as can be seen by comparison with (9.40) for the case of a Gaussian mixture. Here the quantities $\gamma(z_{nk})$ are playing the role of the responsibilities. If the parameters $\boldsymbol{\phi}_k$ are independent for the different components, then this term decouples into a sum of terms, one for each value of $k$, each of which can be maximized independently. We are then simply maximizing the weighted log likelihood function for the emission density $p(\mathbf{x} \mid \boldsymbol{\phi}_k)$ with weights $\gamma(z_{nk})$. Here we shall suppose that this maximization can be done efficiently. For instance, in the case of Gaussian emission densities $p(\mathbf{x} \mid \boldsymbol{\phi}_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, the weighted maximum likelihood solutions are the $\gamma$-weighted sample mean and covariance, just as in the Gaussian mixture M step.
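For the Gaussian case this weighted maximization has a closed form; here is a minimal sketch (the function name and array layout are assumptions, not from the book).

```python
import numpy as np

def m_step_gaussian(X, gamma):
    """Weighted ML estimates of Gaussian emission parameters.

    X     : (N, D) observations
    gamma : (N, K) responsibilities gamma(z_nk)
    Returns (K, D) means and (K, D, D) covariances.
    """
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                         # effective counts per state
    mu = (gamma.T @ X) / Nk[:, None]               # weighted means
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]  # weighted cov
    return mu, Sigma
```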
