Pattern Recognition and Machine Learning

10.2. Illustration: Variational Mixture of Gaussians 481

Indeed, these singularities are removed if we simply introduce a prior and then use a
MAP estimate instead of maximum likelihood. Furthermore, there is no over-fitting
if we choose a large number K of components in the mixture, as we saw in
Figure 10.6. Finally, the variational treatment opens up the possibility of determining
the optimal number of components in the mixture without resorting to techniques
such as cross validation (Section 10.2.4).

10.2.2 Variational lower bound

We can also straightforwardly evaluate the lower bound (10.3) for this model. In practice, it is useful to be able to monitor the bound during the re-estimation in order to test for convergence. It can also provide a valuable check on both the mathematical expressions for the solutions and their software implementation, because at each step of the iterative re-estimation procedure the value of this bound should not decrease. We can take this a stage further to provide a deeper test of the correctness of both the mathematical derivation of the update equations and of their software implementation by using finite differences to check that each update does indeed give a (constrained) maximum of the bound (Svensén and Bishop, 2004). For the variational mixture of Gaussians, the lower bound (10.3) is given by

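$$
\begin{aligned}
\mathcal{L} &= \sum_{\mathbf{Z}} \iiint q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}) \ln \left\{ \frac{p(\mathbf{X},\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})}{q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})} \right\} \mathrm{d}\boldsymbol{\pi}\,\mathrm{d}\boldsymbol{\mu}\,\mathrm{d}\boldsymbol{\Lambda} \\
&= \mathbb{E}[\ln p(\mathbf{X},\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})] - \mathbb{E}[\ln q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})] \\
&= \mathbb{E}[\ln p(\mathbf{X}|\mathbf{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})] + \mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})] + \mathbb{E}[\ln p(\boldsymbol{\pi})] + \mathbb{E}[\ln p(\boldsymbol{\mu},\boldsymbol{\Lambda})] \\
&\quad - \mathbb{E}[\ln q(\mathbf{Z})] - \mathbb{E}[\ln q(\boldsymbol{\pi})] - \mathbb{E}[\ln q(\boldsymbol{\mu},\boldsymbol{\Lambda})]
\end{aligned}
\tag{10.70}
$$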

where, to keep the notation uncluttered, we have omitted the superscript on the
q distributions, along with the subscripts on the expectation operators because each
expectation is taken with respect to all of the random variables in its argument. The
various terms in the bound are easily evaluated to give the following results (Exercise 10.16)

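$$
\mathbb{E}[\ln p(\mathbf{X}|\mathbf{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})] = \frac{1}{2} \sum_{k=1}^{K} N_k \left\{ \ln \widetilde{\Lambda}_k - D\beta_k^{-1} - \nu_k \mathrm{Tr}(\mathbf{S}_k \mathbf{W}_k) - \nu_k (\bar{\mathbf{x}}_k - \mathbf{m}_k)^{\mathrm{T}} \mathbf{W}_k (\bar{\mathbf{x}}_k - \mathbf{m}_k) - D \ln(2\pi) \right\}
\tag{10.71}
$$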

$$
\mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})] = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln \widetilde{\pi}_k
\tag{10.72}
$$

$$
\mathbb{E}[\ln p(\boldsymbol{\pi})] = \ln C(\boldsymbol{\alpha}_0) + (\alpha_0 - 1) \sum_{k=1}^{K} \ln \widetilde{\pi}_k
\tag{10.73}
$$
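As a rough illustration of how the first three terms, (10.71) to (10.73), might be evaluated numerically, and of how the resulting bound values could be checked for monotonicity during re-estimation, consider the following minimal sketch. It is not a definitive implementation. The function names (`log_dirichlet_norm`, `bound_terms`, `is_nondecreasing`) and the array layout are hypothetical. The sketch assumes that the quantities ln Λ̃_k and ln π̃_k have already been computed elsewhere and are passed in as arrays, that N_k is the sum over n of the responsibilities r_nk, and that the prior over π is a symmetric Dirichlet, so that ln C(α_0) = ln Γ(Kα_0) − K ln Γ(α_0).

```python
import math
import numpy as np


def log_dirichlet_norm(alpha0, K):
    """ln C(alpha_0), assuming a symmetric Dirichlet prior with K components."""
    return math.lgamma(K * alpha0) - K * math.lgamma(alpha0)


def bound_terms(r, xbar, S, m, W, beta, nu, ln_lambda_tilde, ln_pi_tilde, alpha0):
    """Evaluate the expectations (10.71), (10.72) and (10.73).

    Assumed shapes (hypothetical layout):
      r: (N, K) responsibilities; xbar, m: (K, D); S, W: (K, D, D);
      beta, nu, ln_lambda_tilde, ln_pi_tilde: (K,); alpha0: scalar.
    """
    _, K = r.shape
    D = xbar.shape[1]
    Nk = r.sum(axis=0)  # assumption: N_k = sum_n r_nk

    # (10.71): expected log likelihood of the data given Z, mu, Lambda
    e_lik = 0.0
    for k in range(K):
        diff = xbar[k] - m[k]
        quad = diff @ W[k] @ diff  # (xbar_k - m_k)^T W_k (xbar_k - m_k)
        e_lik += 0.5 * Nk[k] * (
            ln_lambda_tilde[k]
            - D / beta[k]
            - nu[k] * np.trace(S[k] @ W[k])
            - nu[k] * quad
            - D * math.log(2.0 * math.pi)
        )

    # (10.72): sum over n and k of r_nk * ln pi_tilde_k
    e_z = float(np.sum(r * ln_pi_tilde))

    # (10.73): ln C(alpha_0) + (alpha_0 - 1) * sum_k ln pi_tilde_k
    e_pi = log_dirichlet_norm(alpha0, K) + (alpha0 - 1.0) * float(np.sum(ln_pi_tilde))

    return float(e_lik), e_z, e_pi


def is_nondecreasing(bounds, tol=1e-9):
    """True if successive bound values never drop by more than tol."""
    return all(b >= a - tol for a, b in zip(bounds, bounds[1:]))
```

In an iterative fit, one could append the full value of the bound (all seven terms of (10.70), not only the three computed here) to a list after each re-estimation step and call `is_nondecreasing` on that list; a failure would suggest an error in either the derivation or the code. The tolerance `tol` is an arbitrary choice intended to absorb floating-point rounding.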

