10.2. Illustration: Variational Mixture of Gaussians
Indeed, these singularities are removed if we simply introduce a prior and then use a
MAP estimate instead of maximum likelihood. Furthermore, there is no over-fitting
if we choose a large number $K$ of components in the mixture, as we saw in
Figure 10.6. Finally, the variational treatment opens up the possibility of determining
the optimal number of components in the mixture without resorting to techniques
such as cross validation (see Section 10.2.4).
10.2.2 Variational lower bound
We can also straightforwardly evaluate the lower bound (10.3) for this model.
In practice, it is useful to be able to monitor the bound during the re-estimation in
order to test for convergence. It can also provide a valuable check on both the
mathematical expressions for the solutions and their software implementation, because at
each step of the iterative re-estimation procedure the value of this bound should not
decrease. We can take this a stage further to provide a deeper test of the correctness
of both the mathematical derivation of the update equations and of their software
implementation by using finite differences to check that each update does indeed give
a (constrained) maximum of the bound (Svensén and Bishop, 2004).
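As a minimal sketch of the monitoring described above, the following helper scans a list of bound values recorded after each re-estimation step. The function name `check_bound_trace`, its tolerances, and the example trace are hypothetical and chosen only for illustration; they are not taken from the text.

```python
def check_bound_trace(bounds, decrease_tol=1e-8, converge_tol=1e-6):
    """Scan lower-bound values recorded after each re-estimation step.

    Returns the index of the first step whose improvement is smaller
    than converge_tol, or None if no such step exists. Raises ValueError
    if the bound ever decreases by more than decrease_tol, which would
    point to an error in the update equations or in their implementation.
    """
    for i in range(1, len(bounds)):
        change = bounds[i] - bounds[i - 1]
        if change < -decrease_tol:
            raise ValueError(f"bound decreased by {-change:.3e} at step {i}")
        if change < converge_tol:
            return i
    return None


# Illustrative (made-up) trace of bound values.
trace = [-1250.0, -1100.5, -1080.2, -1079.9, -1079.9]
print(check_bound_trace(trace))  # prints 4
```

The small `decrease_tol` allows for floating-point round-off, so that only a genuine decrease of the bound is reported as an error.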
For the variational mixture of Gaussians, the lower bound (10.3) is given by
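\begin{align}
\mathcal{L} &= \sum_{\mathbf{Z}} \iiint q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})
  \ln\left\{ \frac{p(\mathbf{X},\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})}
                  {q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})} \right\}
  \mathrm{d}\boldsymbol{\pi}\,\mathrm{d}\boldsymbol{\mu}\,\mathrm{d}\boldsymbol{\Lambda} \notag \\
&= \mathbb{E}[\ln p(\mathbf{X},\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})]
   - \mathbb{E}[\ln q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})] \notag \\
&= \mathbb{E}[\ln p(\mathbf{X}|\mathbf{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})]
   + \mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})]
   + \mathbb{E}[\ln p(\boldsymbol{\pi})]
   + \mathbb{E}[\ln p(\boldsymbol{\mu},\boldsymbol{\Lambda})] \notag \\
&\quad - \mathbb{E}[\ln q(\mathbf{Z})]
   - \mathbb{E}[\ln q(\boldsymbol{\pi})]
   - \mathbb{E}[\ln q(\boldsymbol{\mu},\boldsymbol{\Lambda})] \tag{10.70}
\end{align}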
where, to keep the notation uncluttered, we have omitted the $\star$ superscript on the
$q$ distributions, along with the subscripts on the expectation operators, because each
expectation is taken with respect to all of the random variables in its argument. The
various terms in the bound are easily evaluated (Exercise 10.16) to give the following results
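\begin{equation}
\begin{aligned}
\mathbb{E}[\ln p(\mathbf{X}|\mathbf{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})]
 = \frac{1}{2}\sum_{k=1}^{K} N_k \Big\{ & \ln\widetilde{\Lambda}_k - D\beta_k^{-1}
   - \nu_k \mathrm{Tr}(\mathbf{S}_k\mathbf{W}_k) \\
 & - \nu_k(\overline{\mathbf{x}}_k-\mathbf{m}_k)^{\mathrm{T}}\mathbf{W}_k(\overline{\mathbf{x}}_k-\mathbf{m}_k)
   - D\ln(2\pi) \Big\}
\end{aligned}
\tag{10.71}
\end{equation}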
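The expression (10.71) can be evaluated term by term once the variational parameters are available. The following is a sketch in plain Python, assuming that the expected log determinants $\ln\widetilde{\Lambda}_k$ have already been computed elsewhere and are passed in. The function name `expected_log_likelihood` and the nested-list data layout are illustrative choices, not part of the text.

```python
import math


def expected_log_likelihood(N, ln_lam, beta, nu, S, W, xbar, m):
    """Evaluate E[ln p(X | Z, mu, Lambda)] following (10.71).

    All arguments are length-K sequences, one entry per component:
      N[k]            effective count N_k
      ln_lam[k]       expected log determinant, ln Lambda-tilde_k
      beta[k], nu[k]  scalars beta_k and nu_k
      S[k], W[k]      D x D matrices stored as nested lists
      xbar[k], m[k]   length-D vectors stored as lists
    """
    total = 0.0
    for k in range(len(N)):
        D = len(m[k])
        # Tr(S_k W_k) = sum over i, j of S_k[i][j] * W_k[j][i]
        trace_sw = sum(S[k][i][j] * W[k][j][i]
                       for i in range(D) for j in range(D))
        # Quadratic form (xbar_k - m_k)^T W_k (xbar_k - m_k)
        d = [xbar[k][i] - m[k][i] for i in range(D)]
        quad = sum(d[i] * W[k][i][j] * d[j]
                   for i in range(D) for j in range(D))
        total += N[k] * (ln_lam[k] - D / beta[k] - nu[k] * trace_sw
                         - nu[k] * quad - D * math.log(2.0 * math.pi))
    return 0.5 * total


# Illustrative call with a single one-dimensional component.
value = expected_log_likelihood(
    N=[2.0], ln_lam=[0.0], beta=[1.0], nu=[1.0],
    S=[[[0.0]]], W=[[[1.0]]], xbar=[[0.0]], m=[[0.0]],
)
print(value)
```

Nested lists keep the sketch free of third-party dependencies. For realistic values of $D$ and $K$, an array library would be the more natural choice.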
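\begin{equation}
\mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})] = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\ln\widetilde{\pi}_k \tag{10.72}
\end{equation}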
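The double sum in (10.72) is straightforward to compute from the responsibilities. The sketch below assumes the expected log mixing coefficients $\ln\widetilde{\pi}_k$ are supplied as inputs. The function name `expected_log_assignments` and the example values are hypothetical.

```python
def expected_log_assignments(r, ln_pi):
    """Evaluate E[ln p(Z | pi)] following (10.72).

    r      N x K responsibilities r_nk, stored as a list of rows
    ln_pi  length-K list of expected log mixing coefficients
    """
    return sum(r_nk * ln_pi[k] for row in r for k, r_nk in enumerate(row))


# Illustrative values: two data points, two components.
r = [[1.0, 0.0],
     [0.5, 0.5]]
ln_pi = [-1.0, -2.0]
print(expected_log_assignments(r, ln_pi))  # prints -2.5
```

If the effective counts are defined as the column sums of the responsibilities, $N_k = \sum_n r_{nk}$ (an assumption taken from outside this section), the same quantity equals $\sum_k N_k \ln\widetilde{\pi}_k$, which avoids the loop over data points.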
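\begin{equation}
\mathbb{E}[\ln p(\boldsymbol{\pi})] = \ln C(\boldsymbol{\alpha}_0) + (\alpha_0 - 1)\sum_{k=1}^{K}\ln\widetilde{\pi}_k \tag{10.73}
\end{equation}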
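A sketch of (10.73) follows. It assumes a symmetric Dirichlet prior in which every component shares the same parameter $\alpha_0$, so that the standard Dirichlet normalization gives $\ln C(\boldsymbol{\alpha}_0) = \ln\Gamma(K\alpha_0) - K\ln\Gamma(\alpha_0)$. That form of the normalizer is an assumption brought in from outside this section, and the function name `expected_log_prior_pi` is illustrative.

```python
import math


def expected_log_prior_pi(alpha0, ln_pi):
    """Evaluate E[ln p(pi)] following (10.73) for a symmetric Dirichlet prior.

    alpha0  common prior parameter shared by all K components (assumption)
    ln_pi   length-K list of expected log mixing coefficients
    """
    K = len(ln_pi)
    # Log normalizer of a symmetric Dirichlet:
    # ln C = ln Gamma(K * alpha0) - K * ln Gamma(alpha0)
    ln_C = math.lgamma(K * alpha0) - K * math.lgamma(alpha0)
    return ln_C + (alpha0 - 1.0) * sum(ln_pi)


# With alpha0 = 1 the prior is uniform on the simplex, so the second
# term vanishes and only the log normalizer ln Gamma(K) remains.
print(expected_log_prior_pi(1.0, [-1.0, -2.0, -3.0]))
```

Working with `math.lgamma` rather than the Gamma function itself avoids overflow when $K\alpha_0$ is large.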