484 10. APPROXIMATE INFERENCE
Figure 10.7 Plot of the variational lower bound L versus the number K of components
in the Gaussian mixture model, for the Old Faithful data, showing a distinct peak at
K = 2 components. For each value of K, the model is trained from 100 different random
starts, and the results shown as ‘+’ symbols plotted with small random horizontal
perturbations so that they can be distinguished. Note that some solutions find
suboptimal local maxima, but that this happens infrequently.
[Plot omitted. Horizontal axis: K, from 1 to 6. Vertical axis: p(D|K).]
parameter values. We have seen in Figure 10.2 that if the true posterior distribution
is multimodal, variational inference based on the minimization of KL(q‖p) will tend
to approximate the distribution in the neighbourhood of one of the modes and ignore
the others. Again, because equivalent modes have equivalent predictive densities,
this is of no concern provided we are considering a model having a specific number
K of components. If, however, we wish to compare different values of K, then we
need to take account of this multimodality. A simple approximate solution is to add
a term ln K! onto the lower bound when used for model comparison and averaging (Exercise 10.22).
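As a minimal sketch of this correction, the snippet below adds ln K! to a lower bound and then converts the corrected bounds for several values of K into approximate model weights, assuming a uniform prior over K and treating the exponentiated bound as a stand-in for the model evidence. The function names and the numerical bound values are hypothetical, chosen only for illustration.

```python
import math

def corrected_bound(lower_bound, K):
    """Add ln K! to the variational lower bound of a K-component mixture."""
    # lgamma(K + 1) equals ln K! and avoids overflow for large K
    return lower_bound + math.lgamma(K + 1)

def model_weights(bounds_by_K):
    """Approximate weights over K, assuming a uniform prior p(K)."""
    scores = {K: corrected_bound(L, K) for K, L in bounds_by_K.items()}
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability
    unnorm = {K: math.exp(s - top) for K, s in scores.items()}
    total = sum(unnorm.values())
    return {K: w / total for K, w in unnorm.items()}

# Hypothetical lower-bound values, invented purely for illustration
example_bounds = {1: -1250.0, 2: -1102.0, 3: -1105.5}
weights = model_weights(example_bounds)
```

Because the bound only approximates the log evidence, weights obtained this way should be read as approximate.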
Figure 10.7 shows a plot of the lower bound, including the multimodality factor,
versus the number K of components for the Old Faithful data set. It is worth
emphasizing once again that maximum likelihood would lead to values of the likelihood
function that increase monotonically with K (assuming the singular solutions
have been avoided, and discounting the effects of local maxima) and so cannot be
used to determine an appropriate model complexity. By contrast, Bayesian inference
automatically makes the trade-off between model complexity and fitting the data (Section 3.4).
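The selection procedure behind a plot such as Figure 10.7 can be sketched as follows: for each K, keep the best lower bound found over the random restarts, add the ln K! multimodality term, and choose the K with the largest corrected bound. This is only an illustrative sketch; the helper name `select_num_components` and the restart values are hypothetical and are not taken from the Old Faithful experiment.

```python
import math

def select_num_components(bounds_by_K):
    """Return (best_K, corrected best bound per K) from restart results."""
    best = {}
    for K, runs in bounds_by_K.items():
        # The highest bound among restarts is kept; lower values
        # correspond to suboptimal local maxima of the bound
        best[K] = max(runs) + math.lgamma(K + 1)  # add ln K!
    best_K = max(best, key=best.get)
    return best_K, best

# Hypothetical lower bounds from three restarts per K (illustration only)
restart_bounds = {
    1: [-1250.0, -1250.0, -1250.0],
    2: [-1102.0, -1102.0, -1130.4],
    3: [-1105.5, -1121.0, -1105.5],
}
best_K, best = select_num_components(restart_bounds)
```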
This approach to the determination of K requires that a range of models having
different K values be trained and compared. An alternative approach to determining
a suitable value for K is to treat the mixing coefficients π as parameters and make
point estimates of their values by maximizing the lower bound (Corduneanu and
Bishop, 2001) with respect to π instead of maintaining a probability distribution
over them as in the fully Bayesian approach (Exercise 10.23). This leads to the re-estimation equation
$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} r_{nk} \tag{10.83}$$
and this maximization is interleaved with the variational updates for the q distribution
over the remaining parameters. Components that provide insufficient contribution