10.2. Illustration: Variational Mixture of Gaussians

to explaining the data will have their mixing coefficients driven to zero during the
optimization, and so they are effectively removed from the model through automatic
relevance determination. This allows us to make a single training run in which we
start with a relatively large initial value of K, and allow surplus components to be
pruned out of the model. The origins of the sparsity that arises when optimizing with
respect to hyperparameters are discussed in detail in the context of the relevance
vector machine (Section 7.2.2).
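
As a concrete illustration of this pruning behaviour, the following sketch fits a variational Gaussian mixture with a deliberately generous number of components using scikit-learn's BayesianGaussianMixture (the choice of library, the synthetic data, and the 0.01 threshold are assumptions made purely for illustration); surplus components end up with expected mixing coefficients close to zero and can be discarded.

# Sketch: automatic pruning of surplus mixture components (illustrative only).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from three well-separated Gaussians.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2))
               for c in ([-4.0, 0.0], [0.0, 4.0], [4.0, 0.0])])

# Start with a relatively large K; the variational optimization drives the
# mixing coefficients of unneeded components towards zero.
vb_gmm = BayesianGaussianMixture(n_components=10,
                                 weight_concentration_prior=1e-3,
                                 max_iter=500, random_state=0).fit(X)

active = vb_gmm.weights_ > 1e-2   # components retaining appreciable weight
print("expected mixing coefficients:", np.round(vb_gmm.weights_, 3))
print("effective number of components:", int(active.sum()))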


10.2.5 Induced factorizations


In deriving these variational update equations for the Gaussian mixture model,
we assumed a particular factorization of the variational posterior distribution given
by (10.42). However, the optimal solutions for the various factors exhibit additional
factorizations. In particular, the solution for q(μ,Λ) is given by the product of an
independent distribution q(μk,Λk) over each of the components k of the mixture,
whereas the variational posterior distribution q(Z) over the latent variables, given
by (10.48), factorizes into an independent distribution q(zn) for each observation n
(note that it does not further factorize with respect to k because, for each value of n,
the znk are constrained to sum to one over k). These additional factorizations are a
consequence of the interaction between the assumed factorization and the conditional
independence properties of the true distribution, as characterized by the directed
graph in Figure 10.5.
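
In an implementation, these induced factorizations mean that the variational posterior can be held as K separate sets of Gaussian-Wishart parameters together with an N x K matrix of responsibilities whose rows sum to one. The following is a minimal sketch of such a representation (the array names and dimensions are illustrative assumptions, not taken from the text, and no update equations are implemented):

# Data structures implied by the induced factorizations (illustrative only).
import numpy as np

N, K, D = 500, 6, 2   # observations, mixture components, data dimensionality

# q(Z) factorizes over n: one categorical factor per observation, stored as an
# N x K matrix of responsibilities whose rows sum to one over k.
r = np.full((N, K), 1.0 / K)

# q(mu, Lambda) factorizes over k: K independent Gaussian-Wishart factors, so
# we keep per-component parameter arrays rather than one joint distribution.
m = np.zeros((K, D))                  # means of the Gaussian factors
beta = np.ones(K)                     # precision scalings of the Gaussian factors
W = np.tile(np.eye(D), (K, 1, 1))     # Wishart scale matrices
nu = np.full(K, float(D))             # Wishart degrees of freedom

assert np.allclose(r.sum(axis=1), 1.0)   # the constraint sum_k r[n, k] = 1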
We shall refer to these additional factorizations as induced factorizations because
they arise from an interaction between the factorization assumed in the variational
posterior distribution and the conditional independence properties of the true joint
distribution. In a numerical implementation of the variational approach it is important
to take account of such additional factorizations. For instance, it would be very
inefficient to maintain a full precision matrix for the Gaussian distribution over a set
of variables if the optimal form for that distribution always had a diagonal precision
matrix (corresponding to a factorization with respect to the individual variables
described by that Gaussian).
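
The saving is easy to quantify: if the optimal factor over a D-dimensional set of variables is known to be diagonal, only D precision values need to be stored and updated rather than the D(D+1)/2 distinct entries of a full symmetric matrix. A tiny numerical illustration (the value of D is an arbitrary assumption):

# Storage when a factor is known to have diagonal precision (illustrative).
import numpy as np

D = 1000
full_precision = np.eye(D)     # D*D entries if the induced factorization is ignored
diag_precision = np.ones(D)    # D entries when the factor is known to be diagonal
print(full_precision.size, "entries versus", diag_precision.size)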
Such induced factorizations can easily be detected using a simple graphical test
based on d-separation, as follows. We partition the latent variables into three disjoint
groups A, B, C, and suppose that we assume a factorization between C and the
remaining latent variables, so that

q(A,B,C) = q(A,B) q(C). (10.84)

Using the general result (10.9), together with the product rule for probabilities, we
see that the optimal solution for q(A,B) is given by

ln q(A,B) = E_C[ln p(X,A,B,C)] + const
          = E_C[ln p(A,B|X,C)] + const, (10.85)

where the second line follows because ln p(X,A,B,C) = ln p(A,B|X,C) + ln p(X,C)
and the term E_C[ln p(X,C)] does not depend on A or B, so it is absorbed into the
additive constant.

We now ask whether this resulting solution will factorize between A and B, in
other words whether q(A,B) = q(A) q(B). This will happen if, and only if,
ln p(A,B|X,C) = ln p(A|X,C) + ln p(B|X,C), that is, if the conditional independence
relation

A ⊥⊥ B | X, C (10.86)
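
As a concrete example of applying this test, the sketch below encodes a two-observation fragment of the directed graph of Figure 10.5 and checks whether z1 ⊥⊥ z2 given X together with C = {π, μ, Λ}, which is precisely the induced factorization of q(Z) noted earlier. It assumes a NetworkX release that provides the d_separated function (available since version 2.4; newer releases expose the same check as is_d_separator).

# Graphical check for an induced factorization via d-separation (illustrative).
import networkx as nx

# Two-observation fragment of the directed graph of Figure 10.5:
# pi -> z_n and (z_n, mu, Lambda) -> x_n for n = 1, 2.
G = nx.DiGraph()
for n in (1, 2):
    G.add_edge("pi", f"z{n}")
    G.add_edge(f"z{n}", f"x{n}")
    G.add_edge("mu", f"x{n}")
    G.add_edge("Lambda", f"x{n}")

A = {"z1"}
B = {"z2"}
observed = {"x1", "x2", "pi", "mu", "Lambda"}   # X together with C

# Prints True: z1 and z2 are d-separated given X and C, so the optimal q(Z)
# factorizes into independent factors q(z_n).
print(nx.d_separated(G, A, B, observed))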