Pattern Recognition and Machine Learning


of (10.35), and then subsequently determining the q(m) using (10.36). After normalization the resulting values for q(m) can be used for model selection or model averaging in the usual way.
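As a minimal sketch of this normalization step (the numerical values and variable names below are illustrative assumptions, not taken from the text), the unnormalized log values of q(m) can be normalized with a log-sum-exp before being used for selection or averaging:

```python
import numpy as np

# Hypothetical unnormalized log q(m) values for three candidate models
# (purely illustrative; in practice these come from the variational lower bound).
log_q_unnorm = np.array([-105.3, -102.1, -108.7])

# Normalize with log-sum-exp so that the resulting q(m) sum to one.
log_q = log_q_unnorm - np.logaddexp.reduce(log_q_unnorm)
q_m = np.exp(log_q)

selected = int(np.argmax(q_m))  # model selection: pick the most probable model
print(q_m, selected)            # model averaging would weight predictions by q_m
```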

10.2 Illustration: Variational Mixture of Gaussians


We now return to our discussion of the Gaussian mixture model and apply the variational inference machinery developed in the previous section. This will provide a good illustration of the application of variational methods and will also demonstrate how a Bayesian treatment elegantly resolves many of the difficulties associated with the maximum likelihood approach (Attias, 1999b). The reader is encouraged to work through this example in detail as it provides many insights into the practical application of variational methods. Many Bayesian models, corresponding to much more sophisticated distributions, can be solved by straightforward extensions and generalizations of this analysis.
Our starting point is the likelihood function for the Gaussian mixture model, illustrated by the graphical model in Figure 9.6. For each observation x_n we have a corresponding latent variable z_n comprising a 1-of-K binary vector with elements z_{nk} for k = 1, ..., K. As before we denote the observed data set by X = {x_1, ..., x_N}, and similarly we denote the latent variables by Z = {z_1, ..., z_N}. From (9.10) we can write down the conditional distribution of Z, given the mixing coefficients π, in the form

p(\mathbf{Z} | \boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}.    (10.37)
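As a quick numerical illustration of (10.37) (the array names and toy values here are my own assumptions), the log of this distribution for a 1-of-K coded Z is simply a sum of the selected log mixing coefficients:

```python
import numpy as np

def log_p_Z_given_pi(Z, pi):
    """log p(Z|pi) = sum_n sum_k z_nk * log(pi_k), with Z an (N, K) one-hot matrix."""
    return float(np.sum(Z * np.log(pi)))

# Toy example: N = 3 observations, K = 2 components
Z = np.array([[1, 0],
              [0, 1],
              [1, 0]])
pi = np.array([0.3, 0.7])
print(log_p_Z_given_pi(Z, pi))  # equals 2*log(0.3) + log(0.7)
```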

Similarly, from (9.11), we can write down the conditional distribution of the observed data vectors, given the latent variables and the component parameters

p(\mathbf{X} | \mathbf{Z}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}\left(\mathbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k^{-1}\right)^{z_{nk}}    (10.38)

where μ = {μ_k} and Λ = {Λ_k}. Note that we are working in terms of precision matrices rather than covariance matrices as this somewhat simplifies the mathematics.
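A minimal sketch of (10.38) in code (the function and array names, and the shapes, are assumptions of mine for illustration) that evaluates the Gaussian factors directly in terms of the precision matrices:

```python
import numpy as np

def log_gauss_precision(x, mu, Lam):
    """log N(x | mu, Lam^{-1}) parameterized by the precision matrix Lam."""
    D = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Lam)  # log |Lam|
    return 0.5 * (logdet - D * np.log(2.0 * np.pi) - diff @ Lam @ diff)

def log_p_X_given_Z(X, Z, mu, Lam):
    """log p(X | Z, mu, Lambda) = sum_n sum_k z_nk * log N(x_n | mu_k, Lambda_k^{-1}).
    X: (N, D) data, Z: (N, K) one-hot assignments, mu: (K, D), Lam: (K, D, D)."""
    N, K = Z.shape
    return sum(Z[n, k] * log_gauss_precision(X[n], mu[k], Lam[k])
               for n in range(N) for k in range(K))
```

Working with Λ_k directly means the log-determinant enters with a positive sign and no matrix inversion is needed inside the density, which is the kind of simplification alluded to above.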
Next we introduce priors over the parameters μ, Λ and π. The analysis is considerably simplified if we use conjugate prior distributions (Section 10.4.1). We therefore choose a Dirichlet distribution over the mixing coefficients π


p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} | \alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}    (10.39)

where by symmetry we have chosen the same parameter α_0 for each of the components, and C(α_0) is the normalization constant for the Dirichlet distribution.
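A small sketch of this symmetric Dirichlet prior (the value of α_0 and the function name are illustrative assumptions, not choices made in the text), with the normalization constant C(α_0) written out explicitly:

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_symmetric(pi, alpha0):
    """log Dir(pi | alpha0): log C(alpha0) + (alpha0 - 1) * sum_k log(pi_k), as in (10.39)."""
    K = len(pi)
    log_C = gammaln(K * alpha0) - K * gammaln(alpha0)  # symmetric normalization constant
    return log_C + (alpha0 - 1.0) * np.sum(np.log(pi))

# Example: draw mixing coefficients for K = 3 components from the prior and score them.
rng = np.random.default_rng(0)
alpha0 = 2.0                                    # illustrative concentration parameter
pi_sample = rng.dirichlet(alpha0 * np.ones(3))  # one draw of pi from Dir(pi | alpha0)
print(pi_sample, log_dirichlet_symmetric(pi_sample, alpha0))
```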