Pattern Recognition and Machine Learning


of (10.35), and then subsequently determining the q(m) using (10.36). After normalization the resulting values for q(m) can be used for model selection or model averaging in the usual way.
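As a minimal sketch of this normalization step (the numerical values and variable names below are illustrative assumptions, not taken from the text), the unnormalized log values of q(m) can be normalized with a log-sum-exp before being used for selection or averaging:

```python
import numpy as np

# Hypothetical unnormalized log q(m) values for three candidate models
# (purely illustrative; in practice these come from the variational lower bound).
log_q_unnorm = np.array([-105.3, -102.1, -108.7])

# Normalize with log-sum-exp so that the resulting q(m) sum to one.
log_q = log_q_unnorm - np.logaddexp.reduce(log_q_unnorm)
q_m = np.exp(log_q)

selected = int(np.argmax(q_m))  # model selection: pick the most probable model
print(q_m, selected)            # model averaging would weight predictions by q_m
```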

10.2 Illustration: Variational Mixture of Gaussians


We now return to our discussion of the Gaussian mixture model and apply the variational inference machinery developed in the previous section. This will provide a good illustration of the application of variational methods and will also demonstrate how a Bayesian treatment elegantly resolves many of the difficulties associated with the maximum likelihood approach (Attias, 1999b). The reader is encouraged to work through this example in detail as it provides many insights into the practical application of variational methods. Many Bayesian models, corresponding to much more sophisticated distributions, can be solved by straightforward extensions and generalizations of this analysis.
Our starting point is the likelihood function for the Gaussian mixture model, illustrated by the graphical model in Figure 9.6. For each observation x_n we have a corresponding latent variable z_n comprising a 1-of-K binary vector with elements z_{nk} for k = 1, ..., K. As before we denote the observed data set by X = {x_1, ..., x_N}, and similarly we denote the latent variables by Z = {z_1, ..., z_N}. From (9.10) we can write down the conditional distribution of Z, given the mixing coefficients π, in the form

p(\mathbf{Z} | \boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}.    (10.37)
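As a quick numerical illustration of (10.37) (the array names and toy values here are my own assumptions), the log of this distribution for a 1-of-K coded Z is simply a sum of the selected log mixing coefficients:

```python
import numpy as np

def log_p_Z_given_pi(Z, pi):
    """log p(Z|pi) = sum_n sum_k z_nk * log(pi_k), with Z an (N, K) one-hot matrix."""
    return float(np.sum(Z * np.log(pi)))

# Toy example: N = 3 observations, K = 2 components
Z = np.array([[1, 0],
              [0, 1],
              [1, 0]])
pi = np.array([0.3, 0.7])
print(log_p_Z_given_pi(Z, pi))  # equals 2*log(0.3) + log(0.7)
```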

Similarly, from (9.11), we can write down the conditional distribution of the observed data vectors, given the latent variables and the component parameters

p(\mathbf{X} | \mathbf{Z}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}\left(\mathbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k^{-1}\right)^{z_{nk}}    (10.38)

where μ = {μ_k} and Λ = {Λ_k}. Note that we are working in terms of precision matrices rather than covariance matrices as this somewhat simplifies the mathematics.
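A minimal sketch of (10.38) in code (the function and array names, and the shapes, are assumptions of mine for illustration) that evaluates the Gaussian factors directly in terms of the precision matrices:

```python
import numpy as np

def log_gauss_precision(x, mu, Lam):
    """log N(x | mu, Lam^{-1}) parameterized by the precision matrix Lam."""
    D = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Lam)  # log |Lam|
    return 0.5 * (logdet - D * np.log(2.0 * np.pi) - diff @ Lam @ diff)

def log_p_X_given_Z(X, Z, mu, Lam):
    """log p(X | Z, mu, Lambda) = sum_n sum_k z_nk * log N(x_n | mu_k, Lambda_k^{-1}).
    X: (N, D) data, Z: (N, K) one-hot assignments, mu: (K, D), Lam: (K, D, D)."""
    N, K = Z.shape
    return sum(Z[n, k] * log_gauss_precision(X[n], mu[k], Lam[k])
               for n in range(N) for k in range(K))
```

Working with Λ_k directly means the log-determinant enters with a positive sign and no matrix inversion is needed inside the density, which is the kind of simplification alluded to above.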
Next we introduce priors over the parameters μ, Λ and π. The analysis is considerably simplified if we use conjugate prior distributions (Section 10.4.1). We therefore choose a Dirichlet distribution over the mixing coefficients π


p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} | \alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}    (10.39)

where by symmetry we have chosen the same parameter α_0 for each of the components, and C(α_0) is the normalization constant for the Dirichlet distribution.
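A small sketch of this symmetric Dirichlet prior (the value of α_0 and the function name are illustrative assumptions, not choices made in the text), with the normalization constant C(α_0) written out explicitly:

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_symmetric(pi, alpha0):
    """log Dir(pi | alpha0): log C(alpha0) + (alpha0 - 1) * sum_k log(pi_k), as in (10.39)."""
    K = len(pi)
    log_C = gammaln(K * alpha0) - K * gammaln(alpha0)  # symmetric normalization constant
    return log_C + (alpha0 - 1.0) * np.sum(np.log(pi))

# Example: draw mixing coefficients for K = 3 components from the prior and score them.
rng = np.random.default_rng(0)
alpha0 = 2.0                                    # illustrative concentration parameter
pi_sample = rng.dirichlet(alpha0 * np.ones(3))  # one draw of pi from Dir(pi | alpha0)
print(pi_sample, log_dirichlet_symmetric(pi_sample, alpha0))
```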