where p(π, μ, Λ|X) is the (unknown) true posterior distribution of the parameters.
Using (10.37) and (10.38) we can first perform the summation over ẑ to give
\[
p(\widehat{\mathbf{x}}\,|\,\mathbf{X}) = \sum_{k=1}^{K} \iiint \pi_k\, \mathcal{N}\!\left(\widehat{\mathbf{x}}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k^{-1}\right) p(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}\,|\,\mathbf{X})\,\mathrm{d}\boldsymbol{\pi}\,\mathrm{d}\boldsymbol{\mu}\,\mathrm{d}\boldsymbol{\Lambda}. \tag{10.79}
\]
Because the remaining integrations are intractable, we approximate the predictive
density by replacing the true posterior distribution p(π, μ, Λ|X) with its variational
approximation q(π)q(μ, Λ) to give
\[
p(\widehat{\mathbf{x}}\,|\,\mathbf{X}) \simeq \sum_{k=1}^{K} \iiint \pi_k\, \mathcal{N}\!\left(\widehat{\mathbf{x}}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k^{-1}\right) q(\boldsymbol{\pi})\, q(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k)\,\mathrm{d}\boldsymbol{\pi}\,\mathrm{d}\boldsymbol{\mu}_k\,\mathrm{d}\boldsymbol{\Lambda}_k \tag{10.80}
\]
where we have made use of the factorization (10.55) and in each term we have implicitly
integrated out all variables {μ_j, Λ_j} for j ≠ k. The remaining integrations
can now be evaluated analytically (Exercise 10.19), giving a mixture of Student's t-distributions
\[
p(\widehat{\mathbf{x}}\,|\,\mathbf{X}) = \frac{1}{\widehat{\alpha}} \sum_{k=1}^{K} \alpha_k\, \mathrm{St}\!\left(\widehat{\mathbf{x}}\,|\,\mathbf{m}_k, \mathbf{L}_k, \nu_k + 1 - D\right) \tag{10.81}
\]
in which the kth component has mean m_k, and the precision is given by
\[
\mathbf{L}_k = \frac{(\nu_k + 1 - D)\,\beta_k}{(1 + \beta_k)}\, \mathbf{W}_k \tag{10.82}
\]
in which ν_k is given by (10.63). When the size N of the data set is large, the predictive
distribution (10.81) reduces to a mixture of Gaussians (Exercise 10.20).
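As a concrete illustration of how (10.81) and (10.82) can be evaluated in practice, the following minimal NumPy sketch computes the approximate predictive density at a single test point. It assumes the converged variational parameters α_k, β_k, m_k, W_k and ν_k are already available as arrays; the function and variable names are illustrative only and are not part of the text.

```python
import numpy as np
from scipy.special import gammaln

def student_t_pdf(x, mean, precision, dof):
    """Multivariate Student's t density St(x | mean, precision, dof)."""
    D = mean.shape[0]
    diff = x - mean
    # Squared Mahalanobis distance under the precision matrix L.
    delta2 = diff @ precision @ diff
    _, logdet = np.linalg.slogdet(precision)
    log_norm = (gammaln((dof + D) / 2.0) - gammaln(dof / 2.0)
                + 0.5 * logdet - 0.5 * D * np.log(dof * np.pi))
    return np.exp(log_norm - 0.5 * (dof + D) * np.log1p(delta2 / dof))

def predictive_density(x_hat, alpha, beta, m, W, nu):
    """Approximate predictive density (10.81) at a single test point x_hat.

    alpha, beta, nu: shape (K,); m: shape (K, D); W: shape (K, D, D)
    are the converged variational parameters of the Gaussian mixture.
    """
    K, D = m.shape
    density = 0.0
    for k in range(K):
        dof_k = nu[k] + 1.0 - D
        # Precision of the k-th Student's t component, equation (10.82).
        L_k = (dof_k * beta[k]) / (1.0 + beta[k]) * W[k]
        density += alpha[k] * student_t_pdf(x_hat, m[k], L_k, dof_k)
    return density / np.sum(alpha)
```

Because ν_k grows with the amount of data assigned to component k, each Student's t factor in this sum approaches a Gaussian for large N, consistent with the remark above.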
10.2.4 Determining the number of components
We have seen that the variational lower bound can be used to determine a posterior
distribution over the number K of components in the mixture model (Section 10.1.4). There
is, however, one subtlety that needs to be addressed. For any given setting of the
parameters in a Gaussian mixture model (except for specific degenerate settings),
there will exist other parameter settings for which the density over the observed vari-
ables will be identical. These parameter values differ only through a re-labelling of
the components. For instance, consider a mixture of two Gaussians and a single observed
variable x, in which the parameters have the values π_1 = a, π_2 = b, μ_1 = c,
μ_2 = d, σ_1 = e, σ_2 = f. Then the parameter values π_1 = b, π_2 = a, μ_1 = d,
μ_2 = c, σ_1 = f, σ_2 = e, in which the two components have been exchanged, will
by symmetry give rise to the same value of p(x). If we have a mixture model comprising
K components, then each parameter setting will be a member of a family of
K! equivalent settings (Exercise 10.21).
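This relabelling symmetry is easy to check numerically. The short sketch below, using arbitrary illustrative parameter values, evaluates the density of a two-component one-dimensional Gaussian mixture and verifies that exchanging the two components leaves p(x) unchanged.

```python
import numpy as np
from scipy.stats import norm

def gmm_density(x, pi, mu, sigma):
    """Density of a one-dimensional Gaussian mixture evaluated at the points x."""
    return sum(p * norm.pdf(x, loc=m, scale=s) for p, m, s in zip(pi, mu, sigma))

x = np.linspace(-5.0, 5.0, 7)
# One labelling: pi1 = a, pi2 = b, mu1 = c, mu2 = d, sigma1 = e, sigma2 = f.
original = gmm_density(x, pi=[0.3, 0.7], mu=[-1.0, 2.0], sigma=[0.5, 1.5])
# The relabelled setting, with the two components exchanged.
swapped = gmm_density(x, pi=[0.7, 0.3], mu=[2.0, -1.0], sigma=[1.5, 0.5])
assert np.allclose(original, swapped)  # identical densities, by symmetry
```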
In the context of maximum likelihood, this redundancy is irrelevant because the
parameter optimization algorithm (for example EM) will, depending on the initial-
ization of the parameters, find one specific solution, and the other equivalent solu-
tions play no role. In a Bayesian setting, however, we marginalize over all possible