where p(π, μ, Λ|X) is the (unknown) true posterior distribution of the parameters.
Using (10.37) and (10.38) we can first perform the summation over ẑ to give
\[
p(\widehat{\mathbf{x}}\,|\,\mathbf{X}) = \sum_{k=1}^{K} \iiint \pi_k\, \mathcal{N}\!\left(\widehat{\mathbf{x}}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k^{-1}\right) p(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}\,|\,\mathbf{X})\,\mathrm{d}\boldsymbol{\pi}\,\mathrm{d}\boldsymbol{\mu}\,\mathrm{d}\boldsymbol{\Lambda}. \tag{10.79}
\]
Because the remaining integrations are intractable, we approximate the predictive
density by replacing the true posterior distribution p(π, μ, Λ|X) with its variational
approximation q(π)q(μ, Λ) to give
\[
p(\widehat{\mathbf{x}}\,|\,\mathbf{X}) \simeq \sum_{k=1}^{K} \iiint \pi_k\, \mathcal{N}\!\left(\widehat{\mathbf{x}}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k^{-1}\right) q(\boldsymbol{\pi})\, q(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k)\,\mathrm{d}\boldsymbol{\pi}\,\mathrm{d}\boldsymbol{\mu}_k\,\mathrm{d}\boldsymbol{\Lambda}_k \tag{10.80}
\]
where we have made use of the factorization (10.55) and in each term we have implicitly
integrated out all variables {μ_j, Λ_j} for j ≠ k. The remaining integrations
can now be evaluated analytically (Exercise 10.19), giving a mixture of Student's t-distributions
\[
p(\widehat{\mathbf{x}}\,|\,\mathbf{X}) = \frac{1}{\widehat{\alpha}} \sum_{k=1}^{K} \alpha_k\, \mathrm{St}\!\left(\widehat{\mathbf{x}}\,|\,\mathbf{m}_k, \mathbf{L}_k, \nu_k + 1 - D\right) \tag{10.81}
\]
in which the kth component has mean m_k, and the precision is given by
\[
\mathbf{L}_k = \frac{(\nu_k + 1 - D)\,\beta_k}{(1 + \beta_k)}\, \mathbf{W}_k \tag{10.82}
\]
in which ν_k is given by (10.63). When the size N of the data set is large, the predictive
distribution (10.81) reduces to a mixture of Gaussians (Exercise 10.20).
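As a concrete illustration of how (10.81) and (10.82) can be evaluated in practice, the following minimal NumPy sketch computes the approximate predictive density at a single test point. It assumes the converged variational parameters α_k, β_k, m_k, W_k and ν_k are already available as arrays; the function and variable names are illustrative only and are not part of the text.

```python
import numpy as np
from scipy.special import gammaln

def student_t_pdf(x, mean, precision, dof):
    """Multivariate Student's t density St(x | mean, precision, dof)."""
    D = mean.shape[0]
    diff = x - mean
    # Squared Mahalanobis distance under the precision matrix L.
    delta2 = diff @ precision @ diff
    _, logdet = np.linalg.slogdet(precision)
    log_norm = (gammaln((dof + D) / 2.0) - gammaln(dof / 2.0)
                + 0.5 * logdet - 0.5 * D * np.log(dof * np.pi))
    return np.exp(log_norm - 0.5 * (dof + D) * np.log1p(delta2 / dof))

def predictive_density(x_hat, alpha, beta, m, W, nu):
    """Approximate predictive density (10.81) at a single test point x_hat.

    alpha, beta, nu: shape (K,); m: shape (K, D); W: shape (K, D, D)
    are the converged variational parameters of the Gaussian mixture.
    """
    K, D = m.shape
    density = 0.0
    for k in range(K):
        dof_k = nu[k] + 1.0 - D
        # Precision of the k-th Student's t component, equation (10.82).
        L_k = (dof_k * beta[k]) / (1.0 + beta[k]) * W[k]
        density += alpha[k] * student_t_pdf(x_hat, m[k], L_k, dof_k)
    return density / np.sum(alpha)
```

Because ν_k grows with the amount of data assigned to component k, each Student's t factor in this sum approaches a Gaussian for large N, consistent with the remark above.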
10.2.4 Determining the number of components
We have seen that the variational lower bound can be used to determine a posterior
distribution over the number K of components in the mixture model (Section 10.1.4). There
is, however, one subtlety that needs to be addressed. For any given setting of the
parameters in a Gaussian mixture model (except for specific degenerate settings),
there will exist other parameter settings for which the density over the observed vari-
ables will be identical. These parameter values differ only through a re-labelling of
the components. For instance, consider a mixture of two Gaussians and a single observed
variable x, in which the parameters have the values π_1 = a, π_2 = b, μ_1 = c,
μ_2 = d, σ_1 = e, σ_2 = f. Then the parameter values π_1 = b, π_2 = a, μ_1 = d,
μ_2 = c, σ_1 = f, σ_2 = e, in which the two components have been exchanged, will
by symmetry give rise to the same value of p(x). If we have a mixture model comprising
K components, then each parameter setting will be a member of a family of
K! equivalent settings (Exercise 10.21).
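This relabelling symmetry is easy to check numerically. The short sketch below, using arbitrary illustrative parameter values, evaluates the density of a two-component one-dimensional Gaussian mixture and verifies that exchanging the two components leaves p(x) unchanged.

```python
import numpy as np
from scipy.stats import norm

def gmm_density(x, pi, mu, sigma):
    """Density of a one-dimensional Gaussian mixture evaluated at the points x."""
    return sum(p * norm.pdf(x, loc=m, scale=s) for p, m, s in zip(pi, mu, sigma))

x = np.linspace(-5.0, 5.0, 7)
# One labelling: pi1 = a, pi2 = b, mu1 = c, mu2 = d, sigma1 = e, sigma2 = f.
original = gmm_density(x, pi=[0.3, 0.7], mu=[-1.0, 2.0], sigma=[0.5, 1.5])
# The relabelled setting, with the two components exchanged.
swapped = gmm_density(x, pi=[0.7, 0.3], mu=[2.0, -1.0], sigma=[1.5, 0.5])
assert np.allclose(original, swapped)  # identical densities, by symmetry
```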
In the context of maximum likelihood, this redundancy is irrelevant because the
parameter optimization algorithm (for example EM) will, depending on the initial-
ization of the parameters, find one specific solution, and the other equivalent solu-
tions play no role. In a Bayesian setting, however, we marginalize over all possible