$q_\mu(\mu)$ in the form
\[
  \mathbb{E}[\mu] = \overline{x}, \qquad
  \mathbb{E}[\mu^2] = \overline{x}^{\,2} + \frac{1}{N\,\mathbb{E}[\tau]}. \tag{10.32}
\]
Exercise 10.9 We can now substitute these moments into (10.31) and then solve for $\mathbb{E}[\tau]$ to give
\[
  \frac{1}{\mathbb{E}[\tau]} = \frac{N}{N-1}\left(\overline{x^2} - \overline{x}^{\,2}\right)
  = \frac{1}{N-1}\sum_{n=1}^{N}\left(x_n - \overline{x}\right)^2. \tag{10.33}
\]
Section 1.2.4 We recognize the right-hand side as the familiar unbiased estimator for the variance of a univariate Gaussian distribution, and so we see that the use of a Bayesian approach has avoided the bias of the maximum likelihood solution.
10.1.4 Model comparison
As well as performing inference over the hidden variables $\mathbf{Z}$, we may also wish to compare a set of candidate models, labelled by the index $m$, and having prior probabilities $p(m)$. Our goal is then to approximate the posterior probabilities $p(m|\mathbf{X})$, where $\mathbf{X}$ is the observed data. This is a slightly more complex situation than that considered so far because different models may have different structure and indeed different dimensionality for the hidden variables $\mathbf{Z}$. We cannot therefore simply consider a factorized approximation $q(\mathbf{Z})q(m)$, but must instead recognize that the posterior over $\mathbf{Z}$ must be conditioned on $m$, and so we must consider $q(\mathbf{Z},m) = q(\mathbf{Z}|m)\,q(m)$.
Exercise 10.10 We can readily verify the following decomposition based on this variational distribution
\[
  \ln p(\mathbf{X}) = \mathcal{L}_m - \sum_{m}\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\,q(m)
  \ln\left\{\frac{p(\mathbf{Z},m|\mathbf{X})}{q(\mathbf{Z}|m)\,q(m)}\right\} \tag{10.34}
\]
where $\mathcal{L}_m$ is a lower bound on $\ln p(\mathbf{X})$ and is given by
\[
  \mathcal{L}_m = \sum_{m}\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\,q(m)
  \ln\left\{\frac{p(\mathbf{Z},\mathbf{X},m)}{q(\mathbf{Z}|m)\,q(m)}\right\}. \tag{10.35}
\]
Here we are assuming discrete $\mathbf{Z}$, but the same analysis applies to continuous latent variables provided the summations are replaced with integrations.
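For discrete $\mathbf{Z}$, the decomposition can be checked directly on a toy problem. The sketch below is a minimal illustration, not part of the text: all probability tables are made-up numbers, the joint $p(\mathbf{Z},\mathbf{X},m)$ is evaluated at a single fixed observation, and $q(\mathbf{Z}|m)\,q(m)$ is chosen arbitrarily. It verifies numerically that $\ln p(\mathbf{X})$ equals the bound (10.35) plus the Kullback-Leibler term appearing in (10.34).

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 3, 4                                   # number of models and of latent states (illustrative)

# Joint p(Z, X, m) at the single observed X: an arbitrary non-negative table.
p_ZXm = rng.random((M, K))
p_X = p_ZXm.sum()                             # p(X) = sum_m sum_Z p(Z, X, m)
p_Zm_given_X = p_ZXm / p_X                    # posterior p(Z, m | X)

# Arbitrary variational distribution q(Z, m) = q(Z | m) q(m).
q_Z_given_m = rng.random((M, K))
q_Z_given_m /= q_Z_given_m.sum(axis=1, keepdims=True)
q_m = rng.random(M)
q_m /= q_m.sum()
q_Zm = q_Z_given_m * q_m[:, None]

# Lower bound (10.35) and the Kullback-Leibler term of (10.34).
L = np.sum(q_Zm * np.log(p_ZXm / q_Zm))
KL = np.sum(q_Zm * np.log(q_Zm / p_Zm_given_X))

print(np.log(p_X))   # ln p(X)
print(L + KL)        # identical, as required by (10.34)
```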
Exercise 10.11 We can maximize $\mathcal{L}_m$ with respect to the distribution $q(m)$ using a Lagrange multiplier, with the result
\[
  q(m) \propto p(m)\exp\{\mathcal{L}_m\}. \tag{10.36}
\]
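In practice the normalisation in (10.36) is best carried out in the log domain, since the bounds are typically large negative numbers. A minimal sketch, assuming a vector of per-model bounds $\mathcal{L}_m$ is already available (how such bounds are obtained is discussed next; the numerical values below are placeholders):

```python
import numpy as np

# Placeholder per-model lower bounds L_m and prior model probabilities p(m);
# the numbers are purely illustrative.
L_m = np.array([-1523.4, -1518.9, -1530.2])
p_m = np.array([0.5, 0.3, 0.2])

log_unnorm = np.log(p_m) + L_m                            # ln p(m) + L_m, cf. (10.36)
log_q_m = log_unnorm - np.logaddexp.reduce(log_unnorm)    # log-sum-exp normalisation
q_m = np.exp(log_q_m)

print(q_m)   # approximate posterior model probabilities q(m)
```

Subtracting the log-sum-exp of $\ln p(m) + \mathcal{L}_m$ before exponentiating avoids the underflow that would result from evaluating $\exp\{\mathcal{L}_m\}$ directly.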
However, if we maximize $\mathcal{L}_m$ with respect to the $q(\mathbf{Z}|m)$, we find that the solutions for different $m$ are coupled, as we expect because they are conditioned on $m$. We proceed instead by first optimizing each of the $q(\mathbf{Z}|m)$ individually by optimization