10.1. Variational Inference

optimal factor $q_1(z_1)$. In doing so it is useful to note that on the right-hand side we
only need to retain those terms that have some functional dependence on $z_1$, because
all other terms can be absorbed into the normalization constant. Thus we have

\[
\begin{aligned}
\ln q_1(z_1) &= \mathbb{E}_{z_2}\left[\ln p(\mathbf{z})\right] + \text{const} \\
&= \mathbb{E}_{z_2}\!\left[ -\tfrac{1}{2}(z_1-\mu_1)^2\Lambda_{11} - (z_1-\mu_1)\Lambda_{12}(z_2-\mu_2) \right] + \text{const} \\
&= -\tfrac{1}{2}z_1^2\Lambda_{11} + z_1\mu_1\Lambda_{11} - z_1\Lambda_{12}\left(\mathbb{E}[z_2]-\mu_2\right) + \text{const}.
\end{aligned}
\tag{10.11}
\]

Next we observe that the right-hand side of this expression is a quadratic function of
$z_1$, and so we can identify $q(z_1)$ as a Gaussian distribution. It is worth emphasizing
that we did not assume that $q(z_i)$ is Gaussian, but rather we derived this result by
variational optimization of the KL divergence over all possible distributions $q(z_i)$.
Note also that we do not need to consider the additive constant in (10.9) explicitly,
because it represents the normalization constant that can be found at the end by
inspection if required. Using the technique of completing the square (Section 2.3.1), we can identify
the mean and precision of this Gaussian, giving

\[
q(z_1) = \mathcal{N}\!\left(z_1 \mid m_1, \Lambda_{11}^{-1}\right)
\tag{10.12}
\]

where
\[
m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\left(\mathbb{E}[z_2] - \mu_2\right).
\tag{10.13}
\]
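To make the completing-the-square step explicit (a short check added here, not part of the original text), the coefficients in (10.11) can be matched against the log of a Gaussian in $z_1$:
\[
\ln \mathcal{N}\!\left(z_1 \mid m_1, \Lambda_{11}^{-1}\right) = -\tfrac{1}{2}\Lambda_{11}z_1^2 + \Lambda_{11}m_1 z_1 + \text{const}.
\]
The quadratic coefficient identifies the precision as $\Lambda_{11}$, and equating the linear coefficients gives $\Lambda_{11}m_1 = \mu_1\Lambda_{11} - \Lambda_{12}(\mathbb{E}[z_2]-\mu_2)$, which rearranges to (10.13).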
By symmetry, $q_2(z_2)$ is also Gaussian and can be written as

\[
q_2(z_2) = \mathcal{N}\!\left(z_2 \mid m_2, \Lambda_{22}^{-1}\right)
\tag{10.14}
\]

in which
\[
m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\left(\mathbb{E}[z_1] - \mu_1\right).
\tag{10.15}
\]
Note that these solutions are coupled, so that $q(z_1)$ depends on expectations com-
puted with respect to $q(z_2)$ and vice versa. In general, we address this by treating
the variational solutions as re-estimation equations and cycling through the variables
in turn, updating them until some convergence criterion is satisfied. We shall see
an example of this shortly. Here, however, we note that the problem is sufficiently
simple that a closed-form solution can be found. In particular, because $\mathbb{E}[z_1] = m_1$
and $\mathbb{E}[z_2] = m_2$, we see that the two equations are satisfied if we take $\mathbb{E}[z_1] = \mu_1$
and $\mathbb{E}[z_2] = \mu_2$, and it is easily shown that this is the only solution provided the dis-
tribution is nonsingular (Exercise 10.2). This result is illustrated in Figure 10.2(a). We see that the
mean is correctly captured but that the variance of $q(\mathbf{z})$ is controlled by the direction
of smallest variance of $p(\mathbf{z})$, and that the variance along the orthogonal direction is
significantly under-estimated. It is a general result that a factorized variational ap-
proximation tends to give approximations to the posterior distribution that are too
compact.
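The cycling of the re-estimation equations and the compactness of the factorized approximation can be illustrated with a small numerical sketch (added here for illustration; the particular values of mu and Lam below are arbitrary assumptions, not taken from the text):

```python
import numpy as np

# Minimal sketch: cycle the coupled re-estimation equations (10.13) and (10.15)
# for a factorized approximation to a bivariate Gaussian p(z) = N(z | mu, inv(Lam)).
# The values of mu and Lam are arbitrary illustrative choices.

mu = np.array([1.0, 2.0])            # true mean (mu_1, mu_2)
Lam = np.array([[2.0, 1.5],          # precision matrix Lambda (positive definite)
                [1.5, 2.0]])

m1, m2 = 0.0, 0.0                    # initial guesses for E[z_1], E[z_2]
for _ in range(100):                 # cycle through the variables in turn
    m1 = mu[0] - Lam[0, 1] * (m2 - mu[1]) / Lam[0, 0]   # equation (10.13)
    m2 = mu[1] - Lam[1, 0] * (m1 - mu[0]) / Lam[1, 1]   # equation (10.15)

print("converged means:", m1, m2)    # approaches (mu_1, mu_2), the unique fixed point

# The factorized variances are 1/Lambda_11 and 1/Lambda_22, which are smaller than
# the true marginal variances diag(inv(Lambda)): the approximation is too compact.
Sigma = np.linalg.inv(Lam)
print("q variances:         ", 1.0 / Lam[0, 0], 1.0 / Lam[1, 1])
print("p marginal variances:", Sigma[0, 0], Sigma[1, 1])
```

For the precision matrix chosen above, the factorized variances are $1/\Lambda_{11} = 1/\Lambda_{22} = 0.5$, whereas the true marginal variances of $p(\mathbf{z})$ are about $1.14$, consistent with the under-estimation described in the text.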
By way of comparison, suppose instead that we had been minimizing the reverse
Kullback-Leibler divergence $\mathrm{KL}(p\,\|\,q)$. As we shall see, this form of KL divergence
