of divergences (Ali and Silvey, 1966; Amari, 1985; Minka, 2005) defined by
$$D_\alpha(p\|q) = \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\,\mathrm{d}x\right) \tag{10.19}$$
where −∞ < α < ∞ is a continuous parameter. The Kullback-Leibler divergence KL(p‖q) corresponds to the limit α → 1, whereas KL(q‖p) corresponds to the limit α → −1 (Exercise 10.6). For all values of α we have Dα(p‖q) ≥ 0, with equality if, and only if, p(x) = q(x). Suppose p(x) is a fixed distribution, and we minimize Dα(p‖q) with respect to some set of distributions q(x). Then for α ≤ −1 the divergence is zero forcing, so that any values of x for which p(x) = 0 will have q(x) = 0, and typically q(x) will under-estimate the support of p(x) and will tend to seek the mode with the largest mass. Conversely for α ≥ 1 the divergence is zero-avoiding, so that values of x for which p(x) > 0 will have q(x) > 0, and typically q(x) will stretch to cover all of p(x), and will over-estimate the support of p(x). When α = 0 we obtain a symmetric divergence that is linearly related to the Hellinger distance given by
$$D_{\mathrm{H}}(p\|q) = \int \left(p(x)^{1/2} - q(x)^{1/2}\right)^2 \mathrm{d}x. \tag{10.20}$$
The square root of the Hellinger distance is a valid distance metric.
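The zero-forcing versus zero-avoiding behaviour is easy to see numerically. The following sketch (not from the text) evaluates Dα(p‖q) for a discretized bimodal p against two candidate approximations, one hugging a single mode and one spreading over both; the grid and the particular distributions are illustrative choices. Negative α favours the mode-seeking candidate, positive α favours the covering one, and α = 0 recovers twice the Hellinger distance of (10.20).

```python
import numpy as np

def alpha_div(p, q, alpha):
    """D_alpha(p||q) of (10.19) for discrete distributions
    (alpha != +/-1; those values are the KL limits)."""
    s = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

def hellinger(p, q):
    """Hellinger distance of (10.20) for discrete distributions."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

# Discretized bimodal target p and two unimodal candidates q.
x = np.linspace(-6.0, 6.0, 2001)
bump = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2)

p = bump(-2.0, 0.5) + bump(2.0, 0.5); p /= p.sum()
q_mode = bump(2.0, 0.5); q_mode /= q_mode.sum()     # hugs one mode of p
q_cover = bump(0.0, 2.5); q_cover /= q_cover.sum()  # spreads over both modes

for a in (-2.0, 0.0, 2.0):
    print(f"alpha = {a:+.0f}:  D(p||q_mode) = {alpha_div(p, q_mode, a):10.3e}"
          f"   D(p||q_cover) = {alpha_div(p, q_cover, a):10.3e}")

# At alpha = 0 the divergence is twice the Hellinger distance (10.20).
assert np.isclose(alpha_div(p, q_mode, 0.0), 2.0 * hellinger(p, q_mode))
```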
10.1.3 Example: The univariate Gaussian
We now illustrate the factorized variational approximation using a Gaussian distribution over a single variable x (MacKay, 2003). Our goal is to infer the posterior distribution for the mean μ and precision τ, given a data set $\mathcal{D} = \{x_1, \ldots, x_N\}$ of observed values of x which are assumed to be drawn independently from the Gaussian. The likelihood function is given by
$$p(\mathcal{D}\,|\,\mu,\tau) = \left(\frac{\tau}{2\pi}\right)^{N/2} \exp\left\{-\frac{\tau}{2}\sum_{n=1}^{N}(x_n-\mu)^2\right\}. \tag{10.21}$$
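As a quick concreteness check on (10.21), its logarithm is straightforward to evaluate; the helper below is an illustrative sketch, with names of our own choosing.

```python
import numpy as np

def log_likelihood(X, mu, tau):
    """log p(D | mu, tau) from (10.21):
    (N/2) log(tau / 2 pi) - (tau/2) sum_n (x_n - mu)^2."""
    X = np.asarray(X)
    return (0.5 * X.size * np.log(tau / (2.0 * np.pi))
            - 0.5 * tau * np.sum((X - mu) ** 2))

print(log_likelihood([0.9, 1.1, 1.3], mu=1.0, tau=4.0))
```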
We now introduce conjugate prior distributions for μ and τ given by
$$p(\mu\,|\,\tau) = \mathcal{N}\!\left(\mu\,|\,\mu_0, (\lambda_0\tau)^{-1}\right) \tag{10.22}$$
$$p(\tau) = \mathrm{Gam}(\tau\,|\,a_0, b_0) \tag{10.23}$$
where Gam(τ | a₀, b₀) is the gamma distribution defined by (2.146). Together these distributions constitute a Gaussian-Gamma conjugate prior distribution (Section 2.3.6).
For this simple problem the posterior distribution can be found exactly, and again takes the form of a Gaussian-gamma distribution (Exercise 2.44). However, for tutorial purposes we will consider a factorized variational approximation to the posterior distribution given by
$$q(\mu, \tau) = q_\mu(\mu)\, q_\tau(\tau). \tag{10.24}$$
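Anticipating the closed-form factor updates for this conjugate model (each factor is obtained by averaging the log of the joint distribution with respect to the other factor, which makes qμ Gaussian and qτ a gamma distribution), the following is a minimal sketch of the resulting coordinate-ascent iteration. The hyperparameter values, synthetic data, and variable names are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, tau_true = 1.0, 4.0
X = rng.normal(mu_true, tau_true ** -0.5, size=50)  # synthetic data set D
N, xbar = X.size, X.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0  # prior hyperparameters, (10.22)-(10.23)

# Coordinate ascent on q_mu(mu) = N(mu | muN, 1/lamN) and
# q_tau(tau) = Gam(tau | aN, bN).  For this model muN and aN are
# fixed across iterations; lamN and bN are coupled through the
# moments E[tau] = aN / bN and Var[mu] = 1 / lamN.
muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
aN = a0 + 0.5 * (N + 1)
E_tau = a0 / b0                          # initial guess for E[tau]
for _ in range(30):
    lamN = (lam0 + N) * E_tau
    # E_mu[ sum_n (x_n - mu)^2 + lam0 (mu - mu0)^2 ], with Var[mu] = 1/lamN:
    expected_sq = (np.sum((X - muN) ** 2) + N / lamN
                   + lam0 * ((muN - mu0) ** 2 + 1.0 / lamN))
    bN = b0 + 0.5 * expected_sq
    E_tau = aN / bN

print(f"q_mu : N(mu | {muN:.3f}, 1/{lamN:.1f})")
print(f"q_tau: Gam(tau | {aN:.1f}, {bN:.3f}),  E[tau] = {aN / bN:.3f}")
```

Because the model is conjugate, each update has a closed form and the iteration converges in a handful of sweeps; note that the factorized q reproduces the exact posterior mean of μ here but, by construction, discards the posterior coupling between μ and τ.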