The two forms of the Kullback-Leibler divergence are members of the alpha family of divergences (Ali and Silvey, 1966; Amari, 1985; Minka, 2005) defined by

\[
\mathrm{D}_{\alpha}(p\|q) = \frac{4}{1-\alpha^{2}}\left(1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\,\mathrm{d}x\right)
\tag{10.19}
\]

where −∞ < α < ∞ is a continuous parameter. The Kullback-Leibler divergence KL(p‖q) corresponds to the limit α → 1, whereas KL(q‖p) corresponds to the limit α → −1 (Exercise 10.6). For all values of α we have Dα(p‖q) ≥ 0, with equality if, and only if, p(x) = q(x). Suppose p(x) is a fixed distribution, and we minimize Dα(p‖q) with respect to some set of distributions q(x). Then for α ≤ −1 the divergence is zero-forcing, so that any values of x for which p(x) = 0 will have q(x) = 0, and typically q(x) will under-estimate the support of p(x) and will tend to seek the mode with the largest mass. Conversely for α ≥ 1 the divergence is zero-avoiding, so that values of x for which p(x) > 0 will have q(x) > 0, and typically q(x) will stretch to cover all of p(x), and will over-estimate the support of p(x). When α = 0 we obtain a symmetric divergence that is linearly related to the Hellinger distance given by


\[
\mathrm{D}_{\mathrm{H}}(p\|q) = \int \left( p(x)^{1/2} - q(x)^{1/2} \right)^{2} \mathrm{d}x.
\tag{10.20}
\]

The square root of the Hellinger distance is a valid distance metric.
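
The mode-seeking and mass-covering behaviours are easy to verify numerically. The sketch below evaluates (10.19) on a discrete grid for a bimodal p and two Gaussian candidates for q; the densities, grid, and parameter values are illustrative choices, not taken from the text. At α = 0 the divergence is twice the Hellinger distance (10.20), which the final assertion checks.

```python
import numpy as np

# Numerical sketch of the alpha divergence (10.19) on a discrete grid.
# The grid is chosen wide enough that neither density underflows to zero.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def alpha_div(p, q, alpha):
    """D_alpha(p||q) from (10.19); alpha = +/-1 are treated as the KL limits."""
    if np.isclose(alpha, 1.0):        # limit alpha -> 1:  KL(p||q)
        return np.sum(p * np.log(p / q)) * dx
    if np.isclose(alpha, -1.0):       # limit alpha -> -1: KL(q||p)
        return np.sum(q * np.log(q / p)) * dx
    integral = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2)) * dx
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - integral)

# p is bimodal; q_narrow sits on the dominant mode, q_wide covers both modes.
p = 0.7 * gaussian(x, -2.0, 0.5) + 0.3 * gaussian(x, 3.0, 0.5)
q_narrow = gaussian(x, -2.0, 0.5)
q_wide = gaussian(x, 0.0, 3.0)

for alpha in (-1.0, 0.0, 1.0):
    print(f"alpha={alpha:+.0f}  narrow={alpha_div(p, q_narrow, alpha):10.3f}"
          f"  wide={alpha_div(p, q_wide, alpha):10.3f}")

# At alpha = 0 the divergence is twice the Hellinger distance (10.20).
hellinger = np.sum((np.sqrt(p) - np.sqrt(q_narrow)) ** 2) * dx
assert np.isclose(alpha_div(p, q_narrow, 0.0), 2.0 * hellinger)
```

With these choices the narrow, mode-seeking candidate scores better at α = −1, while the wide, mass-covering candidate scores better at α = +1, matching the behaviour described above.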

10.1.3 Example: The univariate Gaussian


We now illustrate the factorized variational approximation using a Gaussian distribution over a single variable x (MacKay, 2003). Our goal is to infer the posterior distribution for the mean μ and precision τ, given a data set D = {x_1, ..., x_N} of observed values of x which are assumed to be drawn independently from the Gaussian. The likelihood function is given by

\[
p(\mathcal{D}|\mu,\tau) = \left(\frac{\tau}{2\pi}\right)^{N/2} \exp\left\{ -\frac{\tau}{2} \sum_{n=1}^{N} (x_{n}-\mu)^{2} \right\}.
\tag{10.21}
\]
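
As a quick sanity check on (10.21), the following sketch evaluates its logarithm directly and compares against a standard library density parameterized by the standard deviation σ = τ^(−1/2); the parameter values and synthetic data are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Evaluate the log of the likelihood (10.21) for synthetic Gaussian data.
rng = np.random.default_rng(0)
mu, tau = 1.0, 4.0                        # precision tau = 1 / sigma^2
data = rng.normal(mu, tau ** -0.5, size=20)

N = data.size
log_lik = 0.5 * N * np.log(tau / (2.0 * np.pi)) \
          - 0.5 * tau * np.sum((data - mu) ** 2)

# Cross-check: sum of per-point Gaussian log densities gives the same value.
assert np.isclose(log_lik, norm.logpdf(data, loc=mu, scale=tau ** -0.5).sum())
```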


We now introduce conjugate prior distributions for μ and τ given by

\begin{align}
p(\mu|\tau) &= \mathcal{N}\!\left(\mu \,|\, \mu_{0}, (\lambda_{0}\tau)^{-1}\right) \tag{10.22}\\
p(\tau) &= \mathrm{Gam}(\tau \,|\, a_{0}, b_{0}) \tag{10.23}
\end{align}

where Gam(τ|a_0, b_0) is the gamma distribution defined by (2.146). Together these distributions constitute a Gaussian-gamma conjugate prior distribution (Section 2.3.6). For this simple problem the posterior distribution can be found exactly, and again takes the form of a Gaussian-gamma distribution (Exercise 2.44). However, for tutorial purposes we will consider a factorized variational approximation to the posterior distribution given by
\[
q(\mu,\tau) = q_{\mu}(\mu)\, q_{\tau}(\tau).
\tag{10.24}
\]
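
As a reference point for this approximation, the exact posterior can be computed in closed form. The sketch below uses the standard Gaussian-gamma conjugate update formulas (the derivation is the subject of Exercise 2.44); the prior settings and synthetic data are illustrative choices, and the names `exact_posterior`, `mu0`, `lam0`, `a0`, `b0` simply mirror the parameters of (10.22) and (10.23).

```python
import numpy as np

def exact_posterior(data, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0):
    """Parameters (mu_N, lam_N, a_N, b_N) of the exact Gaussian-gamma posterior
    p(mu, tau | D) = N(mu | mu_N, (lam_N tau)^-1) Gam(tau | a_N, b_N),
    using the standard normal-gamma conjugate updates."""
    N = data.size
    xbar = data.mean()
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = lam0 + N
    a_N = a0 + N / 2.0
    b_N = (b0 + 0.5 * np.sum((data - xbar) ** 2)
           + lam0 * N * (xbar - mu0) ** 2 / (2.0 * (lam0 + N)))
    return mu_N, lam_N, a_N, b_N

rng = np.random.default_rng(1)
data = rng.normal(1.0, 0.5, size=100)   # illustrative synthetic data
print(exact_posterior(data))
```

Note that in the exact posterior the conditional distribution of μ still depends on τ through the variance (λ_N τ)^(−1), so μ and τ remain coupled; the factorized form (10.24) necessarily discards this coupling.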
