of divergences (Ali and Silvey, 1966; Amari, 1985; Minka, 2005) defined by
$$D_\alpha(p\|q) = \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\,\mathrm{d}x\right) \tag{10.19}$$
where −∞ < α < ∞ is a continuous parameter. The Kullback-Leibler divergence KL(p‖q) corresponds to the limit α → 1, whereas KL(q‖p) corresponds to the limit α → −1 (Exercise 10.6). For all values of α we have Dα(p‖q) ≥ 0, with equality if, and only if, p(x) = q(x). Suppose p(x) is a fixed distribution, and we minimize Dα(p‖q) with respect to some set of distributions q(x). Then for α ≤ −1 the divergence is zero forcing, so that any values of x for which p(x) = 0 will have q(x) = 0, and typically q(x) will under-estimate the support of p(x) and will tend to seek the mode with the largest mass. Conversely for α ≥ 1 the divergence is zero-avoiding, so that values of x for which p(x) > 0 will have q(x) > 0, and typically q(x) will stretch to cover all of p(x), and will over-estimate the support of p(x). When α = 0 we obtain a symmetric divergence that is linearly related to the Hellinger distance given by
$$D_{\mathrm{H}}(p\|q) = \int \left(p(x)^{1/2} - q(x)^{1/2}\right)^2 \mathrm{d}x. \tag{10.20}$$
The square root of the Hellinger distance is a valid distance metric.
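The zero-forcing versus zero-avoiding behaviour is easy to see numerically. The following sketch (not from the text) evaluates Dα(p‖q) for a discretized bimodal p against two candidate approximations, one hugging a single mode and one spreading over both; the grid and the particular distributions are illustrative choices. Negative α favours the mode-seeking candidate, positive α favours the covering one, and α = 0 recovers twice the Hellinger distance of (10.20).

```python
import numpy as np

def alpha_div(p, q, alpha):
    """D_alpha(p||q) of (10.19) for discrete distributions
    (alpha != +/-1; those values are the KL limits)."""
    s = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

def hellinger(p, q):
    """Hellinger distance of (10.20) for discrete distributions."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

# Discretized bimodal target p and two unimodal candidates q.
x = np.linspace(-6.0, 6.0, 2001)
bump = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2)

p = bump(-2.0, 0.5) + bump(2.0, 0.5); p /= p.sum()
q_mode = bump(2.0, 0.5); q_mode /= q_mode.sum()     # hugs one mode of p
q_cover = bump(0.0, 2.5); q_cover /= q_cover.sum()  # spreads over both modes

for a in (-2.0, 0.0, 2.0):
    print(f"alpha = {a:+.0f}:  D(p||q_mode) = {alpha_div(p, q_mode, a):10.3e}"
          f"   D(p||q_cover) = {alpha_div(p, q_cover, a):10.3e}")

# At alpha = 0 the divergence is twice the Hellinger distance (10.20).
assert np.isclose(alpha_div(p, q_mode, 0.0), 2.0 * hellinger(p, q_mode))
```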
10.1.3 Example: The univariate Gaussian
We now illustrate the factorized variational approximation using a Gaussian distribution over a single variable x (MacKay, 2003). Our goal is to infer the posterior distribution for the mean μ and precision τ, given a data set $\mathcal{D} = \{x_1, \ldots, x_N\}$ of observed values of x which are assumed to be drawn independently from the Gaussian. The likelihood function is given by
$$p(\mathcal{D}\,|\,\mu,\tau) = \left(\frac{\tau}{2\pi}\right)^{N/2} \exp\left\{-\frac{\tau}{2}\sum_{n=1}^{N}(x_n-\mu)^2\right\}. \tag{10.21}$$
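As a quick concreteness check on (10.21), its logarithm is straightforward to evaluate; the helper below is an illustrative sketch, with names of our own choosing.

```python
import numpy as np

def log_likelihood(X, mu, tau):
    """log p(D | mu, tau) from (10.21):
    (N/2) log(tau / 2 pi) - (tau/2) sum_n (x_n - mu)^2."""
    X = np.asarray(X)
    return (0.5 * X.size * np.log(tau / (2.0 * np.pi))
            - 0.5 * tau * np.sum((X - mu) ** 2))

print(log_likelihood([0.9, 1.1, 1.3], mu=1.0, tau=4.0))
```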
We now introduce conjugate prior distributions for μ and τ given by
$$p(\mu\,|\,\tau) = \mathcal{N}\!\left(\mu\,|\,\mu_0, (\lambda_0\tau)^{-1}\right) \tag{10.22}$$
$$p(\tau) = \mathrm{Gam}(\tau\,|\,a_0, b_0) \tag{10.23}$$
where Gam(τ | a₀, b₀) is the gamma distribution defined by (2.146). Together these distributions constitute a Gaussian-Gamma conjugate prior distribution (Section 2.3.6).
For this simple problem the posterior distribution can be found exactly, and again takes the form of a Gaussian-gamma distribution (Exercise 2.44). However, for tutorial purposes we will consider a factorized variational approximation to the posterior distribution given by
$$q(\mu, \tau) = q_\mu(\mu)\, q_\tau(\tau). \tag{10.24}$$
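Anticipating the closed-form factor updates for this conjugate model (each factor is obtained by averaging the log of the joint distribution with respect to the other factor, which makes qμ Gaussian and qτ a gamma distribution), the following is a minimal sketch of the resulting coordinate-ascent iteration. The hyperparameter values, synthetic data, and variable names are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, tau_true = 1.0, 4.0
X = rng.normal(mu_true, tau_true ** -0.5, size=50)  # synthetic data set D
N, xbar = X.size, X.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0  # prior hyperparameters, (10.22)-(10.23)

# Coordinate ascent on q_mu(mu) = N(mu | muN, 1/lamN) and
# q_tau(tau) = Gam(tau | aN, bN).  For this model muN and aN are
# fixed across iterations; lamN and bN are coupled through the
# moments E[tau] = aN / bN and Var[mu] = 1 / lamN.
muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
aN = a0 + 0.5 * (N + 1)
E_tau = a0 / b0                          # initial guess for E[tau]
for _ in range(30):
    lamN = (lam0 + N) * E_tau
    # E_mu[ sum_n (x_n - mu)^2 + lam0 (mu - mu0)^2 ], with Var[mu] = 1/lamN:
    expected_sq = (np.sum((X - muN) ** 2) + N / lamN
                   + lam0 * ((muN - mu0) ** 2 + 1.0 / lamN))
    bN = b0 + 0.5 * expected_sq
    E_tau = aN / bN

print(f"q_mu : N(mu | {muN:.3f}, 1/{lamN:.1f})")
print(f"q_tau: Gam(tau | {aN:.1f}, {bN:.3f}),  E[tau] = {aN / bN:.3f}")
```

Because the model is conjugate, each update has a closed form and the iteration converges in a handful of sweeps; note that the factorized q reproduces the exact posterior mean of μ here but, by construction, discards the posterior coupling between μ and τ.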