Pattern Recognition and Machine Learning

10.5. Local Variational Methods

Instead of thinking of λ as the variational parameter, we can let ξ play this role, as this leads to simpler expressions for the conjugate function, which is then given by

$$
g(\lambda) = \lambda(\xi)\,\xi^2 - f(\xi) = \lambda(\xi)\,\xi^2 + \ln\left(e^{\xi/2} + e^{-\xi/2}\right). \tag{10.142}
$$

Hence the bound on f(x) can be written as

$$
f(x) \geq \lambda x^2 - g(\lambda) = \lambda x^2 - \lambda \xi^2 - \ln\left(e^{\xi/2} + e^{-\xi/2}\right). \tag{10.143}
$$
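
The step from (10.143) to a bound on the sigmoid itself can be sketched as follows. The sketch assumes f(x) = −ln(e^{x/2} + e^{−x/2}), which is the form implied by the second equality in (10.142), and uses the identity ln σ(x) = x/2 + f(x):

$$
\begin{aligned}
\ln \sigma(x) &= -\ln\left(1 + e^{-x}\right) = \frac{x}{2} - \ln\left(e^{x/2} + e^{-x/2}\right) = \frac{x}{2} + f(x) \\
&\geq \frac{x}{2} + \lambda x^2 - \lambda \xi^2 - \ln\left(e^{\xi/2} + e^{-\xi/2}\right) \\
&= \ln \sigma(\xi) + \frac{x - \xi}{2} + \lambda \left(x^2 - \xi^2\right).
\end{aligned}
$$

Exponentiating gives σ(x) ≥ σ(ξ) exp{(x − ξ)/2 + λ(x² − ξ²)}, where λ is the slope appearing in (10.143). Because f(x) decreases as x² grows, that slope is non-positive. A form with a minus sign in front of the quadratic term, as in (10.144), therefore corresponds to taking λ(ξ) to be the positive quantity tanh(ξ/2)/(4ξ). This reading of the sign convention is inferred from the equations shown here rather than stated in the text.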

The bound on the sigmoid then becomes

$$
\sigma(x) \geq \sigma(\xi) \exp\left\{ \frac{x - \xi}{2} - \lambda(\xi)\left(x^2 - \xi^2\right) \right\} \tag{10.144}
$$

where λ(ξ) is defined by (10.141). This bound is illustrated in the right-hand plot of Figure 10.12. We see that the bound has the form of the exponential of a quadratic function of x, which will prove useful when we seek Gaussian representations of posterior distributions defined through logistic sigmoid functions (Section 4.5).
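
The bound (10.144) can be checked numerically. The following is a minimal sketch, not a definitive implementation. It assumes the positive convention λ(ξ) = tanh(ξ/2)/(4ξ), with limiting value 1/8 at ξ = 0, which is the sign under which (10.144) is a lower bound as written. The function names `lam` and `sigmoid_lower_bound` are illustrative only.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def lam(xi):
    """Assumed positive convention: lambda(xi) = tanh(xi / 2) / (4 * xi)."""
    if abs(xi) < 1e-8:
        return 0.125  # limit of tanh(xi / 2) / (4 * xi) as xi -> 0
    return math.tanh(xi / 2.0) / (4.0 * xi)


def sigmoid_lower_bound(x, xi):
    """Right-hand side of (10.144): exponential of a quadratic in x."""
    return sigmoid(xi) * math.exp((x - xi) / 2.0 - lam(xi) * (x * x - xi * xi))


# Evaluate the gap sigma(x) - bound on a grid for a few values of xi.
xs = [i / 10.0 for i in range(-60, 61)]
for xi in (0.5, 2.5, 5.0):
    smallest_gap = min(sigmoid(x) - sigmoid_lower_bound(x, xi) for x in xs)
    gap_at_xi = sigmoid(xi) - sigmoid_lower_bound(xi, xi)
    print(f"xi={xi}: smallest gap on grid={smallest_gap:.3e}, gap at x=xi={gap_at_xi:.3e}")
```

Under this convention the gap is non-negative on the grid and vanishes at x = ±ξ, so each choice of ξ gives a bound that touches the sigmoid at two points.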

The logistic sigmoid arises frequently in probabilistic models over binary variables because it is the function that transforms a log odds ratio into a posterior probability. The corresponding transformation for a multiclass distribution is given by the softmax function (Section 4.3). Unfortunately, the lower bound derived here for the logistic sigmoid does not directly extend to the softmax. Gibbs (1997) proposes a method for constructing a Gaussian distribution that is conjectured to be a bound (although no rigorous proof is given), which may be used to apply local variational methods to multiclass problems.

We shall see an example of the use of local variational bounds in Section 10.6.1. For the moment, however, it is instructive to consider in general terms how these bounds can be used. Suppose we wish to evaluate an integral of the form

$$
I = \int \sigma(a)\, p(a)\, \mathrm{d}a \tag{10.145}
$$

where σ(a) is the logistic sigmoid, and p(a) is a Gaussian probability density. Such integrals arise in Bayesian models when, for instance, we wish to evaluate the predictive distribution, in which case p(a) represents a posterior parameter distribution. Because the integral is intractable, we employ the variational bound (10.144), which we write in the form σ(a) ≥ f(a, ξ), where ξ is a variational parameter. The integral now becomes the product of two exponential-quadratic functions and so can be integrated analytically to give a bound on I

$$
I \geq \int f(a, \xi)\, p(a)\, \mathrm{d}a = F(\xi). \tag{10.146}
$$

We now have the freedom to choose the variational parameter ξ, which we do by finding the value ξ* that maximizes the function F(ξ). The resulting value F(ξ*) represents the tightest bound within this family of bounds and can be used as an approximation to I. This optimized bound, however, will in general not be exact.
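
The procedure can be illustrated with a small numerical sketch. It makes several assumptions that do not come from the text: p(a) is taken to be N(a | μ, s²) with the arbitrary values μ = 1 and s² = 2, λ(ξ) uses the positive convention tanh(ξ/2)/(4ξ), and ξ* is found by a plain grid search rather than any particular optimizer. The closed form used for F(ξ) follows from completing the square in the Gaussian integral. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lam(xi):
    # Assumed positive convention; limit 1/8 as xi -> 0.
    return 0.125 if abs(xi) < 1e-8 else math.tanh(xi / 2.0) / (4.0 * xi)


def gaussian_pdf(a, mu, s2):
    return math.exp(-0.5 * (a - mu) ** 2 / s2) / math.sqrt(2.0 * math.pi * s2)


def f_bound(a, xi):
    """Bound (10.144) written as f(a, xi)."""
    return sigmoid(xi) * math.exp((a - xi) / 2.0 - lam(xi) * (a * a - xi * xi))


def F(xi, mu, s2):
    """Closed form of the integral in (10.146) for p(a) = N(a | mu, s2)."""
    l = lam(xi)
    A = l + 0.5 / s2        # coefficient of -a^2 in the combined exponent
    B = 0.5 + mu / s2       # coefficient of a in the combined exponent
    log_F = (math.log(sigmoid(xi)) - xi / 2.0 + l * xi * xi
             - 0.5 * math.log(1.0 + 2.0 * l * s2)
             + B * B / (4.0 * A) - mu * mu / (2.0 * s2))
    return math.exp(log_F)


def integrate(g, lo, hi, n=20000):
    """Composite trapezoidal rule, used only as a numerical reference."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, n):
        total += g(lo + i * h)
    return total * h


mu, s2 = 1.0, 2.0                      # illustrative values, not from the text
s = math.sqrt(s2)
lo, hi = mu - 12.0 * s, mu + 12.0 * s  # range wide enough to cover the Gaussian

# Reference value of I in (10.145) by numerical quadrature.
I_numeric = integrate(lambda a: sigmoid(a) * gaussian_pdf(a, mu, s2), lo, hi)

# Grid search for the xi that maximizes F(xi); F is even in xi, so xi >= 0 suffices.
xi_grid = [i * 0.01 for i in range(0, 1001)]
xi_star = max(xi_grid, key=lambda xi: F(xi, mu, s2))
F_star = F(xi_star, mu, s2)

print(f"I (quadrature) = {I_numeric:.6f}")
print(f"xi* = {xi_star:.2f}, F(xi*) = {F_star:.6f}, gap = {I_numeric - F_star:.6f}")
```

The printed gap is positive, which reflects the closing remark above: even at ξ* the bound remains an approximation to I rather than an exact value.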
