
where

\[
\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \tag{10.100}
\]
\[
\mathbf{S}_N = \left( \mathbb{E}[\alpha]\,\mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1}. \tag{10.101}
\]


Note the close similarity to the posterior distribution (3.52) obtained when $\alpha$ was treated as a fixed parameter. The difference is that here $\alpha$ is replaced by its expectation $\mathbb{E}[\alpha]$ under the variational distribution. Indeed, we have chosen to use the same notation for the covariance matrix $\mathbf{S}_N$ in both cases.
Using the standard results (B.27), (B.38), and (B.39), we can obtain the required moments as follows

\[
\mathbb{E}[\alpha] = a_N / b_N \tag{10.102}
\]
\[
\mathbb{E}[\mathbf{w}\mathbf{w}^{\mathrm{T}}] = \mathbf{m}_N \mathbf{m}_N^{\mathrm{T}} + \mathbf{S}_N. \tag{10.103}
\]

Evaluation of the variational posterior distribution proceeds by initializing the parameters of one of the distributions $q(\mathbf{w})$ or $q(\alpha)$, and then alternately re-estimating these factors in turn until a suitable convergence criterion is satisfied (usually specified in terms of the lower bound to be discussed shortly), as illustrated in the sketch below.
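The following is a minimal NumPy sketch of this alternating scheme, assuming a fixed, known noise precision $\beta$ and using the Gamma updates for $q(\alpha)$ derived earlier in this section. The function name and initialization are illustrative, and for brevity convergence is tested on the change in $\mathbb{E}[\alpha]$ rather than on the lower bound.

```python
import numpy as np

def variational_linear_regression(Phi, t, beta, a0=0.0, b0=0.0,
                                  max_iter=100, tol=1e-8):
    """Alternate the updates of q(w) and q(alpha) until E[alpha] stabilizes.

    Phi  : (N, M) design matrix of basis-function values
    t    : (N,) target vector
    beta : known noise precision
    """
    N, M = Phi.shape
    PhiT_Phi = Phi.T @ Phi
    PhiT_t = Phi.T @ t
    E_alpha = 1.0  # arbitrary initialization of E[alpha]
    for _ in range(max_iter):
        # q(w) is Gaussian with covariance S_N and mean m_N, (10.100)-(10.101)
        S_N = np.linalg.inv(E_alpha * np.eye(M) + beta * PhiT_Phi)
        m_N = beta * S_N @ PhiT_t
        # q(alpha) is Gamma(a_N, b_N) with a_N = a0 + M/2 and
        # b_N = b0 + E[w^T w]/2, where E[w^T w] = m_N^T m_N + Tr(S_N) by (10.103)
        a_N = a0 + 0.5 * M
        b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
        E_alpha_new = a_N / b_N  # (10.102)
        if abs(E_alpha_new - E_alpha) < tol * max(1.0, abs(E_alpha)):
            E_alpha = E_alpha_new
            break
        E_alpha = E_alpha_new
    return m_N, S_N, a_N, b_N
```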
It is instructive to relate the variational solution to that found using the evidence framework in Section 3.5. To do this, consider the case $a_0 = b_0 = 0$, corresponding to the limit of an infinitely broad prior over $\alpha$. The mean of the variational posterior distribution $q(\alpha)$ is then given by

\[
\mathbb{E}[\alpha] = \frac{a_N}{b_N} = \frac{M/2}{\mathbb{E}[\mathbf{w}^{\mathrm{T}}\mathbf{w}]/2} = \frac{M}{\mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N + \mathrm{Tr}(\mathbf{S}_N)} \tag{10.104}
\]

where the last step uses $\mathbb{E}[\mathbf{w}^{\mathrm{T}}\mathbf{w}] = \mathrm{Tr}(\mathbb{E}[\mathbf{w}\mathbf{w}^{\mathrm{T}}]) = \mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N + \mathrm{Tr}(\mathbf{S}_N)$, which follows from (10.103).

Comparison with (9.63) shows that, in the case of this particularly simple model, the variational approach gives precisely the same expression as that obtained by maximizing the evidence function using EM, except that the point estimate for $\alpha$ is replaced by its expected value. Because the distribution $q(\mathbf{w})$ depends on $q(\alpha)$ only through the expectation $\mathbb{E}[\alpha]$, we see that the two approaches will give identical results for the case of an infinitely broad prior.
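As a quick numerical check of (10.104), the sketch above can be run with $a_0 = b_0 = 0$ on synthetic data (the data and noise precision below are hypothetical); the converged mean of $q(\alpha)$ then satisfies the same fixed point as the EM update:

```python
# Hypothetical synthetic data: 50 points, 4 basis functions, noise std 0.1
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))
t = Phi @ np.array([0.5, -1.0, 0.2, 0.0]) + rng.normal(scale=0.1, size=50)
beta = 100.0  # known noise precision (1 / 0.1**2)

m_N, S_N, a_N, b_N = variational_linear_regression(Phi, t, beta, a0=0.0, b0=0.0)
# E[alpha] = M / (m_N^T m_N + Tr(S_N)), as in (10.104)
assert np.isclose(a_N / b_N, 4 / (m_N @ m_N + np.trace(S_N)))
```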

10.3.2 Predictive distribution


The predictive distribution over $t$, given a new input $\mathbf{x}$, is easily evaluated for this model using the Gaussian variational posterior for the parameters

\begin{align*}
p(t|\mathbf{x},\mathbf{t}) &= \int p(t|\mathbf{x},\mathbf{w})\, p(\mathbf{w}|\mathbf{t}) \,\mathrm{d}\mathbf{w} \\
&\simeq \int p(t|\mathbf{x},\mathbf{w})\, q(\mathbf{w}) \,\mathrm{d}\mathbf{w} \\
&= \int \mathcal{N}(t|\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \beta^{-1}) \, \mathcal{N}(\mathbf{w}|\mathbf{m}_N, \mathbf{S}_N) \,\mathrm{d}\mathbf{w} \\
&= \mathcal{N}(t|\mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \sigma^2(\mathbf{x})) \tag{10.105}
\end{align*}

where the input-dependent variance is given by

\[
\sigma^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}). \tag{10.106}
\]
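A minimal sketch of evaluating this predictive density, continuing the NumPy example above (the helper name is illustrative, and $\boldsymbol{\phi}(\mathbf{x})$ is assumed to be supplied directly as a feature vector):

```python
def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance of p(t | x, t) as in (10.105)-(10.106).

    phi_x : (M,) feature vector phi(x) for the new input x
    """
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x  # sigma^2(x) = 1/beta + phi^T S_N phi
    return mean, var

# Example: predict at the first training input
t_mean, t_var = predictive(Phi[0], m_N, S_N, beta)
```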