Pattern Recognition and Machine Learning

5.7. Bayesian Neural Networks 279

form

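$$\ln p(\mathbf{w}|\mathcal{D}) = -\frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} - \frac{\beta}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n,\mathbf{w}) - t_n\}^2 + \text{const} \qquad (5.165)$$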

which corresponds to a regularized sum-of-squares error function. Assuming for
the moment that $\alpha$ and $\beta$ are fixed, we can find a maximum of the
posterior, which we denote $\mathbf{w}_{\mathrm{MAP}}$, by standard nonlinear
optimization algorithms such as conjugate gradients, using error backpropagation
to evaluate the required derivatives.
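
As a rough illustration of this step, the sketch below minimizes the negative of (5.165) for a small network. Everything specific here is an assumption made for the example and does not come from the text: the architecture (one input, three tanh hidden units, one linear output), the synthetic data, the values of $\alpha$ and $\beta$, and the helper names `forward`, `gradient` and `find_w_map`. Plain gradient descent with a backtracking line search stands in for conjugate gradients to keep the code short; the derivatives are evaluated by backpropagation.

```python
import numpy as np

N_HIDDEN = 3  # hypothetical network: 1 input, 3 tanh hidden units, 1 linear output

def forward(w, x):
    """Return network outputs y(x_n, w) and hidden activations for inputs x."""
    h = N_HIDDEN
    w1, b1, w2, b2 = w[:h], w[h:2 * h], w[2 * h:3 * h], w[3 * h]
    z = np.tanh(np.outer(x, w1) + b1)      # hidden activations, shape (N, h)
    return z @ w2 + b2, z

def neg_log_posterior(w, x, t, alpha, beta):
    """Negative of (5.165) with the constant dropped."""
    y, _ = forward(w, x)
    return 0.5 * alpha * (w @ w) + 0.5 * beta * np.sum((y - t) ** 2)

def gradient(w, x, t, alpha, beta):
    """Gradient of neg_log_posterior, evaluated by backpropagation."""
    h = N_HIDDEN
    w2 = w[2 * h:3 * h]
    y, z = forward(w, x)
    delta_out = beta * (y - t)                            # output-unit errors
    delta_hid = (1.0 - z ** 2) * np.outer(delta_out, w2)  # backpropagated errors
    grad = np.concatenate([
        delta_hid.T @ x,          # first-layer weights
        delta_hid.sum(axis=0),    # first-layer biases
        z.T @ delta_out,          # second-layer weights
        [delta_out.sum()],        # output bias
    ])
    return grad + alpha * w       # contribution of the Gaussian prior

def find_w_map(w0, x, t, alpha, beta, n_iter=500):
    """Gradient descent with backtracking; every accepted step lowers the objective."""
    w = w0.copy()
    f = neg_log_posterior(w, x, t, alpha, beta)
    for _ in range(n_iter):
        g = gradient(w, x, t, alpha, beta)
        step = 1.0
        while step > 1e-12:
            w_new = w - step * g
            f_new = neg_log_posterior(w_new, x, t, alpha, beta)
            if f_new <= f - 1e-4 * step * (g @ g):   # sufficient-decrease test
                w, f = w_new, f_new
                break
            step *= 0.5
        else:
            break  # no step gave a decrease: treat w as numerically stationary
    return w

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
t = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)  # synthetic targets
alpha, beta = 0.1, 25.0                                     # assumed fixed
w0 = 0.5 * rng.standard_normal(3 * N_HIDDEN + 1)
w_map = find_w_map(w0, x, t, alpha, beta)
print(neg_log_posterior(w0, x, t, alpha, beta),
      neg_log_posterior(w_map, x, t, alpha, beta))
```

The result is only a local mode reached from one starting point; a different `w0` may lead to a different $\mathbf{w}_{\mathrm{MAP}}$.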
Having found a mode $\mathbf{w}_{\mathrm{MAP}}$, we can then build a local
Gaussian approximation by evaluating the matrix of second derivatives of the
negative log posterior distribution. From (5.165), this is given by

$$\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w}|\mathcal{D},\alpha,\beta) = \alpha\mathbf{I} + \beta\mathbf{H} \qquad (5.166)$$

where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the
sum-of-squares error function with respect to the components of $\mathbf{w}$.
Algorithms for computing and approximating the Hessian were discussed in
Section 5.4. The corresponding Gaussian approximation to the posterior is then
given from (4.134) by

$$q(\mathbf{w}|\mathcal{D}) = \mathcal{N}(\mathbf{w}|\mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}). \qquad (5.167)$$
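
A minimal sketch of (5.166) and (5.167) follows, under several assumptions that are not from the text. The two-weight function `network` is a hypothetical toy model, `w_map` is a stand-in value rather than the output of an optimizer, and $\mathbf{H}$ is replaced by the outer-product approximation $\sum_n \mathbf{b}_n\mathbf{b}_n^{\mathrm{T}}$ with $\mathbf{b}_n = \nabla_{\mathbf{w}} y(\mathbf{x}_n,\mathbf{w})$, which drops the term involving the residuals. That choice keeps $\mathbf{A}$ positive definite for any $\alpha > 0$, so its inverse is a valid covariance.

```python
import numpy as np

def network(x, w):
    """Toy network function y(x, w) = w[1] * tanh(w[0] * x) (hypothetical)."""
    return w[1] * np.tanh(w[0] * x)

def output_gradients(x, w, eps=1e-6):
    """Rows b_n = grad_w y(x_n, w), estimated by central finite differences."""
    cols = []
    for i in range(w.size):
        dw = np.zeros_like(w)
        dw[i] = eps
        cols.append((network(x, w + dw) - network(x, w - dw)) / (2 * eps))
    return np.stack(cols, axis=1)          # shape (N, W)

def laplace_posterior(x, w_map, alpha, beta):
    """Return A = alpha*I + beta*H and the covariance A^{-1} of q(w|D)."""
    b = output_gradients(x, w_map)
    hessian = b.T @ b                      # outer-product approximation to H
    a = alpha * np.eye(w_map.size) + beta * hessian
    return a, np.linalg.inv(a)

x = np.linspace(-1.0, 1.0, 20)
w_map = np.array([0.5, 2.0])   # stand-in for a mode found by an optimizer
alpha, beta = 0.1, 25.0        # assumed fixed hyperparameters
A, cov = laplace_posterior(x, w_map, alpha, beta)

# Draw a few weight vectors from q(w|D) = N(w | w_map, A^{-1}).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(w_map, cov, size=5)
print(A)
print(samples)
```

With the exact Hessian, $\mathbf{A}$ is guaranteed to be positive definite only at a genuine local maximum of the posterior, which is one reason the approximation is built at $\mathbf{w}_{\mathrm{MAP}}$.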

Similarly, the predictive distribution is obtained by marginalizing with respect
to this posterior distribution

$$p(t|\mathbf{x},\mathcal{D}) = \int p(t|\mathbf{x},\mathbf{w})\, q(\mathbf{w}|\mathcal{D})\, \mathrm{d}\mathbf{w}. \qquad (5.168)$$

However, even with the Gaussian approximation to the posterior, this integration
is still analytically intractable due to the nonlinearity of the network function
$y(\mathbf{x},\mathbf{w})$ as a function of $\mathbf{w}$. To make progress, we now
assume that the posterior distribution has small variance compared with the
characteristic scales of $\mathbf{w}$ over which $y(\mathbf{x},\mathbf{w})$ is
varying. This allows us to make a Taylor series expansion of the network function
around $\mathbf{w}_{\mathrm{MAP}}$ and retain only the linear terms

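$$y(\mathbf{x},\mathbf{w}) \simeq y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}) \qquad (5.169)$$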

where we have defined
$$\mathbf{g} = \nabla_{\mathbf{w}}\, y(\mathbf{x},\mathbf{w})\big|_{\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}}. \qquad (5.170)$$
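
The quality of the linearization (5.169) can be checked numerically. The sketch below is only an illustration: the two-weight `network`, the value of `w_map`, the input `x` and the perturbation `direction` are hypothetical choices. It compares the network output with its first-order expansion for two perturbation sizes; because the neglected terms are second order in $\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}$, shrinking the perturbation tenfold should shrink the error by roughly a factor of one hundred.

```python
import numpy as np

def network(x, w):
    """Toy network function y(x, w) = w[1] * tanh(w[0] * x) (hypothetical)."""
    return w[1] * np.tanh(w[0] * x)

def grad_w(x, w):
    """Analytic gradient of the toy network with respect to w, as in (5.170)."""
    z = np.tanh(w[0] * x)
    return np.array([w[1] * x * (1.0 - z ** 2), z])

def linearized(x, w, w_map):
    """Right-hand side of (5.169)."""
    return network(x, w_map) + grad_w(x, w_map) @ (w - w_map)

x = 1.0
w_map = np.array([0.5, 2.0])        # stand-in for a posterior mode
direction = np.array([1.0, -1.0])   # fixed perturbation direction

errors = {}
for scale in (1e-1, 1e-2):
    w = w_map + scale * direction
    errors[scale] = abs(network(x, w) - linearized(x, w, w_map))
    print(scale, errors[scale])
```

This is the sense in which the approximation relies on the posterior being narrow: the weights that carry appreciable posterior mass must lie close enough to $\mathbf{w}_{\mathrm{MAP}}$ for the quadratic remainder to be negligible.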
With this approximation, we now have a linear-Gaussian model with a Gaussian
distribution for $p(\mathbf{w})$ and a Gaussian for $p(t|\mathbf{w})$ whose mean
is a linear function of $\mathbf{w}$ of the form

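$$p(t|\mathbf{x},\mathbf{w},\beta) \simeq \mathcal{N}\left(t \,\big|\, y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}),\ \beta^{-1}\right). \qquad (5.171)$$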


We can therefore make use of the general result (2.115) for the marginal $p(t)$
(Exercise 5.38) to give


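$$p(t|\mathbf{x},\mathcal{D},\alpha,\beta) = \mathcal{N}\left(t \,\big|\, y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}),\ \sigma^2(\mathbf{x})\right) \qquad (5.172)$$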
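
The excerpt ends before $\sigma^2(\mathbf{x})$ is defined. Assuming (2.115) is the usual marginal of a linear-Gaussian model, combining the posterior (5.167) with the likelihood (5.171) would give $\sigma^2(\mathbf{x}) = \beta^{-1} + \mathbf{g}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{g}$: the intrinsic target noise plus the weight uncertainty projected onto $\mathbf{g}$. The sketch below computes this quantity and compares it with a Monte Carlo estimate obtained by sampling the linearized model directly. The numerical values of `w_map`, `A`, `beta` and `x`, and the toy network behind `y_map` and `g`, are hypothetical.

```python
import numpy as np

def predictive(y_map, g, A, beta):
    """Mean and variance of the Gaussian predictive under the linearized model."""
    var = 1.0 / beta + g @ np.linalg.solve(A, g)   # beta^{-1} + g^T A^{-1} g
    return y_map, var

# Illustrative values (hypothetical, not taken from the text).
w_map = np.array([0.5, 2.0])
A = np.array([[4.0, 1.0], [1.0, 3.0]])   # any symmetric positive definite matrix
beta = 25.0
x = 1.0

# Toy network y(x, w) = w[1] * tanh(w[0] * x) and its gradient at w_map.
z = np.tanh(w_map[0] * x)
y_map = w_map[1] * z
g = np.array([w_map[1] * x * (1.0 - z ** 2), z])

mean, var = predictive(y_map, g, A, beta)

# Monte Carlo check: w ~ q(w|D), then t ~ N(linearized output, 1/beta).
rng = np.random.default_rng(0)
n = 200_000
w = rng.multivariate_normal(w_map, np.linalg.inv(A), size=n)
t = y_map + (w - w_map) @ g + rng.standard_normal(n) / np.sqrt(beta)
print(mean, var)
print(t.mean(), t.var())
```

The check only validates the Gaussian algebra for the linearized model; it says nothing about how well the linearization matches the original network.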