form
\ln p(\mathbf{w}|\mathcal{D}) = -\frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} - \frac{\beta}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n,\mathbf{w}) - t_n\}^2 + \mathrm{const} \qquad (5.165)
which corresponds to a regularized sum-of-squares error function. Assuming for
the moment that α and β are fixed, we can find a maximum of the posterior, which
we denote w_MAP, by standard nonlinear optimization algorithms such as conjugate
gradients, using error backpropagation to evaluate the required derivatives.
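
As an illustration of this step, the following minimal sketch finds w_MAP for a toy one-dimensional regression problem. The three-unit tanh network, the synthetic data set, the fixed values of α and β, and the names y, neg_log_post and w_map are all assumptions made for the example; plain gradient descent is used in place of conjugate gradients, with automatic differentiation supplying the derivatives that backpropagation would otherwise provide.

```python
import jax
import jax.numpy as jnp

# Hypothetical two-layer network: one input, three tanh hidden units, one
# linear output.  All 10 parameters are packed into a single flat vector w.
def y(x, w):
    W1, b1, W2, b2 = w[0:3], w[3:6], w[6:9], w[9]
    h = jnp.tanh(W1 * x + b1)        # hidden-unit activations
    return jnp.dot(W2, h) + b2       # network output y(x, w)

# Negative log posterior of (5.165), up to an additive constant.
def neg_log_post(w, X, T, alpha, beta):
    preds = jax.vmap(y, in_axes=(0, None))(X, w)
    return 0.5 * alpha * jnp.dot(w, w) + 0.5 * beta * jnp.sum((preds - T) ** 2)

# Illustrative synthetic data set and fixed hyperparameters.
key = jax.random.PRNGKey(0)
X = jnp.linspace(-1.0, 1.0, 20)
T = jnp.sin(3.0 * X) + 0.1 * jax.random.normal(key, (20,))
alpha, beta = 1.0, 10.0

# Plain gradient descent stands in for conjugate gradients; the required
# derivatives come from automatic differentiation of the log posterior.
grad_fn = jax.jit(jax.grad(neg_log_post))
w = 0.1 * jax.random.normal(key, (10,))
for _ in range(10000):
    w = w - 1e-3 * grad_fn(w, X, T, alpha, beta)
w_map = w                            # mode of the posterior, w_MAP
```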
Having found a mode w_MAP, we can then build a local Gaussian approximation
by evaluating the matrix of second derivatives of the negative log posterior distribu-
tion. From (5.165), this is given by
\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w}|\mathcal{D},\alpha,\beta) = \alpha\mathbf{I} + \beta\mathbf{H} \qquad (5.166)
where H is the Hessian matrix comprising the second derivatives of the sum-of-squares
error function with respect to the components of w. Algorithms for computing and
approximating the Hessian were discussed in Section 5.4. The corresponding
Gaussian approximation to the posterior is then given from (4.134) by
q(\mathbf{w}|\mathcal{D}) = \mathcal{N}(\mathbf{w}|\mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}). \qquad (5.167)
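
Continuing the sketch above, this Laplace approximation can be formed around w_map by evaluating the Hessian of the sum-of-squares error. Here the exact Hessian is obtained by automatic differentiation rather than by the approximation schemes of Section 5.4; the names sum_sq_error, H and A are assumptions of the example.

```python
# Exact Hessian of the sum-of-squares error E_D(w) = (1/2) sum_n {y(x_n,w) - t_n}^2
# at w_map, and the precision matrix of the Gaussian approximation (5.166)-(5.167).
def sum_sq_error(w):
    preds = jax.vmap(y, in_axes=(0, None))(X, w)
    return 0.5 * jnp.sum((preds - T) ** 2)

H = jax.hessian(sum_sq_error)(w_map)          # 10 x 10 Hessian of E_D(w)
A = alpha * jnp.eye(w_map.size) + beta * H    # A = alpha*I + beta*H
```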
Similarly, the predictive distribution is obtained by marginalizing with respect
to this posterior distribution
p(t|\mathbf{x},\mathcal{D}) = \int p(t|\mathbf{x},\mathbf{w})\, q(\mathbf{w}|\mathcal{D})\, \mathrm{d}\mathbf{w}. \qquad (5.168)
However, even with the Gaussian approximation to the posterior, this integration is
still analytically intractable due to the nonlinearity of the network function y(x, w)
as a function of w. To make progress, we now assume that the posterior distribution
has small variance compared with the characteristic scales of w over which y(x, w)
is varying. This allows us to make a Taylor series expansion of the network function
around w_MAP and retain only the linear terms
y(\mathbf{x},\mathbf{w}) \simeq y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}) \qquad (5.169)
where we have defined
\mathbf{g} = \nabla_{\mathbf{w}} y(\mathbf{x},\mathbf{w})\big|_{\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}}. \qquad (5.170)
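
In the running sketch, this gradient can be obtained directly by differentiating the network output with respect to the weights at a hypothetical test input x_star.

```python
# Gradient g of (5.170): derivative of the network output with respect to
# the weight vector (argument index 1 of y), evaluated at w_map.
x_star = 0.5
g = jax.grad(y, argnums=1)(x_star, w_map)     # shape (10,)
```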
With this approximation, we now have a linear-Gaussian model with a Gaussian
distribution for p(w) and a Gaussian for p(t|w) whose mean is a linear function of
w of the form
p(t|\mathbf{x},\mathbf{w},\beta) \simeq \mathcal{N}\bigl(t \,\big|\, y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}),\, \beta^{-1}\bigr). \qquad (5.171)
Exercise 5.38 We can therefore make use of the general result (2.115) for the marginal p(t) to give
p(t|\mathbf{x},\mathcal{D},\alpha,\beta) = \mathcal{N}\bigl(t \,\big|\, y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}),\, \sigma^2(\mathbf{x})\bigr) \qquad (5.172)
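
Completing the running sketch, this predictive distribution can be evaluated at the hypothetical input x_star as follows; the expression 1/β + g^T A^{-1} g used for σ²(x) is the variance obtained by applying (2.115) to (5.171) together with the posterior approximation (5.167).

```python
# Gaussian predictive distribution (5.172) at x_star: mean is the network
# output at w_map, variance combines the noise term 1/beta with the
# weight-uncertainty term g^T A^{-1} g.
mean = y(x_star, w_map)
var = 1.0 / beta + g @ jnp.linalg.solve(A, g)
print(f"predictive mean {float(mean):.3f}, std dev {float(jnp.sqrt(var)):.3f}")
```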