form
    \ln p(\mathbf{w}|\mathcal{D}) = -\frac{\alpha}{2}\,\mathbf{w}^{\mathrm{T}}\mathbf{w} - \frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 + \mathrm{const}    (5.165)
which corresponds to a regularized sum-of-squares error function. Assuming for
the moment that α and β are fixed, we can find a maximum of the posterior, which
we denote w_MAP, by standard nonlinear optimization algorithms such as conjugate
gradients, using error backpropagation to evaluate the required derivatives.
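To make this step concrete, the following minimal sketch finds w_MAP for a hypothetical single-hidden-layer tanh regression network by minimizing the negative log posterior of (5.165) with SciPy's conjugate-gradient optimizer, using hand-coded backpropagation for the gradient. The architecture, the toy data X and t, the chosen values of α and β, and the helper names (unpack, forward, grad) are illustrative assumptions, not taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical one-hidden-layer regression network y(x, w) with tanh hidden units.
# Architecture, data, and hyperparameter values are illustrative choices only.
D, M = 1, 8                       # input dimension, number of hidden units
alpha, beta = 0.1, 25.0           # (assumed fixed) prior and noise precisions

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, D))                   # toy inputs
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, 50)   # toy targets

def unpack(w):
    """Split the flat parameter vector into layer weights and biases."""
    W1 = w[:M * D].reshape(M, D)
    b1 = w[M * D:M * D + M]
    W2 = w[M * D + M:M * D + 2 * M]
    b2 = w[-1]
    return W1, b1, W2, b2

def forward(w, X):
    """Network outputs y(x_n, w) and hidden activations for a batch of inputs."""
    W1, b1, W2, b2 = unpack(w)
    A = X @ W1.T + b1             # hidden pre-activations
    Z = np.tanh(A)                # hidden activations
    y = Z @ W2 + b2               # network outputs
    return y, Z

def neg_log_posterior(w):
    """E(w) = alpha/2 w^T w + beta/2 sum_n (y(x_n, w) - t_n)^2, up to a constant."""
    y, _ = forward(w, X)
    return 0.5 * alpha * w @ w + 0.5 * beta * np.sum((y - t) ** 2)

def grad(w):
    """Gradient of E(w) evaluated by error backpropagation."""
    W1, b1, W2, b2 = unpack(w)
    y, Z = forward(w, X)
    delta_out = beta * (y - t)                            # output-layer errors
    delta_hid = (delta_out[:, None] * W2) * (1 - Z ** 2)  # backprop through tanh
    gW2 = Z.T @ delta_out
    gb2 = delta_out.sum()
    gW1 = delta_hid.T @ X
    gb1 = delta_hid.sum(axis=0)
    return alpha * w + np.concatenate([gW1.ravel(), gb1, gW2, [gb2]])

w0 = 0.1 * rng.standard_normal(M * D + 2 * M + 1)
res = minimize(neg_log_posterior, w0, jac=grad, method="CG")  # conjugate gradients
w_map = res.x
```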
Having found a mode w_MAP, we can then build a local Gaussian approximation
by evaluating the matrix of second derivatives of the negative log posterior distribu-
tion. From (5.165), this is given by
    \mathbf{A} = -\nabla\nabla \ln p(\mathbf{w}|\mathcal{D}, \alpha, \beta) = \alpha \mathbf{I} + \beta \mathbf{H}    (5.166)
where H is the Hessian matrix comprising the second derivatives of the sum-of-
squares error function with respect to the components of w. Algorithms for comput-
ing and approximating the Hessian were discussed in Section 5.4. The corresponding
Gaussian approximation to the posterior is then given from (4.134) by
    q(\mathbf{w}|\mathcal{D}) = \mathcal{N}(\mathbf{w} \,|\, \mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}).    (5.167)
Similarly, the predictive distribution is obtained by marginalizing with respect
to this posterior distribution
    p(t|\mathbf{x}, \mathcal{D}) = \int p(t|\mathbf{x}, \mathbf{w}) \, q(\mathbf{w}|\mathcal{D}) \, \mathrm{d}\mathbf{w}.    (5.168)
However, even with the Gaussian approximation to the posterior, this integration is
still analytically intractable due to the nonlinearity of the network function y(x, w)
as a function of w. To make progress, we now assume that the posterior distribution
has small variance compared with the characteristic scales of w over which y(x, w)
is varying. This allows us to make a Taylor series expansion of the network function
around w_MAP and retain only the linear terms
    y(\mathbf{x}, \mathbf{w}) \simeq y(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}} (\mathbf{w} - \mathbf{w}_{\mathrm{MAP}})    (5.169)
where we have defined
    \mathbf{g} = \nabla_{\mathbf{w}} y(\mathbf{x}, \mathbf{w}) \big|_{\mathbf{w} = \mathbf{w}_{\mathrm{MAP}}}.    (5.170)
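Continuing the sketch above (and reusing its forward, w_map, X, alpha, and beta), the matrix A of (5.166) and the covariance of the Gaussian approximation (5.167) might be assembled as follows. For brevity the Hessian H is replaced here by the outer-product approximation discussed in Section 5.4, and the gradient g of (5.170) is evaluated by central finite differences rather than by backpropagation; both are choices made for this illustration, and output_grad is a hypothetical helper.

```python
def output_grad(w, x):
    """g = d y(x, w) / d w at a single input x, by central finite differences."""
    eps = 1e-6
    g = np.empty_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        y_plus = forward(w_plus, x[None, :])[0][0]
        y_minus = forward(w_minus, x[None, :])[0][0]
        g[i] = (y_plus - y_minus) / (2 * eps)
    return g

# Outer-product approximation to the Hessian H of the sum-of-squares error
# (one of the options discussed in Section 5.4), built from per-example gradients.
H = sum(np.outer(b, b) for b in (output_grad(w_map, x) for x in X))
A = alpha * np.eye(w_map.size) + beta * H     # equation (5.166)
A_inv = np.linalg.inv(A)                      # covariance of q(w|D) in (5.167)
```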
With this approximation, we now have a linear-Gaussian model with a Gaussian
distribution for p(w) and a Gaussian for p(t|w) whose mean is a linear function of
w of the form
    p(t|\mathbf{x}, \mathbf{w}, \beta) \simeq \mathcal{N}\bigl( t \,|\, y(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w} - \mathbf{w}_{\mathrm{MAP}}), \, \beta^{-1} \bigr).    (5.171)
Exercise 5.38 We can therefore make use of the general result (2.115) for the marginal p(t) to give
    p(t|\mathbf{x}, \mathcal{D}, \alpha, \beta) = \mathcal{N}\bigl( t \,|\, y(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}), \, \sigma^2(\mathbf{x}) \bigr)    (5.172)
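As a worked application of (2.115) with the linearized likelihood (5.171) and the posterior (5.167), the mean is y(x, w_MAP), as in (5.172), and the variance takes the input-dependent form σ²(x) = β^{-1} + gᵀA^{-1}g, the first term reflecting the intrinsic noise on the target and the second the remaining uncertainty in w. A sketch of the resulting predictive computation, continuing the code above (and reusing its forward, output_grad, w_map, A_inv, and beta), might look as follows; predict and the test input are illustrative.

```python
def predict(x_new):
    """Predictive mean and variance under the linearized Laplace approximation."""
    y_mean = forward(w_map, x_new[None, :])[0][0]   # y(x, w_MAP), the mean in (5.172)
    g = output_grad(w_map, x_new)                   # g of (5.170)
    var = 1.0 / beta + g @ A_inv @ g                # sigma^2(x) = beta^{-1} + g^T A^{-1} g, via (2.115)
    return y_mean, var

mean, var = predict(np.array([0.3]))
print(f"predictive mean {mean:.3f}, predictive std {np.sqrt(var):.3f}")
```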