form
\ln p(\mathbf{w}|\mathcal{D}) = -\frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} - \frac{\beta}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n,\mathbf{w}) - t_n\}^2 + \mathrm{const} \qquad (5.165)
which corresponds to a regularized sum-of-squares error function. Assuming for
the moment that α and β are fixed, we can find a maximum of the posterior, which
we denote w_MAP, by standard nonlinear optimization algorithms such as conjugate
gradients, using error backpropagation to evaluate the required derivatives.
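
As an illustration of this step, the following minimal sketch finds w_MAP for a toy one-dimensional regression problem. The three-unit tanh network, the synthetic data set, the fixed values of α and β, and the names y, neg_log_post and w_map are all assumptions made for the example; plain gradient descent is used in place of conjugate gradients, with automatic differentiation supplying the derivatives that backpropagation would otherwise provide.

```python
import jax
import jax.numpy as jnp

# Hypothetical two-layer network: one input, three tanh hidden units, one
# linear output.  All 10 parameters are packed into a single flat vector w.
def y(x, w):
    W1, b1, W2, b2 = w[0:3], w[3:6], w[6:9], w[9]
    h = jnp.tanh(W1 * x + b1)        # hidden-unit activations
    return jnp.dot(W2, h) + b2       # network output y(x, w)

# Negative log posterior of (5.165), up to an additive constant.
def neg_log_post(w, X, T, alpha, beta):
    preds = jax.vmap(y, in_axes=(0, None))(X, w)
    return 0.5 * alpha * jnp.dot(w, w) + 0.5 * beta * jnp.sum((preds - T) ** 2)

# Illustrative synthetic data set and fixed hyperparameters.
key = jax.random.PRNGKey(0)
X = jnp.linspace(-1.0, 1.0, 20)
T = jnp.sin(3.0 * X) + 0.1 * jax.random.normal(key, (20,))
alpha, beta = 1.0, 10.0

# Plain gradient descent stands in for conjugate gradients; the required
# derivatives come from automatic differentiation of the log posterior.
grad_fn = jax.jit(jax.grad(neg_log_post))
w = 0.1 * jax.random.normal(key, (10,))
for _ in range(10000):
    w = w - 1e-3 * grad_fn(w, X, T, alpha, beta)
w_map = w                            # mode of the posterior, w_MAP
```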
Having found a mode w_MAP, we can then build a local Gaussian approximation
by evaluating the matrix of second derivatives of the negative log posterior distribu-
tion. From (5.165), this is given by
\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w}|\mathcal{D},\alpha,\beta) = \alpha\mathbf{I} + \beta\mathbf{H} \qquad (5.166)
where H is the Hessian matrix comprising the second derivatives of the sum-of-squares
error function with respect to the components of w. Algorithms for computing and
approximating the Hessian were discussed in Section 5.4. The corresponding
Gaussian approximation to the posterior is then given from (4.134) by
q(\mathbf{w}|\mathcal{D}) = \mathcal{N}(\mathbf{w}|\mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}). \qquad (5.167)
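
Continuing the sketch above, this Laplace approximation can be formed around w_map by evaluating the Hessian of the sum-of-squares error. Here the exact Hessian is obtained by automatic differentiation rather than by the approximation schemes of Section 5.4; the names sum_sq_error, H and A are assumptions of the example.

```python
# Exact Hessian of the sum-of-squares error E_D(w) = (1/2) sum_n {y(x_n,w) - t_n}^2
# at w_map, and the precision matrix of the Gaussian approximation (5.166)-(5.167).
def sum_sq_error(w):
    preds = jax.vmap(y, in_axes=(0, None))(X, w)
    return 0.5 * jnp.sum((preds - T) ** 2)

H = jax.hessian(sum_sq_error)(w_map)          # 10 x 10 Hessian of E_D(w)
A = alpha * jnp.eye(w_map.size) + beta * H    # A = alpha*I + beta*H
```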
Similarly, the predictive distribution is obtained by marginalizing with respect
to this posterior distribution
p(t|\mathbf{x},\mathcal{D}) = \int p(t|\mathbf{x},\mathbf{w})\, q(\mathbf{w}|\mathcal{D})\, \mathrm{d}\mathbf{w}. \qquad (5.168)
However, even with the Gaussian approximation to the posterior, this integration is
still analytically intractable due to the nonlinearity of the network function y(x, w)
as a function of w. To make progress, we now assume that the posterior distribution
has small variance compared with the characteristic scales of w over which y(x, w)
is varying. This allows us to make a Taylor series expansion of the network function
around w_MAP and retain only the linear terms
y(\mathbf{x},\mathbf{w}) \simeq y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}) \qquad (5.169)
where we have defined
\mathbf{g} = \nabla_{\mathbf{w}} y(\mathbf{x},\mathbf{w})\big|_{\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}}. \qquad (5.170)
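
In the running sketch, this gradient can be obtained directly by differentiating the network output with respect to the weights at a hypothetical test input x_star.

```python
# Gradient g of (5.170): derivative of the network output with respect to
# the weight vector (argument index 1 of y), evaluated at w_map.
x_star = 0.5
g = jax.grad(y, argnums=1)(x_star, w_map)     # shape (10,)
```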
With this approximation, we now have a linear-Gaussian model with a Gaussian
distribution for p(w) and a Gaussian for p(t|w) whose mean is a linear function of
w of the form
p(t|\mathbf{x},\mathbf{w},\beta) \simeq \mathcal{N}\bigl(t \,\big|\, y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}),\, \beta^{-1}\bigr). \qquad (5.171)
Exercise 5.38 We can therefore make use of the general result (2.115) for the marginal p(t) to give
p(t|\mathbf{x},\mathcal{D},\alpha,\beta) = \mathcal{N}\bigl(t \,\big|\, y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}),\, \sigma^2(\mathbf{x})\bigr) \qquad (5.172)
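
Completing the running sketch, this predictive distribution can be evaluated at the hypothetical input x_star as follows; the expression 1/β + g^T A^{-1} g used for σ²(x) is the variance obtained by applying (2.115) to (5.171) together with the posterior approximation (5.167).

```python
# Gaussian predictive distribution (5.172) at x_star: mean is the network
# output at w_map, variance combines the noise term 1/beta with the
# weight-uncertainty term g^T A^{-1} g.
mean = y(x_star, w_map)
var = 1.0 / beta + g @ jnp.linalg.solve(A, g)
print(f"predictive mean {float(mean):.3f}, std dev {float(jnp.sqrt(var)):.3f}")
```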