Figure 5.22  Illustration of the evidence framework applied to a synthetic two-class data set. The green curve shows the optimal decision boundary, the black curve shows the result of fitting a two-layer network with 8 hidden units by maximum likelihood, and the red curve shows the result of including a regularizer in which $\alpha$ is optimized using the evidence procedure, starting from the initial value $\alpha = 0$. Note that the evidence procedure greatly reduces the over-fitting of the network.
simplest approximation is to assume that the posterior distribution is very narrow and hence make the approximation
\[
p(t \mid \mathbf{x}, \mathcal{D}) \simeq p(t \mid \mathbf{x}, \mathbf{w}_{\mathrm{MAP}}). \tag{5.185}
\]
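As a concrete illustration, the following minimal sketch evaluates this plug-in prediction for a two-layer network with tanh hidden units, matching the architecture of Figure 5.22; the function names and the stand-in MAP weights are hypothetical and only serve to make the example runnable.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def activation(x, W1, b1, w2, b2):
    """Output-unit activation a(x, w) of a two-layer tanh network
    (hypothetical architecture matching Figure 5.22)."""
    z = np.tanh(W1 @ x + b1)      # hidden-unit activations
    return w2 @ z + b2            # pre-sigmoid output activation

# Stand-in MAP weights; in practice these come from MAP training.
rng = np.random.default_rng(0)
D, H = 2, 8                       # input dimension, number of hidden units
W1_map, b1_map = rng.normal(size=(H, D)), np.zeros(H)
w2_map, b2_map = rng.normal(size=H), 0.0

x = np.array([0.5, -1.0])
# Plug-in approximation (5.185): p(t=1 | x, D) ~= sigmoid(a(x, w_MAP)).
p_map = sigmoid(activation(x, W1_map, b1_map, w2_map, b2_map))
```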
We can improve on this, however, by taking account of the variance of the posterior
distribution. In this case, a linear approximation for the network outputs, as was used
in the case of regression, would be inappropriate due to the logistic sigmoid output-
unit activation function that constrains the output to lie in the range $(0, 1)$. Instead, we make a linear approximation for the output unit activation in the form
\[
a(\mathbf{x}, \mathbf{w}) \simeq a_{\mathrm{MAP}}(\mathbf{x}) + \mathbf{b}^{\mathrm{T}}(\mathbf{w} - \mathbf{w}_{\mathrm{MAP}}) \tag{5.186}
\]
where $a_{\mathrm{MAP}}(\mathbf{x}) = a(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}})$ and the vector $\mathbf{b} \equiv \nabla a(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}})$ can be found by backpropagation.
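The gradient $\mathbf{b}$ requires only a single backward pass. The sketch below does this explicitly for the same hypothetical two-layer tanh network as above, stacking the derivatives with respect to all weight groups into one vector; any automatic differentiation tool would give the same result.

```python
import numpy as np

def activation_and_grad(x, W1, b1, w2, b2):
    """Forward pass plus backpropagation of b = grad_w a(x, w) for a
    two-layer tanh network with a single output activation
    (hypothetical setup used for illustration)."""
    z = np.tanh(W1 @ x + b1)                 # hidden-unit activations
    a = w2 @ z + b2                          # output activation a(x, w)
    delta = w2 * (1.0 - z**2)                # backpropagated signal at hidden units
    dW1 = np.outer(delta, x)                 # da/dW1
    db1 = delta                              # da/db1
    dw2 = z                                  # da/dw2
    db2 = 1.0                                # da/db2
    b = np.concatenate([dW1.ravel(), db1, dw2, [db2]])
    return a, b

# Evaluated at the MAP weights this yields a_MAP(x) and b(x) of (5.186):
#   a_map, b_vec = activation_and_grad(x, W1_map, b1_map, w2_map, b2_map)
```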
Because we now have a Gaussian approximation for the posterior distribution
over $\mathbf{w}$, and a model for $a$ that is a linear function of $\mathbf{w}$, we can now appeal to the
results of Section 4.5.2. The distribution of output unit activation values, induced by
the distribution over network weights, is given by
\[
p(a \mid \mathbf{x}, \mathcal{D}) = \int \delta\!\left(a - a_{\mathrm{MAP}}(\mathbf{x}) - \mathbf{b}^{\mathrm{T}}(\mathbf{x})(\mathbf{w} - \mathbf{w}_{\mathrm{MAP}})\right) q(\mathbf{w} \mid \mathcal{D}) \, \mathrm{d}\mathbf{w} \tag{5.187}
\]
where $q(\mathbf{w} \mid \mathcal{D})$ is the Gaussian approximation to the posterior distribution given by (5.167). From Section 4.5.2, we see that this distribution is Gaussian with mean $a_{\mathrm{MAP}} \equiv a(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}})$ and variance
\[
\sigma_a^2(\mathbf{x}) = \mathbf{b}^{\mathrm{T}}(\mathbf{x}) \mathbf{A}^{-1} \mathbf{b}(\mathbf{x}). \tag{5.188}
\]
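In a numerical implementation it is usually preferable to solve a linear system in $\mathbf{A}$ rather than form its inverse. A minimal sketch, assuming the Hessian $\mathbf{A}$ of the negative log posterior at $\mathbf{w}_{\mathrm{MAP}}$ (the precision matrix of the Gaussian approximation) is available as a dense array:

```python
import numpy as np

def output_variance(b, A):
    """Variance of the output activation, equation (5.188):
    sigma_a^2(x) = b(x)^T A^{-1} b(x).
    A is the Hessian of the negative log posterior at w_MAP, i.e. the
    precision matrix of the Gaussian posterior approximation q(w | D)."""
    v = np.linalg.solve(A, b)     # solve A v = b instead of inverting A
    return float(b @ v)
```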
Finally, to obtain the predictive distribution, we must marginalize over $a$ using
\[
p(t = 1 \mid \mathbf{x}, \mathcal{D}) = \int \sigma(a) \, p(a \mid \mathbf{x}, \mathcal{D}) \, \mathrm{d}a. \tag{5.189}
\]
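Because $p(a \mid \mathbf{x}, \mathcal{D})$ is a one-dimensional Gaussian, this integral is straightforward to evaluate numerically. The sketch below uses Gauss-Hermite quadrature; the function name and the number of quadrature points are illustrative choices, not part of the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_probability(a_map, var_a, n_points=50):
    """Approximate the integral of sigmoid(a) * N(a | a_map, var_a) da,
    equation (5.189), by Gauss-Hermite quadrature."""
    # hermegauss integrates against exp(-t^2 / 2); rescale to N(a_map, var_a).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    a = a_map + np.sqrt(var_a) * nodes
    return float(np.sum(weights * sigmoid(a)) / np.sqrt(2.0 * np.pi))
```

A common closed-form alternative, following the probit approximation of Section 4.5.2, replaces the integral by $\sigma\!\left(\kappa(\sigma_a^2)\, a_{\mathrm{MAP}}\right)$ with $\kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}$.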