
[Figure 5.23: two panels of contour plots over the two-dimensional input space (horizontal axis from −2 to 2, vertical axis from −2 to 3); see the caption below.]
Figure 5.23 An illustration of the Laplace approximation for a Bayesian neural network having 8 hidden units
with ‘tanh’ activation functions and a single logistic-sigmoid output unit. The weight parameters were found using
scaled conjugate gradients, and the hyperparameter α was optimized using the evidence framework. On the left
is the result of using the simple approximation (5.185) based on a point estimate w_MAP of the parameters,
in which the green curve shows the y = 0.5 decision boundary, and the other contours correspond to output
probabilities of y = 0.1, 0.3, 0.7, and 0.9. On the right is the corresponding result obtained using (5.190). Note
that the effect of marginalization is to spread out the contours and to make the predictions less confident, so
that at each input point x, the posterior probabilities are shifted towards 0.5, while the y = 0.5 contour itself is
unaffected.


The convolution of a Gaussian with a logistic sigmoid is intractable. We therefore
apply the approximation (4.153) to (5.189), giving

p(t = 1 | \mathbf{x}, \mathcal{D}) = \sigma\bigl( \kappa(\sigma_a^2)\, \mathbf{b}^{\mathrm{T}} \mathbf{w}_{\mathrm{MAP}} \bigr)    (5.190)

where κ(·) is defined by (4.154). Recall that both σ²_a and b are functions of x.
Figure 5.23 shows an example of this framework applied to the synthetic classification
data set described in Appendix A.
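
As a concrete illustration of (5.190), here is a minimal Python sketch (not code from the text; the function and variable names are illustrative). It evaluates the approximate predictive probability given the MAP output activation a_MAP = bᵀw_MAP and the variance σ²_a of the output-unit activation obtained from the Laplace approximation, with κ(σ²) = (1 + πσ²/8)^(−1/2) as in (4.154):

import numpy as np

def sigmoid(a):
    # Logistic sigmoid, as in (5.191).
    return 1.0 / (1.0 + np.exp(-a))

def kappa(sigma2):
    # kappa(sigma^2) = (1 + pi * sigma^2 / 8)^(-1/2), cf. (4.154).
    return 1.0 / np.sqrt(1.0 + np.pi * sigma2 / 8.0)

def predictive_probability(a_map, sigma2_a):
    # Approximate p(t = 1 | x, D) from (5.190).
    #   a_map    : MAP output activation b^T w_MAP (assumed precomputed)
    #   sigma2_a : variance of the activation under the Laplace posterior
    return sigmoid(kappa(sigma2_a) * a_map)

# Marginalization pulls confident predictions towards 0.5, the behaviour
# seen in the right-hand panel of Figure 5.23.
print(predictive_probability(a_map=3.0, sigma2_a=0.0))   # ~0.95 (point estimate)
print(predictive_probability(a_map=3.0, sigma2_a=4.0))   # ~0.87 (less confident)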

Exercises


5.1 ( ) Consider a two-layer network function of the form (5.7) in which the hidden-
unit nonlinear activation functions g(·) are given by logistic sigmoid functions of the
form

\sigma(a) = \{1 + \exp(-a)\}^{-1}.    (5.191)

Show that there exists an equivalent network, which computes exactly the same func-
tion, but with hidden-unit activation functions given by tanh(a), where the tanh func-
tion is defined by (5.59). Hint: first find the relation between σ(a) and tanh(a), and
then show that the parameters of the two networks differ by linear transformations.
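
As a quick numerical check of the relation the hint points to (a sketch only, not a full solution; the Python code below uses arbitrary network sizes and random parameters), one can exploit σ(a) = (tanh(a/2) + 1)/2: halving the first-layer weights and biases, halving the second-layer weights, and absorbing the constant 1/2 terms into the output biases yields a tanh network with identical outputs.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Arbitrary two-layer network with logistic-sigmoid hidden units:
#   y(x) = W2 @ sigmoid(W1 @ x + b1) + b2
D, M, K = 3, 5, 2                                  # input, hidden, output sizes
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)

def net_logistic(x):
    return W2 @ sigmoid(W1 @ x + b1) + b2

# Equivalent tanh network, using sigmoid(a) = (tanh(a/2) + 1) / 2.
W1t, b1t = W1 / 2.0, b1 / 2.0                      # halve first-layer parameters
W2t = W2 / 2.0                                     # halve second-layer weights
b2t = b2 + 0.5 * W2.sum(axis=1)                    # absorb the constant 1/2 term

def net_tanh(x):
    return W2t @ np.tanh(W1t @ x + b1t) + b2t

x = rng.normal(size=D)
print(np.allclose(net_logistic(x), net_tanh(x)))   # True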

5.2 ( ) www Show that maximizing the likelihood function under the conditional
distribution (5.16) for a multioutput neural network is equivalent to minimizing the
sum-of-squares error function (5.11).
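
As a numerical sanity check rather than a proof (the toy model and data below are arbitrary, and the noise precision β of the Gaussian conditional distribution is fixed to 1 for simplicity), the negative log likelihood and the sum-of-squares error differ only by an additive constant, so they are minimized by the same parameters.

import numpy as np

rng = np.random.default_rng(1)

# Toy data and a toy "network" y(x, w) = w * x; the argument depends only on
# the Gaussian form of the conditional distribution, not on the model itself.
N = 50
x = rng.normal(size=N)
t = 2.0 * x + 0.1 * rng.normal(size=N)

def sum_of_squares(w):
    # E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2, as in (5.11).
    return 0.5 * np.sum((w * x - t) ** 2)

def neg_log_likelihood(w, beta=1.0):
    # -ln p(t | x, w, beta) for a Gaussian conditional with precision beta.
    return beta * sum_of_squares(w) + 0.5 * N * np.log(2.0 * np.pi / beta)

for w in (0.0, 1.0, 2.0, 3.0):
    print(w, neg_log_likelihood(w) - sum_of_squares(w))   # constant offset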