Pattern Recognition and Machine Learning


this framework that arise when it is applied to classification. Here we shall consider a network having a single logistic sigmoid output corresponding to a two-class classification problem. The extension to networks with multiclass softmax outputs is straightforward (Exercise 5.40). We shall build extensively on the analogous results for linear classification models discussed in Section 4.5, and so we encourage the reader to familiarize themselves with that material before studying this section.
The log likelihood function for this model is given by


\[
\ln p(\mathcal{D}\,|\,\mathbf{w}) = \sum_{n=1}^{N} \bigl\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \bigr\}
\tag{5.181}
\]
where $t_n \in \{0, 1\}$ are the target values, and $y_n \equiv y(\mathbf{x}_n, \mathbf{w})$. Note that there is no hyperparameter $\beta$, because the data points are assumed to be correctly labelled. As before, the prior is taken to be an isotropic Gaussian of the form (5.162).
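As a concrete illustration, the log likelihood (5.181) takes only a few lines of numpy. This is a minimal sketch, not from the book; the function name and the clipping constant are my own choices.

```python
import numpy as np

def log_likelihood(t, y, eps=1e-12):
    """Two-class cross-entropy log likelihood, Eq. (5.181).

    t : array of 0/1 target labels, shape (N,)
    y : network outputs y_n = y(x_n, w) from the logistic sigmoid, shape (N,)
    """
    y = np.clip(y, eps, 1.0 - eps)  # guard against log(0) for saturated outputs
    return np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```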
The first stage in applying the Laplace framework to this model is to initialize the hyperparameter $\alpha$, and then to determine the parameter vector $\mathbf{w}$ by maximizing the log posterior distribution. This is equivalent to minimizing the regularized error function
\[
E(\mathbf{w}) = -\ln p(\mathcal{D}\,|\,\mathbf{w}) + \frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}
\tag{5.182}
\]

and can be achieved using error backpropagation combined with standard optimization algorithms, as discussed in Section 5.3.
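Below is a minimal sketch of this step for a toy one-hidden-layer tanh network with a logistic sigmoid output, trained by plain gradient descent on the regularized error (5.182). The architecture, learning rate, iteration count, and all identifiers are illustrative assumptions (biases are omitted for brevity); the text itself only prescribes backpropagation combined with a standard optimizer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_map(X, t, n_hidden=8, alpha=0.1, lr=0.05, n_iter=5000, seed=0):
    """Find w_MAP by minimizing E(w) = -ln p(D|w) + (alpha/2) w^T w, Eq. (5.182),
    for a toy one-hidden-layer tanh network with a logistic sigmoid output.
    All sizes and names here are illustrative, not taken from the book."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W1 = rng.normal(scale=0.5, size=(D, n_hidden))   # input-to-hidden weights
    W2 = rng.normal(scale=0.5, size=(n_hidden,))     # hidden-to-output weights
    for _ in range(n_iter):
        # forward pass
        Z = np.tanh(X @ W1)            # hidden activations, shape (N, H)
        a = Z @ W2                     # output pre-activations, shape (N,)
        y = sigmoid(a)
        # backpropagation: d(-ln p)/da_n = y_n - t_n for cross-entropy + sigmoid
        delta = y - t
        gW2 = Z.T @ delta + alpha * W2
        dZ = np.outer(delta, W2) * (1.0 - Z ** 2)
        gW1 = X.T @ dZ + alpha * W1
        # gradient-descent step (dividing by N just rescales the step size)
        W1 -= lr * gW1 / N
        W2 -= lr * gW2 / N
    return W1, W2
```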
Having found a solution $\mathbf{w}_{\mathrm{MAP}}$ for the weight vector, the next step is to evaluate the Hessian matrix $\mathbf{H}$ comprising the second derivatives of the negative log likelihood function. This can be done, for instance, using the exact method of Section 5.4.5, or using the outer product approximation given by (5.85). The second derivatives of the negative log posterior can again be written in the form (5.166), and the Gaussian approximation to the posterior is then given by (5.167).
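A sketch of this step, continuing the toy network of the previous snippet: the Hessian is approximated by an outer-product form of the kind referenced above (cf. (5.85)), $\mathbf{H} \approx \sum_n y_n (1 - y_n)\, \mathbf{b}_n \mathbf{b}_n^{\mathrm{T}}$ with $\mathbf{b}_n = \nabla_{\mathbf{w}} a_n$, and the Gaussian posterior covariance follows from $\mathbf{A} = \mathbf{H} + \alpha \mathbf{I}$. Function names and the flattened parameter ordering are my own conventions.

```python
import numpy as np

def outer_product_hessian(X, y, W1, W2):
    """Outer-product approximation of the Hessian of -ln p(D|w):
    H ~= sum_n y_n (1 - y_n) b_n b_n^T, with b_n = grad_w a_n,
    for the toy one-hidden-layer network above (biases omitted)."""
    N, D = X.shape
    n_params = W1.size + W2.size
    H = np.zeros((n_params, n_params))
    Z = np.tanh(X @ W1)                                   # hidden activations
    for n in range(N):
        # gradient of the output pre-activation a_n w.r.t. all weights
        g_W2 = Z[n]                                       # shape (H,)
        g_W1 = np.outer(X[n], W2 * (1.0 - Z[n] ** 2))     # shape (D, H)
        b = np.concatenate([g_W1.ravel(), g_W2])          # flattened [W1, W2]
        H += y[n] * (1.0 - y[n]) * np.outer(b, b)
    return H

def posterior_covariance(H, alpha):
    """Laplace (Gaussian) approximation to the posterior: A = H + alpha*I, S_N = A^{-1}."""
    A = H + alpha * np.eye(H.shape[0])
    return np.linalg.inv(A)
```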
To optimize the hyperparameter $\alpha$, we again maximize the marginal likelihood, which is easily shown (Exercise 5.41) to take the form


\[
\ln p(\mathcal{D}\,|\,\alpha) \simeq -E(\mathbf{w}_{\mathrm{MAP}}) - \frac{1}{2} \ln |\mathbf{A}| + \frac{W}{2} \ln \alpha + \text{const}
\tag{5.183}
\]

where the regularized error function is defined by

\[
E(\mathbf{w}_{\mathrm{MAP}}) = -\sum_{n=1}^{N} \bigl\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \bigr\} + \frac{\alpha}{2} \mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathbf{w}_{\mathrm{MAP}}
\tag{5.184}
\]

in which $y_n \equiv y(\mathbf{x}_n, \mathbf{w}_{\mathrm{MAP}})$. Maximizing this evidence function with respect to $\alpha$ again leads to the re-estimation equation given by (5.178).
The use of the evidence procedure to determine $\alpha$ is illustrated in Figure 5.22 for the synthetic two-dimensional data discussed in Appendix A.
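The evidence evaluation and the $\alpha$ update can be sketched as follows, assuming $\mathbf{H}$ and $\mathbf{w}_{\mathrm{MAP}}$ are available as numpy arrays. Following the regression treatment of Section 5.7.2, the re-estimation is taken here to be $\gamma = \sum_i \lambda_i / (\alpha + \lambda_i)$ and $\alpha \leftarrow \gamma / \mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathbf{w}_{\mathrm{MAP}}$, with $\lambda_i$ the eigenvalues of $\mathbf{H}$; the additive constant in (5.183) is dropped.

```python
import numpy as np

def log_evidence(E_map, A, alpha, W):
    """Log marginal likelihood, Eq. (5.183), up to an additive constant:
    ln p(D|alpha) ~= -E(w_MAP) - (1/2) ln|A| + (W/2) ln(alpha) + const."""
    sign, logdet = np.linalg.slogdet(A)
    return -E_map - 0.5 * logdet + 0.5 * W * np.log(alpha)

def reestimate_alpha(H, w_map, alpha):
    """One step of the alpha re-estimation (cf. (5.178)):
    gamma = sum_i lambda_i / (alpha + lambda_i),  alpha_new = gamma / (w_MAP^T w_MAP),
    where lambda_i are the eigenvalues of the Hessian H of -ln p(D|w)."""
    lam = np.linalg.eigvalsh(H)
    gamma = np.sum(lam / (alpha + lam))
    return gamma / (w_map @ w_map)
```

As in the regression case, this update is applied alternately with re-optimization of $\mathbf{w}_{\mathrm{MAP}}$ until both have stabilized.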
Finally, we need the predictive distribution, which is defined by (5.168). Again,
this integration is intractable due to the nonlinearity of the network function. The