
to the posterior distribution (Hinton and van Camp, 1993) and also using a full-
covariance Gaussian (Barber and Bishop, 1998a; Barber and Bishop, 1998b). The
most complete treatment, however, has been based on the Laplace approximation
(MacKay, 1992c; MacKay, 1992b) and forms the basis for the discussion given here.
We will approximate the posterior distribution by a Gaussian, centred at a mode of
the true posterior. Furthermore, we shall assume that the covariance of this Gaus-
sian is small so that the network function is approximately linear with respect to the
parameters over the region of parameter space for which the posterior probability is
significantly nonzero. With these two approximations, we will obtain models that
are analogous to the linear regression and classification models discussed in earlier
chapters and so we can exploit the results obtained there. We can then make use of
the evidence framework to provide point estimates for the hyperparameters and to
compare alternative models (for example, networks having different numbers of hid-
den units). To start with, we shall discuss the regression case and then later consider
the modifications needed for solving classification tasks.

5.7.1 Posterior parameter distribution


Consider the problem of predicting a single continuous target variable t from
a vector x of inputs (the extension to multiple targets is straightforward). We shall
suppose that the conditional distribution p(t|x) is Gaussian, with an x-dependent
mean given by the output of a neural network model y(x, w), and with precision
(inverse variance) β

p(t|x, w, β) = N(t | y(x, w), β^{-1}).    (5.161)

Similarly, we shall choose a prior distribution over the weights w that is Gaussian of
the form

p(w|α) = N(w | 0, α^{-1} I).    (5.162)
For an i.i.d. data set of N observations x_1, ..., x_N, with a corresponding set of target
values D = {t_1, ..., t_N}, the likelihood function is given by

p(D|w, β) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β^{-1})    (5.163)
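
To make these quantities concrete, the following is a minimal NumPy sketch of the prior (5.162) and the likelihood (5.163). The one-hidden-layer tanh architecture, the flattening of all weights and biases into a single vector w, and the function names are illustrative assumptions; the text only requires some network function y(x, w).

```python
import numpy as np

# Illustrative architecture (an assumption, not prescribed by the text).
H, D_in = 3, 1                          # hidden units, input dimension
W_total = H * D_in + 2 * H + 1          # total number of weights and biases

def unpack(w):
    """Split the flat parameter vector w into layer parameters."""
    W1 = w[:H * D_in].reshape(H, D_in)
    b1 = w[H * D_in:H * D_in + H]
    W2 = w[H * D_in + H:H * D_in + 2 * H]
    b2 = w[-1]
    return W1, b1, W2, b2

def y(x, w):
    """Network output y(x, w) for inputs x of shape (N, D_in)."""
    W1, b1, W2, b2 = unpack(w)
    return np.tanh(x @ W1.T + b1) @ W2 + b2

def log_prior(w, alpha):
    """log p(w | alpha) for the prior (5.162), dropping constant 2*pi terms."""
    return 0.5 * len(w) * np.log(alpha) - 0.5 * alpha * (w @ w)

def log_likelihood(w, x, t, beta):
    """log p(D | w, beta) from (5.161) and (5.163), dropping 2*pi terms."""
    resid = t - y(x, w)
    return 0.5 * len(t) * np.log(beta) - 0.5 * beta * (resid @ resid)
```

In this sketch the unnormalized log posterior of (5.164) below is simply log_prior(w, alpha) + log_likelihood(w, x, t, beta).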

and so the resulting posterior distribution is then

p(w|D, α, β) ∝ p(w|α) p(D|w, β)    (5.164)

which, as a consequence of the nonlinear dependence of y(x, w) on w, will be non-
Gaussian.
We can find a Gaussian approximation to the posterior distribution by using the
Laplace approximation. To do this, we must first find a (local) maximum of the
posterior, and this must be done using iterative numerical optimization. As usual, it
is convenient to maximize the logarithm of the posterior, which can be written in the
form

ln p(w|D, α, β) = −(α/2) w^T w − (β/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}^2 + const    (5.165)

where 'const' denotes terms independent of w.
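
Continuing the sketch above, a mode of the posterior can be found by minimizing the negative of (5.165), which is the familiar regularized sum-of-squares error. The use of scipy.optimize.minimize and the toy data and hyperparameter values are purely illustrative choices standing in for the gradient-based training procedures discussed earlier in this chapter.

```python
from scipy.optimize import minimize

# Reuses np, y, D_in and W_total from the previous sketch.

def neg_log_posterior(w, x, t, alpha, beta):
    """Negative log of (5.164) up to an additive constant, i.e. (5.165):
    (beta/2) * sum_n (y(x_n, w) - t_n)^2 + (alpha/2) * w.w"""
    resid = t - y(x, w)
    return 0.5 * beta * (resid @ resid) + 0.5 * alpha * (w @ w)

# Toy regression data and hyperparameters (illustrative values only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(30, D_in))
t = np.sin(2.0 * np.pi * x[:, 0]) + rng.normal(0.0, 0.1, size=30)
alpha, beta = 0.5, 100.0

# Iterative numerical optimization from a random starting point; since the
# posterior is multimodal, different starts may converge to different modes.
w0 = rng.normal(0.0, 1.0, size=W_total)
res = minimize(neg_log_posterior, w0, args=(x, t, alpha, beta), method="BFGS")
w_map = res.x                           # a (local) mode of the posterior
```

The Laplace approximation then replaces the true posterior by a Gaussian centred at this mode, with inverse covariance given by the Hessian of neg_log_posterior evaluated there, which is how the development continues in the text.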