Pattern Recognition and Machine Learning


where the input-dependent variance is given by

\[
\sigma^2(\mathbf{x}) = \beta^{-1} + \mathbf{g}^{\mathrm{T}} \mathbf{A}^{-1} \mathbf{g}. \tag{5.173}
\]

We see that the predictive distribution $p(t|\mathbf{x},\mathcal{D})$ is a Gaussian whose mean is given
by the network function $y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$ with the parameters set to their MAP values. The
variance has two terms, the first of which arises from the intrinsic noise on the target
variable, whereas the second is an $\mathbf{x}$-dependent term that expresses the uncertainty
in the interpolant due to the uncertainty in the model parameters $\mathbf{w}$. This should
be compared with the corresponding predictive distribution for the linear regression
model, given by (3.58) and (3.59).
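As a concrete illustration, (5.173) can be evaluated in a few lines of NumPy. The sketch below is ours, not part of the text: it assumes the gradient $\mathbf{g} = \nabla_{\mathbf{w}} y(\mathbf{x},\mathbf{w})$ and the Hessian $\mathbf{A}$ of the regularized error have already been computed at $\mathbf{w}_{\mathrm{MAP}}$, and the function name is hypothetical.

```python
import numpy as np

def predictive_variance(g, A, beta):
    """Input-dependent predictive variance of (5.173):
    sigma^2(x) = beta^{-1} + g^T A^{-1} g.

    g    : gradient of the network output y(x, w) w.r.t. w, at w_MAP
    A    : Hessian of the regularized error, evaluated at w_MAP
    beta : noise precision
    """
    # Solve A v = g rather than forming A^{-1} explicitly
    # (cheaper and numerically more stable).
    v = np.linalg.solve(A, g)
    return 1.0 / beta + g @ v
```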

5.7.2 Hyperparameter optimization


So far, we have assumed that the hyperparameters $\alpha$ and $\beta$ are fixed and known.
We can make use of the evidence framework, discussed in Section 3.5, together with
the Gaussian approximation to the posterior obtained using the Laplace approxima-
tion, to obtain a practical procedure for choosing the values of such hyperparameters.
The marginal likelihood, or evidence, for the hyperparameters is obtained by
integrating over the network weights

\[
p(\mathcal{D}|\alpha,\beta) = \int p(\mathcal{D}|\mathbf{w},\beta)\, p(\mathbf{w}|\alpha)\, \mathrm{d}\mathbf{w}. \tag{5.174}
\]

This is easily evaluated by making use of the Laplace approximation result (4.135)
(Exercise 5.39). Taking logarithms then gives


\[
\ln p(\mathcal{D}|\alpha,\beta) \simeq -E(\mathbf{w}_{\mathrm{MAP}}) - \frac{1}{2}\ln|\mathbf{A}| + \frac{W}{2}\ln\alpha + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) \tag{5.175}
\]

where $W$ is the total number of parameters in $\mathbf{w}$, and the regularized error function
is defined by

\[
E(\mathbf{w}_{\mathrm{MAP}}) = \frac{\beta}{2}\sum_{n=1}^{N}\bigl\{y(\mathbf{x}_n,\mathbf{w}_{\mathrm{MAP}}) - t_n\bigr\}^2 + \frac{\alpha}{2}\,\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}}. \tag{5.176}
\]

We see that this takes the same form as the corresponding result (3.86) for the linear
regression model.
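
The quantities (5.175) and (5.176) translate directly into code. The following NumPy sketch is illustrative only: the function names are hypothetical, and we assume the network predictions, the Hessian $\mathbf{A}$, and $\mathbf{w}_{\mathrm{MAP}}$ are available from a separate training step.

```python
import numpy as np

def regularized_error(y_pred, t, w_map, alpha, beta):
    """Regularized sum-of-squares error E(w_MAP) of (5.176)."""
    return (0.5 * beta * np.sum((y_pred - t) ** 2)
            + 0.5 * alpha * (w_map @ w_map))

def log_evidence(E_map, A, alpha, beta, N, W):
    """Laplace approximation (5.175) to ln p(D|alpha, beta).

    E_map : regularized error E(w_MAP) from (5.176)
    A     : Hessian of the regularized error at w_MAP
    N, W  : number of data points / number of weights
    """
    _, logdet_A = np.linalg.slogdet(A)  # numerically stable ln|A|
    return (-E_map - 0.5 * logdet_A
            + 0.5 * W * np.log(alpha)
            + 0.5 * N * np.log(beta)
            - 0.5 * N * np.log(2.0 * np.pi))
```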
In the evidence framework, we make point estimates for $\alpha$ and $\beta$ by maximizing
$\ln p(\mathcal{D}|\alpha,\beta)$. Consider first the maximization with respect to $\alpha$, which can be done
by analogy with the linear regression case discussed in Section 3.5.2. We first define
the eigenvalue equation
\[
\beta \mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{5.177}
\]
where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-
squares error function, evaluated at $\mathbf{w} = \mathbf{w}_{\mathrm{MAP}}$. By analogy with (3.92), we obtain

\[
\alpha = \frac{\gamma}{\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}}} \tag{5.178}
\]
where $\gamma$ is the effective number of well-determined parameters, defined as in (3.91) by $\gamma = \sum_i \lambda_i/(\alpha + \lambda_i)$.
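
A single re-estimation step for $\alpha$ might be sketched as follows; the function name is hypothetical, and we assume the Hessian $\mathbf{H}$ of the sum-of-squares error at $\mathbf{w}_{\mathrm{MAP}}$ is available. In practice this update is alternated with re-estimation of $\mathbf{w}_{\mathrm{MAP}}$ until convergence.

```python
import numpy as np

def update_alpha(H, w_map, alpha, beta):
    """One evidence-framework re-estimation step for alpha, (5.177)-(5.178).

    H : Hessian of the (unregularized) sum-of-squares error at w_MAP
    """
    lam = np.linalg.eigvalsh(beta * H)    # eigenvalues of beta*H, as in (5.177)
    gamma = np.sum(lam / (alpha + lam))   # effective number of parameters
    return gamma / (w_map @ w_map)        # (5.178)
```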