Pattern Recognition and Machine Learning


where the input-dependent variance is given by

\[
\sigma^2(\mathbf{x}) = \beta^{-1} + \mathbf{g}^{\mathrm{T}} \mathbf{A}^{-1} \mathbf{g}. \tag{5.173}
\]

We see that the predictive distribution $p(t|\mathbf{x},\mathcal{D})$ is a Gaussian whose mean is given
by the network function $y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$ with the parameters set to their MAP values. The
variance has two terms, the first of which arises from the intrinsic noise on the target
variable, whereas the second is an $\mathbf{x}$-dependent term that expresses the uncertainty
in the interpolant due to the uncertainty in the model parameters $\mathbf{w}$. This should
be compared with the corresponding predictive distribution for the linear regression
model, given by (3.58) and (3.59).
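As a concrete illustration, (5.173) can be evaluated in a few lines of NumPy. The sketch below is ours, not part of the text: it assumes the gradient $\mathbf{g} = \nabla_{\mathbf{w}} y(\mathbf{x},\mathbf{w})$ and the Hessian $\mathbf{A}$ of the regularized error have already been computed at $\mathbf{w}_{\mathrm{MAP}}$, and the function name is hypothetical.

```python
import numpy as np

def predictive_variance(g, A, beta):
    """Input-dependent predictive variance of (5.173):
    sigma^2(x) = beta^{-1} + g^T A^{-1} g.

    g    : gradient of the network output y(x, w) w.r.t. w, at w_MAP
    A    : Hessian of the regularized error, evaluated at w_MAP
    beta : noise precision
    """
    # Solve A v = g rather than forming A^{-1} explicitly
    # (cheaper and numerically more stable).
    v = np.linalg.solve(A, g)
    return 1.0 / beta + g @ v
```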

5.7.2 Hyperparameter optimization


So far, we have assumed that the hyperparameters $\alpha$ and $\beta$ are fixed and known.
We can make use of the evidence framework, discussed in Section 3.5, together with
the Gaussian approximation to the posterior obtained using the Laplace approxima-
tion, to obtain a practical procedure for choosing the values of such hyperparameters.
The marginal likelihood, or evidence, for the hyperparameters is obtained by
integrating over the network weights

\[
p(\mathcal{D}|\alpha,\beta) = \int p(\mathcal{D}|\mathbf{w},\beta)\, p(\mathbf{w}|\alpha)\, \mathrm{d}\mathbf{w}. \tag{5.174}
\]

This is easily evaluated by making use of the Laplace approximation result (4.135)
(Exercise 5.39). Taking logarithms then gives


\[
\ln p(\mathcal{D}|\alpha,\beta) \simeq -E(\mathbf{w}_{\mathrm{MAP}}) - \frac{1}{2}\ln|\mathbf{A}| + \frac{W}{2}\ln\alpha + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) \tag{5.175}
\]

where $W$ is the total number of parameters in $\mathbf{w}$, and the regularized error function
is defined by

\[
E(\mathbf{w}_{\mathrm{MAP}}) = \frac{\beta}{2}\sum_{n=1}^{N}\bigl\{y(\mathbf{x}_n,\mathbf{w}_{\mathrm{MAP}}) - t_n\bigr\}^2 + \frac{\alpha}{2}\,\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}}. \tag{5.176}
\]

We see that this takes the same form as the corresponding result (3.86) for the linear
regression model.
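
The quantities (5.175) and (5.176) translate directly into code. The following NumPy sketch is illustrative only: the function names are hypothetical, and we assume the network predictions, the Hessian $\mathbf{A}$, and $\mathbf{w}_{\mathrm{MAP}}$ are available from a separate training step.

```python
import numpy as np

def regularized_error(y_pred, t, w_map, alpha, beta):
    """Regularized sum-of-squares error E(w_MAP) of (5.176)."""
    return (0.5 * beta * np.sum((y_pred - t) ** 2)
            + 0.5 * alpha * (w_map @ w_map))

def log_evidence(E_map, A, alpha, beta, N, W):
    """Laplace approximation (5.175) to ln p(D|alpha, beta).

    E_map : regularized error E(w_MAP) from (5.176)
    A     : Hessian of the regularized error at w_MAP
    N, W  : number of data points / number of weights
    """
    _, logdet_A = np.linalg.slogdet(A)  # numerically stable ln|A|
    return (-E_map - 0.5 * logdet_A
            + 0.5 * W * np.log(alpha)
            + 0.5 * N * np.log(beta)
            - 0.5 * N * np.log(2.0 * np.pi))
```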
In the evidence framework, we make point estimates for $\alpha$ and $\beta$ by maximizing
$\ln p(\mathcal{D}|\alpha,\beta)$. Consider first the maximization with respect to $\alpha$, which can be done
by analogy with the linear regression case discussed in Section 3.5.2. We first define
the eigenvalue equation
\[
\beta \mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{5.177}
\]
where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-
squares error function, evaluated at $\mathbf{w} = \mathbf{w}_{\mathrm{MAP}}$. By analogy with (3.92), we obtain

\[
\alpha = \frac{\gamma}{\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}}} \tag{5.178}
\]
where $\gamma$ is the effective number of well-determined parameters, defined as in (3.91) by $\gamma = \sum_i \lambda_i/(\alpha + \lambda_i)$.
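
A single re-estimation step for $\alpha$ might be sketched as follows; the function name is hypothetical, and we assume the Hessian $\mathbf{H}$ of the sum-of-squares error at $\mathbf{w}_{\mathrm{MAP}}$ is available. In practice this update is alternated with re-estimation of $\mathbf{w}_{\mathrm{MAP}}$ until convergence.

```python
import numpy as np

def update_alpha(H, w_map, alpha, beta):
    """One evidence-framework re-estimation step for alpha, (5.177)-(5.178).

    H : Hessian of the (unregularized) sum-of-squares error at w_MAP
    """
    lam = np.linalg.eigvalsh(beta * H)    # eigenvalues of beta*H, as in (5.177)
    gamma = np.sum(lam / (alpha + lam))   # effective number of parameters
    return gamma / (w_map @ w_map)        # (5.178)
```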