Pattern Recognition and Machine Learning

5.5. Regularization in Neural Networks 257

[Figure 5.9: three panels showing the fitted network functions for M = 1, M = 3, and M = 10 hidden units, each plotted over the input range (0, 1) with outputs in (−1, 1).]

Figure 5.9 Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The
graphs show the result of fitting networks having M = 1, 3 and 10 hidden units, respectively, by minimizing a
sum-of-squares error function using a scaled conjugate-gradient algorithm.

of the form

    Ẽ(w) = E(w) + (λ/2) wᵀw.                         (5.112)

This regularizer is also known as weight decay and has been discussed at length
in Chapter 3. The effective model complexity is then determined by the choice of
the regularization coefficient λ. As we have seen previously, this regularizer can be
interpreted as the negative logarithm of a zero-mean Gaussian prior distribution over
the weight vector w.
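The regularized error (5.112) is straightforward to compute directly. Below is a minimal NumPy sketch; the quadratic `E` used here is a hypothetical stand-in for whatever unregularized error function the network actually minimizes:

```python
import numpy as np

def regularized_error(w, E, lam):
    """Weight decay: E~(w) = E(w) + (lambda/2) w^T w, as in (5.112)."""
    return E(w) + 0.5 * lam * (w @ w)

# Illustrative unregularized error: squared distance to a fixed target vector.
target = np.array([1.0, -2.0, 0.5])
E = lambda w: 0.5 * np.sum((w - target) ** 2)

w = np.array([1.0, -2.0, 0.5])
print(regularized_error(w, E, lam=0.0))   # -> 0.0: w fits exactly, no penalty
print(regularized_error(w, E, lam=0.1))   # -> 0.2625: penalty 0.05 * (1 + 4 + 0.25)
```

Increasing λ pulls the minimizer of the regularized error away from the unregularized optimum and toward the origin, which is how the single coefficient λ controls effective model complexity.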

5.5.1 Consistent Gaussian priors

One of the limitations of simple weight decay in the form (5.112) is that it is
inconsistent with certain scaling properties of network mappings. To illustrate this,
consider a multilayer perceptron network having two layers of weights and linear
output units, which performs a mapping from a set of input variables {xi} to a set
of output variables {yk}. The activations of the hidden units in the first hidden layer
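The scaling inconsistency can be seen concretely in a small numerical sketch (my own parameterization, not the book's notation): rescaling the inputs by a factor a is absorbed exactly by dividing the first-layer weights by a, so the network mapping is unchanged, yet the simple penalty wᵀw assigns the two equivalent parameterizations different values.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # first-layer weights, biases
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # linear output layer

def net(x, W1, b1, W2, b2):
    """Two-layer network with tanh hidden units and linear outputs."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = np.array([0.3, -1.2])
a = 2.5                                    # linear rescaling of the inputs: x -> a*x

# Compensating by dividing the first-layer weights by a leaves the mapping intact...
assert np.allclose(net(x, W1, b1, W2, b2), net(a * x, W1 / a, b1, W2, b2))

# ...but the simple weight-decay penalty w^T w is not invariant under this
# equivalent reparameterization, which is the inconsistency described above.
penalty = lambda W: np.sum(W ** 2)
print(penalty(W1), penalty(W1 / a))        # different values, same mapping
```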

Figure 5.10 Plot of the sum-of-squares test-set error for the polynomial data set versus the number of hidden
units in the network, with 30 random starts for each network size, showing the effect of local minima. For each
new start, the weight vector was initialized by sampling from an isotropic Gaussian distribution having a mean
of zero and a variance of 10.
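The restart procedure of Figure 5.10 can be sketched as follows. This is a hypothetical setup: the network parameterization is my own, SciPy's standard conjugate-gradient optimizer stands in for the scaled variant used in the book, and fewer sizes and restarts are run here for brevity; only the Gaussian initialization scheme follows the caption.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)                      # 10 training inputs
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)

def sse(w, M):
    """Sum-of-squares error of a two-layer tanh network with M hidden units."""
    W1 = w[: 2 * M].reshape(M, 2)                  # hidden weights and biases
    W2 = w[2 * M :]                                # output weights and bias
    z = np.tanh(np.outer(W1[:, 0], x) + W1[:, 1][:, None])
    y = W2[:M] @ z + W2[M]                         # linear output unit
    return 0.5 * np.sum((y - t) ** 2)

best = {}
for M in (1, 3):                                   # the book sweeps M = 1..10
    errors = []
    for _ in range(5):                             # the book uses 30 random starts
        # initialize from an isotropic zero-mean Gaussian with variance 10
        w0 = rng.normal(scale=np.sqrt(10.0), size=3 * M + 1)
        res = minimize(sse, w0, args=(M,), method="CG",
                       options={"maxiter": 200})
        errors.append(res.fun)
    best[M] = min(errors)                          # spread over starts exposes local minima
print(best)
```

The spread of final errors across restarts for a fixed M is what makes the local-minima effect visible; taking the minimum over restarts gives the per-size curve plotted in the figure.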
