Pattern Recognition and Machine Learning

5.5. Regularization in Neural Networks 257

[Figure 5.9: three panels showing the fitted network functions for M = 1, M = 3, and M = 10 hidden units, each plotted over the input range (0, 1) with outputs in (−1, 1).]

Figure 5.9 Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The
graphs show the result of fitting networks having M = 1, 3 and 10 hidden units, respectively, by minimizing a
sum-of-squares error function using a scaled conjugate-gradient algorithm.

of the form

    Ẽ(w) = E(w) + (λ/2) wᵀw.                         (5.112)

This regularizer is also known as weight decay and has been discussed at length
in Chapter 3. The effective model complexity is then determined by the choice of
the regularization coefficient λ. As we have seen previously, this regularizer can be
interpreted as the negative logarithm of a zero-mean Gaussian prior distribution over
the weight vector w.
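The regularized error (5.112) is straightforward to compute directly. Below is a minimal NumPy sketch; the quadratic `E` used here is a hypothetical stand-in for whatever unregularized error function the network actually minimizes:

```python
import numpy as np

def regularized_error(w, E, lam):
    """Weight decay: E~(w) = E(w) + (lambda/2) w^T w, as in (5.112)."""
    return E(w) + 0.5 * lam * (w @ w)

# Illustrative unregularized error: squared distance to a fixed target vector.
target = np.array([1.0, -2.0, 0.5])
E = lambda w: 0.5 * np.sum((w - target) ** 2)

w = np.array([1.0, -2.0, 0.5])
print(regularized_error(w, E, lam=0.0))   # -> 0.0: w fits exactly, no penalty
print(regularized_error(w, E, lam=0.1))   # -> 0.2625: penalty 0.05 * (1 + 4 + 0.25)
```

Increasing λ pulls the minimizer of the regularized error away from the unregularized optimum and toward the origin, which is how the single coefficient λ controls effective model complexity.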

5.5.1 Consistent Gaussian priors

One of the limitations of simple weight decay in the form (5.112) is that it is
inconsistent with certain scaling properties of network mappings. To illustrate this,
consider a multilayer perceptron network having two layers of weights and linear
output units, which performs a mapping from a set of input variables {xi} to a set
of output variables {yk}. The activations of the hidden units in the first hidden layer
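The scaling inconsistency can be seen concretely in a small numerical sketch (my own parameterization, not the book's notation): rescaling the inputs by a factor a is absorbed exactly by dividing the first-layer weights by a, so the network mapping is unchanged, yet the simple penalty wᵀw assigns the two equivalent parameterizations different values.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # first-layer weights, biases
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # linear output layer

def net(x, W1, b1, W2, b2):
    """Two-layer network with tanh hidden units and linear outputs."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = np.array([0.3, -1.2])
a = 2.5                                    # linear rescaling of the inputs: x -> a*x

# Compensating by dividing the first-layer weights by a leaves the mapping intact...
assert np.allclose(net(x, W1, b1, W2, b2), net(a * x, W1 / a, b1, W2, b2))

# ...but the simple weight-decay penalty w^T w is not invariant under this
# equivalent reparameterization, which is the inconsistency described above.
penalty = lambda W: np.sum(W ** 2)
print(penalty(W1), penalty(W1 / a))        # different values, same mapping
```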

Figure 5.10 Plot of the sum-of-squares test-set error for the polynomial data set versus the number of hidden
units in the network, with 30 random starts for each network size, showing the effect of local minima. For each
new start, the weight vector was initialized by sampling from an isotropic Gaussian distribution having a mean
of zero and a variance of 10.
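The restart procedure of Figure 5.10 can be sketched as follows. This is a hypothetical setup: the network parameterization is my own, SciPy's standard conjugate-gradient optimizer stands in for the scaled variant used in the book, and fewer sizes and restarts are run here for brevity; only the Gaussian initialization scheme follows the caption.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)                      # 10 training inputs
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)

def sse(w, M):
    """Sum-of-squares error of a two-layer tanh network with M hidden units."""
    W1 = w[: 2 * M].reshape(M, 2)                  # hidden weights and biases
    W2 = w[2 * M :]                                # output weights and bias
    z = np.tanh(np.outer(W1[:, 0], x) + W1[:, 1][:, None])
    y = W2[:M] @ z + W2[M]                         # linear output unit
    return 0.5 * np.sum((y - t) ** 2)

best = {}
for M in (1, 3):                                   # the book sweeps M = 1..10
    errors = []
    for _ in range(5):                             # the book uses 30 random starts
        # initialize from an isotropic zero-mean Gaussian with variance 10
        w0 = rng.normal(scale=np.sqrt(10.0), size=3 * M + 1)
        res = minimize(sse, w0, args=(M,), method="CG",
                       options={"maxiter": 200})
        errors.append(res.fun)
    best[M] = min(errors)                          # spread over starts exposes local minima
print(best)
```

The spread of final errors across restarts for a fixed M is what makes the local-minima effect visible; taking the minimum over restarts gives the per-size curve plotted in the figure.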
