5.5. Regularization in Neural Networks


Figure 5.9 Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The graphs show the result of fitting networks having M = 1, 3 and 10 hidden units, respectively, by minimizing a sum-of-squares error function using a scaled conjugate-gradient algorithm.
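The experiment behind Figure 5.9 can be sketched in a few lines of numpy. This is a minimal illustration only: the data-generation details (input range, noise level) are assumptions, and plain batch gradient descent stands in for the scaled conjugate-gradient algorithm used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 points from a sinusoidal data set (noise level is an assumption)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, size=10)

def fit_two_layer(x, t, M, steps=10000, eta=0.003):
    """Fit a two-layer network with M tanh hidden units and a linear
    output unit by minimizing the sum-of-squares error.  Plain batch
    gradient descent is used here in place of scaled conjugate gradients."""
    W1 = rng.normal(0.0, 1.0, (M, 2))       # hidden weights (incl. bias)
    W2 = rng.normal(0.0, 1.0, (1, M + 1))   # output weights (incl. bias)
    X = np.column_stack([x, np.ones_like(x)])          # (N, 2) with bias input
    for _ in range(steps):
        Z = np.tanh(X @ W1.T)                          # hidden activations
        Zb = np.column_stack([Z, np.ones(len(x))])     # (N, M+1) with bias unit
        y = (Zb @ W2.T).ravel()                        # linear outputs
        d = y - t                                      # dE/dy for sum-of-squares
        delta = (d[:, None] * W2[:, :M]) * (1.0 - Z**2)  # back-propagated errors
        W2 -= eta * (d[:, None] * Zb).sum(axis=0, keepdims=True)
        W1 -= eta * (delta.T @ X)
    return y

for M in (1, 3, 10):
    y = fit_two_layer(x, t, M)
    print(M, 0.5 * np.sum((y - t) ** 2))   # sum-of-squares training error
```

With M = 10 the network has enough flexibility to interpolate the 10 noisy points, which is exactly the over-fitting behaviour the regularization methods of this section are designed to control.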

of the form

$$
\widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}. \tag{5.112}
$$

This regularizer is also known as *weight decay* and has been discussed at length in Chapter 3. The effective model complexity is then determined by the choice of the regularization coefficient λ. As we have seen previously, this regularizer can be interpreted as the negative logarithm of a zero-mean Gaussian prior distribution over the weight vector w.
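The effect of (5.112) on a gradient-based optimizer is simply to add λw to the error gradient, pulling the weights toward zero. A minimal sketch, using a hypothetical quadratic error surface as a stand-in for a network's error function:

```python
import numpy as np

def regularized_error(E, grad_E, w, lam):
    """Weight-decay regularizer of Eq. (5.112):
    E~(w) = E(w) + (lam/2) w^T w, with gradient grad_E(w) + lam * w."""
    return E(w) + 0.5 * lam * (w @ w), grad_E(w) + lam * w

# Hypothetical error surface: E(w) = ||w - a||^2 / 2, minimized at w = a
a = np.array([1.0, -2.0])
E = lambda w: 0.5 * np.sum((w - a) ** 2)
grad_E = lambda w: w - a

w = np.zeros(2)
lam = 0.1
for _ in range(1000):
    _, g = regularized_error(E, grad_E, w, lam)
    w -= 0.1 * g          # gradient descent on the regularized error

print(w)  # ≈ a / (1 + lam): the minimum is shrunk toward the origin
```

For this quadratic case the regularized minimum is a/(1 + λ), making explicit how λ trades data fit against small weights; for a neural network the same penalty shrinks the weights and thereby limits the effective model complexity.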

#### 5.5.1 Consistent Gaussian priors

One of the limitations of simple weight decay in the form (5.112) is that it is inconsistent with certain scaling properties of network mappings. To illustrate this, consider a multilayer perceptron network having two layers of weights and linear output units, which performs a mapping from a set of input variables {x_i} to a set of output variables {y_k}. The activations of the hidden units in the first hidden layer

Figure 5.10 Plot of the sum-of-squares test-set error for the polynomial data set versus the number of hidden units in the network, with 30 random starts for each network size, showing the effect of local minima. For each new start, the weight vector was initialized by sampling from an isotropic Gaussian distribution having a mean of zero and a variance of 10.
