Backpropagation
Suppose that we have some data and seek a multilayer perceptron that is an
accurate predictor for the underlying classification problem. Given a fixed
network structure, we must determine appropriate weights for the connections
in the network. In the absence of hidden layers, the perceptron learning rule
from Section 4.6 can be used to find suitable values. But suppose there are
hidden units. We know what the output unit should predict, and could adjust
the weights of the connections leading to that unit based on the perceptron rule.
But the correct outputs for the hidden units are unknown, so the rule cannot
be applied there.
It turns out that, roughly speaking, the solution is to modify the weights of
the connections leading to the hidden units based on the strength of each unit’s
contribution to the final prediction. There is a standard mathematical opti-
mization algorithm, called gradient descent, which achieves exactly that. Unfor-
tunately, it requires taking derivatives, and the step function that the simple
perceptron uses to convert the weighted sum of the inputs into a 0/1 prediction
is not differentiable. We need to see whether the step function can be replaced
with something else.
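To make the role of derivatives concrete, the following is a minimal sketch (an illustration, not taken from the text) of gradient descent on a simple one-dimensional, differentiable error function; the function, starting point, and learning rate are arbitrary choices.

```python
# Minimal gradient-descent sketch: repeatedly step against the derivative.
# The error function f and its derivative df are illustrative choices only.

def f(w):
    return (w - 3.0) ** 2          # a simple differentiable error surface

def df(w):
    return 2.0 * (w - 3.0)         # its derivative

w = 0.0                            # initial weight
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * df(w)     # move downhill along the gradient

print(w)                           # converges towards 3.0, the minimum
```

If f were a step function instead, df would be zero almost everywhere and the update would never move the weight, which is why a differentiable replacement is needed.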
Figure 6.11(a) shows the step function: if the input is smaller than zero, it
outputs zero; otherwise, it outputs one. We want a function that is similar in
shape but differentiable. A commonly used replacement is shown in Figure
6.11(b). In neural networks terminology it is called the sigmoid function, and it
is defined by

\[ f(x) = \frac{1}{1 + e^{-x}} . \]

We encountered it in Section 4.6 when we described the logit transform used
in logistic regression. In fact, learning a multilayer perceptron is closely related
to logistic regression.
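As a brief illustration (a sketch, not part of the original text), the sigmoid and its derivative can be computed directly; the identity f'(x) = f(x)(1 - f(x)) is what makes the function convenient for gradient descent.

```python
import math

def sigmoid(x):
    """The sigmoid (logistic) function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Its derivative, conveniently expressible as f(x) * (1 - f(x))."""
    fx = sigmoid(x)
    return fx * (1.0 - fx)

print(sigmoid(0.0))             # 0.5, the midpoint of the smooth step
print(sigmoid_derivative(0.0))  # 0.25, the maximum slope
```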
To apply the gradient descent procedure, the error function—the thing that
is to be minimized by adjusting the weights—must also be differentiable. The
number of misclassifications—measured by the discrete 0–1 loss mentioned in
Section 5.6—does not fulfill this criterion. Instead, multilayer perceptrons are
usually trained by minimizing the squared error of the network’s output,
essentially treating it as an estimate of the class probability. (Other loss func-
tions are also applicable. For example, if the likelihood is used instead of
the squared error, learning a sigmoid-based perceptron is identical to logistic
regression.)
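As a small illustration of the parenthetical point (a sketch, not from the text), the two losses can be compared for a single predicted probability p and class label y; the negative log-likelihood below is the quantity minimized in logistic regression.

```python
import math

def squared_error(y, p):
    """Squared-error loss for a single instance: 0.5 * (y - p)^2."""
    return 0.5 * (y - p) ** 2

def negative_log_likelihood(y, p):
    """Likelihood-based loss: -[y*log(p) + (1-y)*log(1-p)].
    Minimizing this with a sigmoid output reproduces logistic regression."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

print(squared_error(1.0, 0.8))             # 0.02
print(negative_log_likelihood(1.0, 0.8))   # about 0.223
```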
We work with the squared-error loss function because it is most widely used.
For a single training instance, it is

\[ E = \frac{1}{2}\,(y - f(x))^2 , \]

where f(x) is the network's prediction for the instance and y is its class label.
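Putting the pieces together, the following sketch assumes a single sigmoid output unit with weights w and applies one gradient-descent step to the single-instance squared error; by the chain rule, dE/dw_i = (f(x) - y) * f(x)(1 - f(x)) * x_i. The function and variable names are illustrative, not from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def single_instance_update(weights, x, y, learning_rate=0.1):
    """One gradient-descent step on E = 0.5 * (y - f(x))^2 for a single
    sigmoid unit. Illustrative sketch; names are not from the text."""
    z = sum(w * xi for w, xi in zip(weights, x))   # weighted sum of inputs
    fx = sigmoid(z)                                # unit's prediction
    error = 0.5 * (y - fx) ** 2                    # squared-error loss
    # Chain rule: dE/dw_i = (fx - y) * fx * (1 - fx) * x_i
    delta = (fx - y) * fx * (1.0 - fx)
    new_weights = [w - learning_rate * delta * xi
                   for w, xi in zip(weights, x)]
    return new_weights, error

weights = [0.0, 0.0, 0.0]   # includes a bias weight for the constant input 1
x = [1.0, 0.5, -0.2]        # first component is the bias input
y = 1.0                     # target class label
weights, error = single_instance_update(weights, x, y)
print(weights, error)
```

The same chain-rule reasoning, applied layer by layer from the output back towards the inputs, is what gives backpropagation its name.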