Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

The learning rate determines the step size and hence how quickly the search
converges. If it is too large and the error function has several minima, the search
may overshoot and miss a minimum entirely, or it may oscillate wildly. If it is
too small, progress toward the minimum may be slow. Note that gradient
descent can only find a local minimum. If the function has several minima
(and error functions for multilayer perceptrons usually have many), it may not
find the best one. This is a significant drawback of standard multilayer
perceptrons compared with, for example, support vector machines.
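This behavior is easy to see in one dimension. The following minimal sketch (not from the book; the function, starting points, and learning rate are invented for illustration) runs gradient descent on f(x) = x^4 - 4x^2 + x, which has a shallow minimum near x = 1.35 and a deeper one near x = -1.47. Which one the search finds depends entirely on where it starts:

```python
def gradient_descent(df, x0, learning_rate, steps=1000):
    """Repeatedly step against the gradient df, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x

# Derivative of f(x) = x**4 - 4*x**2 + x, which has two minima.
def df(x):
    return 4 * x**3 - 8 * x + 1

# Started on the right, the search settles in the shallower local minimum:
print(round(gradient_descent(df, x0=2.0, learning_rate=0.01), 2))   # 1.35
# Started on the left, it finds the deeper (global) minimum instead:
print(round(gradient_descent(df, x0=-2.0, learning_rate=0.01), 2))  # -1.47
```

With a much larger learning rate (say 0.3) the same search overshoots wildly instead of converging, and with a much smaller one it crawls, illustrating the tradeoff described above.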
To use gradient descent to find the weights of a multilayer perceptron, the
derivative of the squared error must be determined with respect to each
parameter, that is, each weight in the network. Let's start with a simple
perceptron without a hidden layer. Differentiating the preceding error function
with respect to a particular weight w_i yields

\frac{dE}{dw_i} = -(y - f(x)) \frac{df(x)}{dw_i}.

Here, f(x) is the perceptron's output and x is the weighted sum of the inputs.
To compute the second factor on the right-hand side, the derivative of the
sigmoid function f(x) is needed. It turns out that this has a particularly simple
form that can be written in terms of f(x) itself:

\frac{df(x)}{dx} = f(x)(1 - f(x)).

We use f'(x) to denote this derivative. But we seek the derivative with respect
to w_i, not x. Because

x = \sum_i w_i a_i,

the derivative of f(x) with respect to w_i is

\frac{df(x)}{dw_i} = f'(x) a_i.

Plugging this back into the derivative of the error function yields

\frac{dE}{dw_i} = -(y - f(x)) f'(x) a_i.

This expression gives all that is needed to calculate the change of weight w_i
caused by a particular example vector a (extended by 1 to represent the bias, as
explained previously). Having repeated this computation for each training
instance, we add up the changes associated with a particular weight w_i, multiply
by the learning rate, and subtract the result from w_i's current value.
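Putting the pieces together, the batch update just described can be sketched in Python (an illustrative sketch, not code from the book; the AND dataset, learning rate, and epoch count are invented for the example). Each input vector is extended by a constant 1 so that the last weight serves as the bias:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_perceptron(instances, targets, learning_rate=0.5, epochs=10000):
    """Batch gradient descent for a single sigmoid unit.

    Each instance is extended by a constant 1 so the last weight acts
    as the bias, as described in the text.
    """
    data = [a + [1.0] for a in instances]
    weights = [0.0] * len(data[0])
    for _ in range(epochs):
        # Add up the change for each weight over all training instances.
        deltas = [0.0] * len(weights)
        for a, y in zip(data, targets):
            fx = sigmoid(sum(w * ai for w, ai in zip(weights, a)))
            # dE/dw_i = -(y - f(x)) f'(x) a_i, with f'(x) = f(x)(1 - f(x))
            for i, ai in enumerate(a):
                deltas[i] += -(y - fx) * fx * (1.0 - fx) * ai
        # Multiply by the learning rate and subtract from the current weights.
        for i in range(len(weights)):
            weights[i] -= learning_rate * deltas[i]
    return weights

def predict(weights, a):
    return sigmoid(sum(w * ai for w, ai in zip(weights, a + [1.0])))

# Learn logical AND, which is linearly separable:
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w = train_perceptron(X, y)
print([round(predict(w, a)) for a in X])  # [0, 0, 0, 1]
```

Because there is no hidden layer, this only handles linearly separable problems; the backpropagation algorithm discussed in this section extends the same gradient computation through hidden units.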

CHAPTER 6 | IMPLEMENTATIONS: REAL MACHINE LEARNING SCHEMES
