Because the output unit's input is $x = \sum_i w_i f(x_i)$, we have

$$\frac{dx}{dw_{ij}} = w_i \frac{df(x_i)}{dw_{ij}}.$$

Furthermore,

$$\frac{df(x_i)}{dw_{ij}} = f'(x_i)\,\frac{dx_i}{dw_{ij}} = f'(x_i)\, a_i.$$

This means that we are finished. Putting everything together yields an equation
for the derivative of the error function with respect to the weights $w_{ij}$:

$$\frac{dE}{dw_{ij}} = -\big(y - f(x)\big)\, f'(x)\, w_i\, f'(x_i)\, a_i.$$

As before, we calculate this value for every training instance, add up the changes
associated with a particular weight $w_{ij}$, multiply by the learning rate, and
subtract the outcome from the current value of $w_{ij}$.
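
To make the update concrete, here is a minimal sketch (not from the book) of one
such batch step for a network with a single hidden layer and a single sigmoid
output unit. NumPy, the choice of the sigmoid for f, and names such as
`hidden_w`, `output_w`, and `learning_rate` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_update(hidden_w, output_w, instances, targets, learning_rate):
    """One gradient-descent step: accumulate dE/dw over all instances, then
    subtract the learning-rate-scaled sums from the current weights."""
    grad_hidden = np.zeros_like(hidden_w)       # accumulated dE/dw_ij
    grad_output = np.zeros_like(output_w)       # accumulated dE/dw_i
    for a, y in zip(instances, targets):
        x_i = hidden_w @ a                      # inputs of the hidden units
        f_xi = sigmoid(x_i)                     # outputs of the hidden units
        x = output_w @ f_xi                     # input of the output unit
        f_x = sigmoid(x)                        # prediction f(x)
        delta = (f_x - y) * f_x * (1.0 - f_x)   # (f(x) - y) f'(x)
        grad_output += delta * f_xi             # dE/dw_i = (f(x) - y) f'(x) f(x_i)
        # dE/dw_ij: propagate delta through w_i and f'(x_i),
        # then multiply by the input values of the hidden unit
        grad_hidden += np.outer(delta * output_w * f_xi * (1.0 - f_xi), a)
    output_w = output_w - learning_rate * grad_output
    hidden_w = hidden_w - learning_rate * grad_hidden
    return hidden_w, output_w
```

Here `hidden_w` is a (hidden units x attributes) matrix and `output_w` a vector
with one entry per hidden unit; each call performs exactly one batch update of
the kind described above.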
This derivation applies to a perceptron with one hidden layer. If there are two
hidden layers, the same strategy can be applied a second time to update the
weights pertaining to the input connections of the first hidden layer, propagating
the error from the output unit through the second hidden layer to the first
one. Because of this error propagation mechanism, this version of the generic
gradient descent strategy is called backpropagation.
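
The following sketch generalizes this to any number of hidden layers, under the
same illustrative assumptions as above (sigmoid units, squared error, NumPy).
`weights` and `activations` are hypothetical names for the list of weight
matrices and the list of layer outputs from a forward pass, where
`activations[0]` is the input vector and every layer output is a 1-D array.

```python
import numpy as np

def backward_pass(weights, activations, y):
    """Propagate the error signal from the output layer back towards the inputs,
    producing the gradient of the squared error for every weight matrix."""
    out = activations[-1]
    delta = (out - y) * out * (1.0 - out)            # error signal at the output layer
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        grads[k] = np.outer(delta, activations[k])   # dE/dW for layer k
        if k > 0:                                    # push the error one layer further back
            h = activations[k]
            delta = (weights[k].T @ delta) * h * (1.0 - h)
    return grads
```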
We have tacitly assumed that the network’s output layer has just one unit,
which is appropriate for two-class problems. For more than two classes, a
separate network could be learned for each class that distinguishes it from the
remaining classes. A more compact classifier can be obtained from a single
network by creating an output unit for each class, connecting every unit in the
hidden layer to every output unit. The squared error for a particular training
instance is the sum of squared errors taken over all output units. The same
technique can be applied to predict several targets, or attribute values,
simultaneously by creating a separate output unit for each one. Intuitively,
this may give better predictive accuracy than building a separate classifier
for each class attribute if the underlying learning tasks are in some way related.
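
In symbols, and keeping the factor of 1/2 used earlier for a single output unit
(an assumption carried over here, with $f(x_k)$ denoting the prediction of the
$k$th output unit and $y_k$ its target), the per-instance error to be minimized
becomes

$$E = \frac{1}{2} \sum_k \big(y_k - f(x_k)\big)^2.$$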
We have assumed that weights are only updated after all training instances
have been fed through the network and all the corresponding weight changes
have been accumulated. This is batch learning, because all the training data is
processed together. But exactly the same formulas can be used to update the
weights incrementally after each training instance has been processed. This is
called stochastic backpropagation because the overall error does not necessarily
decrease after every update and there is no guarantee that it will converge to a
minimum of the error function.
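
A minimal sketch of the two regimes, assuming a hypothetical
`gradient_fn(weights, a, y)` that returns the per-instance gradients dE/dW for
each weight matrix (for example, the backward pass sketched earlier):

```python
import numpy as np

def batch_epoch(weights, data, gradient_fn, learning_rate):
    """Batch learning: sum the gradients over all instances, then update once."""
    totals = [np.zeros_like(w) for w in weights]
    for a, y in data:
        for total, g in zip(totals, gradient_fn(weights, a, y)):
            total += g
    return [w - learning_rate * g for w, g in zip(weights, totals)]

def stochastic_epoch(weights, data, gradient_fn, learning_rate):
    """Stochastic (incremental) backpropagation: update after every instance,
    so the overall error need not decrease from one update to the next."""
    for a, y in data:
        grads = gradient_fn(weights, a, y)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights
```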
