minimum. It can be used for online learning, in which new data arrives in a
continuous stream and every training instance is processed just once. In both
variants of backpropagation, it is often helpful to standardize the attributes to
have zero mean and unit standard deviation. Before learning starts, each weight
is initialized to a small, randomly chosen value based on a normal distribution
with zero mean.
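
The two preparation steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `standardize` and `init_weights` are hypothetical, and the standard deviation of 0.1 for the initial weights is an assumed value, since the text says only that the weights should be small.

```python
import random
import statistics

def standardize(column):
    """Rescale one numeric attribute to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        # A constant attribute cannot be rescaled; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

def init_weights(n_inputs, n_units, scale=0.1, seed=None):
    """Draw every weight from a normal distribution with zero mean.

    `scale` is the standard deviation; 0.1 is an assumed value.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(n_inputs)]
            for _ in range(n_units)]

ages = [20.0, 30.0, 40.0, 50.0]          # made-up attribute values
z = standardize(ages)
weights = init_weights(n_inputs=3, n_units=2, seed=0)
```

In practice the mean and standard deviation would be computed on the training data only and then reused to rescale any later instances.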
Like any other learning scheme, multilayer perceptrons trained with
backpropagation may suffer from overfitting, especially if the network is much
larger than is necessary to represent the structure of the underlying
learning problem. Many modifications have been proposed to alleviate this.
A very simple one, called early stopping, works like reduced-error pruning in
rule learners: a holdout set is used to decide when to stop performing further
iterations of the backpropagation algorithm. The error on the holdout set is
measured and the algorithm is terminated once the error begins to increase,
because that indicates overfitting to the training data. Another method,
called weight decay, adds to the error function a penalty term that consists
of the sum of the squared weights in the network. This attempts to limit the
influence of irrelevant connections on the network’s predictions by penalizing
large weights that do not contribute a correspondingly large reduction in the
error.
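
Both remedies are easy to express in code. The sketch below is a minimal illustration rather than a reference implementation. The names `train_one_epoch` and `holdout_error` are hypothetical callables standing in for a real network, the `patience` parameter is an assumed extension that tolerates a few non-improving iterations, and the penalty is taken to be the sum of the squared weights multiplied by an assumed coefficient `decay`.

```python
def weight_decay_penalty(weights, decay=0.01):
    """Penalty added to the error: decay times the sum of squared weights."""
    return decay * sum(w * w for w in weights)

def train_with_early_stopping(train_one_epoch, holdout_error,
                              max_epochs=100, patience=1):
    """Train until the error on the holdout set stops improving.

    train_one_epoch: callable performing one pass of backpropagation.
    holdout_error:   callable returning the current holdout-set error.
    Returns (best_epoch, best_error).
    """
    best_error = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = holdout_error()
        if error < best_error:
            best_error, best_epoch, bad_epochs = error, epoch, 0
        else:
            # Holdout error did not improve: a sign of overfitting.
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_error

# Simulated holdout errors: they fall for three epochs, then rise.
simulated = iter([0.9, 0.6, 0.5, 0.55, 0.7])
best_epoch, best_error = train_with_early_stopping(
    lambda: None, lambda: next(simulated))
```

With the simulated errors, training stops at the fourth epoch and reports the third as the best. A practical version would also save a copy of the weights at the best epoch so that they can be restored after stopping.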
Although standard gradient descent is the simplest technique for learning the
weights in a multilayer perceptron, it is by no means the most efficient one. In
practice, it tends to be rather slow. A trick that often improves performance is
to include a momentum term when updating weights: add to the new weight
change a small proportion of the update value from the previous iteration. This
smooths the search process by making changes in direction less abrupt. More
sophisticated methods use information obtained from the second derivative of
the error function as well; they can converge much more quickly. However, even
those algorithms can be very slow compared with other methods of
classification learning.
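
The momentum update can be written down directly. The following sketch assumes a learning rate of 0.1 and a momentum coefficient of 0.2; both values, and the function name `momentum_step`, are illustrative choices rather than anything prescribed here. It is demonstrated on the toy error function E(w) = w^2, whose gradient is 2w.

```python
def momentum_step(weights, gradients, prev_updates,
                  learning_rate=0.1, momentum=0.2):
    """One gradient-descent step with a momentum term.

    Each weight change is the plain gradient step plus a proportion
    (`momentum`) of the change made in the previous iteration.
    """
    updates = [-learning_rate * g + momentum * u
               for g, u in zip(gradients, prev_updates)]
    new_weights = [w + d for w, d in zip(weights, updates)]
    return new_weights, updates

# Toy problem: minimize E(w) = w^2 starting from w = 1.
w, u = [1.0], [0.0]
for _ in range(50):
    grad = [2.0 * w[0]]              # dE/dw for E(w) = w^2
    w, u = momentum_step(w, grad, u)
```

Because each update carries part of the previous one along, a sudden reversal in the sign of the gradient is partly cancelled, which is what makes changes in direction less abrupt.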
A serious disadvantage of multilayer perceptrons that contain hidden units
is that they are essentially opaque. There are several techniques that attempt to
extract rules from trained neural networks. However, it is unclear whether they
offer any advantages over standard rule learners that induce rule sets directly
from data—especially considering that this can generally be done much more
quickly than learning a multilayer perceptron in the first place.
Although multilayer perceptrons are the most prominent type of neural
network, many others have been proposed. Multilayer perceptrons belong to a
class of networks called feedforward networks because they do not contain any
cycles and the network’s output depends only on the current input instance.
Recurrent neural networks do have cycles. Computations derived from earlier
input are fed back into the network, which gives them a kind of memory.
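
The difference between the two kinds of network can be seen with a single unit. This is a deliberately tiny sketch: the weights 0.5 and 0.8, the tanh activation, and the function names are all assumptions made for illustration.

```python
import math

def feedforward_unit(x, w_in=0.5):
    """Output depends only on the current input."""
    return math.tanh(w_in * x)

def recurrent_unit(x, prev_state, w_in=0.5, w_rec=0.8):
    """Output also depends on a state computed from earlier inputs."""
    return math.tanh(w_in * x + w_rec * prev_state)

def run_recurrent(sequence):
    """Feed a sequence through the recurrent unit, carrying the state along."""
    state = 0.0
    outputs = []
    for x in sequence:
        state = recurrent_unit(x, state)   # the cycle: state is fed back in
        outputs.append(state)
    return outputs

after_zero = run_recurrent([0.0, 1.0])[-1]   # final input 1.0, preceded by 0.0
after_one = run_recurrent([1.0, 1.0])[-1]    # final input 1.0, preceded by 1.0
```

The feedforward unit returns the same value for the input 1.0 whatever came before it, whereas the recurrent unit gives different outputs for the same final input because the earlier input is remembered in its state.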
