Pattern Recognition and Machine Learning

5. NEURAL NETWORKS

An important consideration for many applications of the Hessian is the efficiency
with which it can be evaluated. If there are $W$ parameters (weights and biases) in the
network, then the Hessian matrix has dimensions $W \times W$ and so the computational
effort needed to evaluate the Hessian will scale like $O(W^2)$ for each pattern in the
data set. As we shall see, there are efficient methods for evaluating the Hessian
whose scaling is indeed $O(W^2)$.

5.4.1 Diagonal approximation


Some of the applications for the Hessian matrix discussed above require the
inverse of the Hessian, rather than the Hessian itself. For this reason, there has
been some interest in using a diagonal approximation to the Hessian, in other words
one that simply replaces the off-diagonal elements with zeros, because its inverse is
trivial to evaluate. Again, we shall consider an error function that consists of a sum
of terms, one for each pattern in the data set, so that $E = \sum_n E_n$. The Hessian can
then be obtained by considering one pattern at a time, and then summing the results
over all patterns. From (5.48), the diagonal elements of the Hessian, for pattern $n$,
can be written
$$
\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2}\, z_i^2. \tag{5.79}
$$

Using (5.48) and (5.49), the second derivatives on the right-hand side of (5.79) can
be found recursively using the chain rule of differential calculus to give a backpropagation
equation of the form
$$
\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k \sum_{k'} w_{kj} w_{k'j} \frac{\partial^2 E_n}{\partial a_k \partial a_{k'}} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}. \tag{5.80}
$$
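To run the recursion (5.80), the first and second derivatives with respect to the output-unit activations are needed as the starting point. As a concrete illustration (an assumption for this example, not part of the text above), for linear output units with a sum-of-squares error $E_n = \frac{1}{2}\sum_k (y_k - t_k)^2$ these initial terms take a particularly simple form:
$$
\frac{\partial E_n}{\partial a_k} = y_k - t_k, \qquad \frac{\partial^2 E_n}{\partial a_k \partial a_{k'}} = \delta_{kk'}.
$$
In this special case the second-derivative matrix at the output layer is already diagonal, so for a network with a single hidden layer the approximation obtained by neglecting off-diagonal terms discards nothing at this step.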

If we now neglect off-diagonal elements in the second-derivative terms, we obtain
(Becker and Le Cun, 1989; Le Cun et al., 1990)
$$
\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k w_{kj}^2 \frac{\partial^2 E_n}{\partial a_k^2} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}. \tag{5.81}
$$
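As a minimal sketch of the recursions (5.79) and (5.81), not code from the text, the following assumes a two-layer network with tanh hidden units, linear output units, and a sum-of-squares error; the weights, input, and target values are made up for illustration. With linear outputs the cross terms neglected in (5.81) vanish, so the computed diagonal elements can be checked against a central finite-difference estimate of $\partial^2 E_n / \partial w_{ji}^2$:

```python
import math
import copy

# Tiny two-layer network: D inputs -> M tanh hidden units -> K linear outputs,
# with sum-of-squares error E_n = 0.5 * sum_k (y_k - t_k)^2.
# All numeric values below are arbitrary illustrative choices (no biases, for brevity).
D, M, K = 2, 3, 2
W1 = [[0.3, -0.2], [0.5, 0.1], [-0.4, 0.6]]   # first-layer weights w_ji
W2 = [[0.7, -0.5, 0.2], [0.1, 0.4, -0.3]]     # second-layer weights w_kj
x = [1.0, -0.5]                               # input pattern
t = [0.2, -0.1]                               # target

def forward(W1):
    a = [sum(W1[j][i] * x[i] for i in range(D)) for j in range(M)]  # hidden pre-activations a_j
    z = [math.tanh(aj) for aj in a]                                 # hidden activations z_j
    y = [sum(W2[k][j] * z[j] for j in range(M)) for k in range(K)]  # linear outputs y_k
    return a, z, y

def error(W1):
    _, _, y = forward(W1)
    return 0.5 * sum((y[k] - t[k]) ** 2 for k in range(K))

a, z, y = forward(W1)
dE_da_out = [y[k] - t[k] for k in range(K)]   # dE_n/da_k for linear outputs + SSE
d2E_da2_out = [1.0] * K                       # d2E_n/da_k^2 = 1 in the same case

# Backward pass implementing the diagonal recursion (5.81), then (5.79).
diag = []
for j in range(M):
    hp = 1.0 - z[j] ** 2                      # h'(a_j) for tanh
    hpp = -2.0 * z[j] * hp                    # h''(a_j) for tanh
    d2E_da2_j = (hp ** 2) * sum(W2[k][j] ** 2 * d2E_da2_out[k] for k in range(K)) \
                + hpp * sum(W2[k][j] * dE_da_out[k] for k in range(K))
    # Eq. (5.79): multiply by the squared input to the weight (here z_i = x_i).
    diag.append([d2E_da2_j * x[i] ** 2 for i in range(D)])

# Central finite-difference check of one diagonal Hessian element.
def fd_second(j, i, eps=1e-4):
    Wp = copy.deepcopy(W1); Wp[j][i] += eps
    Wm = copy.deepcopy(W1); Wm[j][i] -= eps
    return (error(Wp) - 2.0 * error(W1) + error(Wm)) / eps ** 2

print(diag[0][0], fd_second(0, 0))
```

Because the output units here are linear, the neglected off-diagonal terms are exactly zero and the two printed values should agree to within finite-difference accuracy; for deeper networks or nonlinear outputs, (5.81) would only approximate the true diagonal.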

Note that the number of computational steps required to evaluate this approximation
is $O(W)$, where $W$ is the total number of weight and bias parameters in the network,
compared with $O(W^2)$ for the full Hessian.
Ricotti et al. (1988) also used the diagonal approximation to the Hessian, but
they retained all terms in the evaluation of $\partial^2 E_n / \partial a_j^2$ and so obtained exact expressions
for the diagonal terms. Note that this no longer has $O(W)$ scaling. The major
problem with diagonal approximations, however, is that in practice the Hessian is
typically found to be strongly nondiagonal, and so these approximations, which are
driven mainly by computational convenience, must be treated with care.