Pattern Recognition and Machine Learning

5. NEURAL NETWORKS

An important consideration for many applications of the Hessian is the efficiency
with which it can be evaluated. If there are $W$ parameters (weights and biases) in the
network, then the Hessian matrix has dimensions $W \times W$ and so the computational
effort needed to evaluate the Hessian will scale like $O(W^2)$ for each pattern in the
data set. As we shall see, there are efficient methods for evaluating the Hessian
whose scaling is indeed $O(W^2)$.

5.4.1 Diagonal approximation


Some of the applications for the Hessian matrix discussed above require the
inverse of the Hessian, rather than the Hessian itself. For this reason, there has
been some interest in using a diagonal approximation to the Hessian, in other words
one that simply replaces the off-diagonal elements with zeros, because its inverse is
trivial to evaluate. Again, we shall consider an error function that consists of a sum
of terms, one for each pattern in the data set, so that $E = \sum_n E_n$. The Hessian can
then be obtained by considering one pattern at a time, and then summing the results
over all patterns. From (5.48), the diagonal elements of the Hessian, for pattern $n$,
can be written
$$
\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2}\, z_i^2. \tag{5.79}
$$

Using (5.48) and (5.49), the second derivatives on the right-hand side of (5.79) can
be found recursively using the chain rule of differential calculus to give a backpropagation
equation of the form
$$
\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k \sum_{k'} w_{kj} w_{k'j} \frac{\partial^2 E_n}{\partial a_k \partial a_{k'}} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}. \tag{5.80}
$$
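To run the recursion (5.80), the first and second derivatives with respect to the output-unit activations are needed as the starting point. As a concrete illustration (an assumption for this example, not part of the text above), for linear output units with a sum-of-squares error $E_n = \frac{1}{2}\sum_k (y_k - t_k)^2$ these initial terms take a particularly simple form:
$$
\frac{\partial E_n}{\partial a_k} = y_k - t_k, \qquad \frac{\partial^2 E_n}{\partial a_k \partial a_{k'}} = \delta_{kk'}.
$$
In this special case the second-derivative matrix at the output layer is already diagonal, so for a network with a single hidden layer the approximation obtained by neglecting off-diagonal terms discards nothing at this step.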

If we now neglect off-diagonal elements in the second-derivative terms, we obtain
(Becker and Le Cun, 1989; Le Cun et al., 1990)
$$
\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k w_{kj}^2 \frac{\partial^2 E_n}{\partial a_k^2} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}. \tag{5.81}
$$
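As a minimal sketch of the recursions (5.79) and (5.81), not code from the text, the following assumes a two-layer network with tanh hidden units, linear output units, and a sum-of-squares error; the weights, input, and target values are made up for illustration. With linear outputs the cross terms neglected in (5.81) vanish, so the computed diagonal elements can be checked against a central finite-difference estimate of $\partial^2 E_n / \partial w_{ji}^2$:

```python
import math
import copy

# Tiny two-layer network: D inputs -> M tanh hidden units -> K linear outputs,
# with sum-of-squares error E_n = 0.5 * sum_k (y_k - t_k)^2.
# All numeric values below are arbitrary illustrative choices (no biases, for brevity).
D, M, K = 2, 3, 2
W1 = [[0.3, -0.2], [0.5, 0.1], [-0.4, 0.6]]   # first-layer weights w_ji
W2 = [[0.7, -0.5, 0.2], [0.1, 0.4, -0.3]]     # second-layer weights w_kj
x = [1.0, -0.5]                               # input pattern
t = [0.2, -0.1]                               # target

def forward(W1):
    a = [sum(W1[j][i] * x[i] for i in range(D)) for j in range(M)]  # hidden pre-activations a_j
    z = [math.tanh(aj) for aj in a]                                 # hidden activations z_j
    y = [sum(W2[k][j] * z[j] for j in range(M)) for k in range(K)]  # linear outputs y_k
    return a, z, y

def error(W1):
    _, _, y = forward(W1)
    return 0.5 * sum((y[k] - t[k]) ** 2 for k in range(K))

a, z, y = forward(W1)
dE_da_out = [y[k] - t[k] for k in range(K)]   # dE_n/da_k for linear outputs + SSE
d2E_da2_out = [1.0] * K                       # d2E_n/da_k^2 = 1 in the same case

# Backward pass implementing the diagonal recursion (5.81), then (5.79).
diag = []
for j in range(M):
    hp = 1.0 - z[j] ** 2                      # h'(a_j) for tanh
    hpp = -2.0 * z[j] * hp                    # h''(a_j) for tanh
    d2E_da2_j = (hp ** 2) * sum(W2[k][j] ** 2 * d2E_da2_out[k] for k in range(K)) \
                + hpp * sum(W2[k][j] * dE_da_out[k] for k in range(K))
    # Eq. (5.79): multiply by the squared input to the weight (here z_i = x_i).
    diag.append([d2E_da2_j * x[i] ** 2 for i in range(D)])

# Central finite-difference check of one diagonal Hessian element.
def fd_second(j, i, eps=1e-4):
    Wp = copy.deepcopy(W1); Wp[j][i] += eps
    Wm = copy.deepcopy(W1); Wm[j][i] -= eps
    return (error(Wp) - 2.0 * error(W1) + error(Wm)) / eps ** 2

print(diag[0][0], fd_second(0, 0))
```

Because the output units here are linear, the neglected off-diagonal terms are exactly zero and the two printed values should agree to within finite-difference accuracy; for deeper networks or nonlinear outputs, (5.81) would only approximate the true diagonal.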

Note that the number of computational steps required to evaluate this approximation
is $O(W)$, where $W$ is the total number of weight and bias parameters in the network,
compared with $O(W^2)$ for the full Hessian.
Ricotti et al. (1988) also used the diagonal approximation to the Hessian, but
they retained all terms in the evaluation of $\partial^2 E_n / \partial a_j^2$ and so obtained exact expressions
for the diagonal terms. Note that this no longer has $O(W)$ scaling. The major
problem with diagonal approximations, however, is that in practice the Hessian is
typically found to be strongly nondiagonal, and so these approximations, which are
driven mainly by computational convenience, must be treated with care.