Pattern Recognition and Machine Learning

2. Both weights in the first layer:

$$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{j'i'}^{(1)}} = x_i x_{i'}\, h''(a_{j'})\, I_{jj'} \sum_k w_{kj'}^{(2)} \delta_k + x_i x_{i'}\, h'(a_{j'})\, h'(a_j) \sum_k \sum_{k'} w_{k'j'}^{(2)} w_{kj}^{(2)} M_{kk'}. \tag{5.94}$$


3. One weight in each layer:

$$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{kj'}^{(2)}} = x_i \left\{ h'(a_{j'})\,\delta_k I_{jj'} + z_{j'}\, h'(a_j) \sum_{k'} w_{k'j}^{(2)} M_{kk'} \right\}. \tag{5.95}$$


Here $I_{jj'}$ is the $j, j'$ element of the identity matrix. If one or both of the weights is a bias term, then the corresponding expressions are obtained simply by setting the appropriate activation(s) to 1 (Exercise 5.23). Inclusion of skip-layer connections is straightforward.
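To make the index bookkeeping in (5.94) and (5.95) concrete, here is a minimal NumPy sketch that assembles these two Hessian blocks for a tiny two-layer network, specialized to tanh hidden units, linear outputs, and a sum-of-squares error, for which $\delta_k = y_k - t_k$ and $M_{kk'}$ reduces to the identity. The array names (`W1`, `W2`, `x`, `t`) are illustrative rather than taken from the text, and the explicit loops mirror the index notation rather than aiming for efficiency:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                      # inputs, hidden units, outputs
W1 = rng.standard_normal((M, D))       # first-layer weights  w(1)_ji
W2 = rng.standard_normal((K, M))       # second-layer weights w(2)_kj
x = rng.standard_normal(D)
t = rng.standard_normal(K)

# Forward propagation.
a = W1 @ x                             # a_j
z = np.tanh(a)                         # z_j = h(a_j)
y = W2 @ z                             # linear outputs y_k

# Sum-of-squares error with linear outputs:
# delta_k = y_k - t_k, and M_kk' is the identity matrix.
delta = y - t
Mkk = np.eye(K)

h1 = 1.0 - z**2                        # h'(a_j)  for tanh
h2 = -2.0 * z * h1                     # h''(a_j) for tanh

# Block (5.94): both weights in the first layer.
H11 = np.zeros((M, D, M, D))
for j in range(M):
    for i in range(D):
        for jp in range(M):            # jp, ip play the roles of j', i'
            for ip in range(D):
                term1 = x[i] * x[ip] * h2[jp] * (j == jp) * (W2[:, jp] @ delta)
                term2 = (x[i] * x[ip] * h1[jp] * h1[j]
                         * (W2[:, jp] @ Mkk @ W2[:, j]))
                H11[j, i, jp, ip] = term1 + term2

# Block (5.95): one weight in each layer.
H12 = np.zeros((M, D, K, M))
for j in range(M):
    for i in range(D):
        for k in range(K):
            for jp in range(M):
                H12[j, i, k, jp] = x[i] * (h1[jp] * delta[k] * (j == jp)
                                           + z[jp] * h1[j] * (Mkk[k] @ W2[:, j]))
```

Because the blocks are exact, they can be checked element by element against finite differences of the backpropagated gradient.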


5.4.6 Fast multiplication by the Hessian


For many applications of the Hessian, the quantity of interest is not the Hessian matrix $\mathbf{H}$ itself but the product of $\mathbf{H}$ with some vector $\mathbf{v}$. We have seen that the evaluation of the Hessian takes $O(W^2)$ operations, and it also requires storage that is $O(W^2)$. The vector $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ that we wish to calculate, however, has only $W$ elements, so instead of computing the Hessian as an intermediate step, we can try to find an efficient approach to evaluating $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ directly in a way that requires only $O(W)$ operations.
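Before developing the exact procedure, note that $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ can already be approximated at $O(W)$ cost from two gradient evaluations by central differences, $\mathbf{v}^{\mathrm{T}}\mathbf{H} \simeq \frac{1}{2\epsilon}\left[\nabla E(\mathbf{w}+\epsilon\mathbf{v}) - \nabla E(\mathbf{w}-\epsilon\mathbf{v})\right]$. The sketch below assumes only a generic gradient routine `grad_E` (a hypothetical callable, not anything defined in the text) and is useful mainly as a numerical check on the exact method:

```python
import numpy as np

def hessian_vector_fd(grad_E, w, v, eps=1e-5):
    """Approximate H v = nabla(nabla E) v by central differences.

    Costs two gradient evaluations, i.e. O(W) work, at the price of
    the usual truncation/round-off trade-off in the choice of eps.
    """
    return (grad_E(w + eps * v) - grad_E(w - eps * v)) / (2.0 * eps)

# Toy check with a quadratic E(w) = 0.5 w^T A w, whose Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_E = lambda w: A @ w
w0 = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
print(hessian_vector_fd(grad_E, w0, v))   # approx A @ v = [5., 5.]
```

Since $\mathbf{H}$ is symmetric, the column vector $\mathbf{H}\mathbf{v}$ computed here carries the same information as the row vector $\mathbf{v}^{\mathrm{T}}\mathbf{H}$.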
To do this, we first note that

$$\mathbf{v}^{\mathrm{T}}\mathbf{H} = \mathbf{v}^{\mathrm{T}} \nabla(\nabla E) \tag{5.96}$$

where $\nabla$ denotes the gradient operator in weight space. We can then write down the standard forward-propagation and backpropagation equations for the evaluation of $\nabla E$ and apply (5.96) to these equations to give a set of forward-propagation and backpropagation equations for the evaluation of $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ (Møller, 1993; Pearlmutter, 1994). This corresponds to acting on the original forward-propagation and backpropagation equations with a differential operator $\mathbf{v}^{\mathrm{T}}\nabla$. Pearlmutter (1994) used the notation $R\{\cdot\}$ to denote the operator $\mathbf{v}^{\mathrm{T}}\nabla$, and we shall follow this convention. The analysis is straightforward and makes use of the usual rules of differential calculus, together with the result

$$R\{\mathbf{w}\} = \mathbf{v}. \tag{5.97}$$
The technique is best illustrated with a simple example, and again we choose a
two-layer network of the form shown in Figure 5.1, with linear output units and a
sum-of-squares error function. As before, we consider the contribution to the error
function from one pattern in the data set. The required vector is then obtained as usual by summing the contributions from each of the patterns separately.
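For this network, a minimal NumPy sketch of the resulting procedure is given below: it acts with $R\{\cdot\}$ on the forward-propagation and backpropagation passes and returns the blocks of $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ for a single pattern in $O(W)$ operations. The explicit $R\{\cdot\}$ equations are written out in the text that follows this excerpt, so treat this as a sketch under the stated assumptions (tanh hidden units, linear outputs, sum-of-squares error); `V1` and `V2`, the components of $\mathbf{v}$ corresponding to the two weight matrices, are illustrative names:

```python
import numpy as np

def rop_hessian_vector(W1, W2, V1, V2, x, t):
    """Single-pattern v^T H for a two-layer network (tanh hidden
    units, linear outputs, sum-of-squares error), obtained by acting
    with R{.} = v^T nabla on forward propagation and backpropagation.
    Returns the blocks of H v laid out like (W1, W2)."""
    # Standard forward propagation.
    a = W1 @ x                  # hidden pre-activations a_j
    z = np.tanh(a)              # z_j = h(a_j)
    y = W2 @ z                  # linear outputs y_k

    h1 = 1.0 - z**2             # h'(a)
    h2 = -2.0 * z * h1          # h''(a)

    # Standard backpropagation.
    d_out = y - t               # delta_k for sum-of-squares error
    d_hid = h1 * (W2.T @ d_out)

    # R-forward pass, using R{W1} = V1, R{W2} = V2, R{x} = 0.
    Ra = V1 @ x
    Rz = h1 * Ra
    Ry = W2 @ Rz + V2 @ z

    # R-backward pass.
    Rd_out = Ry                 # linear outputs: R{delta_k} = R{y_k}
    Rd_hid = (h2 * Ra * (W2.T @ d_out)
              + h1 * (V2.T @ d_out)
              + h1 * (W2.T @ Rd_out))

    # Applying R{.} to the gradient gives the blocks of H v.
    Hv_W1 = np.outer(Rd_hid, x)                        # R{dE/dW1}
    Hv_W2 = np.outer(Rd_out, z) + np.outer(d_out, Rz)  # R{dE/dW2}
    return Hv_W1, Hv_W2
```

The result can be validated against the central-difference check shown earlier, which should agree to within the finite-difference error.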