##### 254 5. NEURAL NETWORKS

- Both weights in the first layer:

$$
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{j'i'}^{(1)}}
= x_i x_{i'}\, h''(a_{j'})\, I_{jj'} \sum_k w_{kj'}^{(2)} \delta_k
+ x_i x_{i'}\, h'(a_{j'})\, h'(a_j) \sum_k \sum_{k'} w_{k'j'}^{(2)} w_{kj}^{(2)} M_{kk'}.
\tag{5.94}
$$

- One weight in each layer:

$$
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{kj'}^{(2)}}
= x_i\, h'(a_{j'}) \left\{ \delta_k I_{jj'} + z_j \sum_{k'} w_{k'j'}^{(2)} M_{kk'} \right\}.
\tag{5.95}
$$

Here $I_{jj'}$ is the $j, j'$ element of the identity matrix. If one or both of the weights is a bias term, then the corresponding expressions are obtained simply by setting the appropriate activation(s) to 1. Inclusion of skip-layer connections is straightforward *(Exercise 5.23)*.

#### 5.4.6 Fast multiplication by the Hessian

For many applications of the Hessian, the quantity of interest is not the Hessian matrix $\mathbf{H}$ itself but the product of $\mathbf{H}$ with some vector $\mathbf{v}$. We have seen that the evaluation of the Hessian takes $O(W^2)$ operations, and it also requires storage that is $O(W^2)$. The vector $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ that we wish to calculate, however, has only $W$ elements, so instead of computing the Hessian as an intermediate step, we can instead try to find an efficient approach to evaluating $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ directly in a way that requires only $O(W)$ operations.
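Before turning to the exact operator-based method, note that a cheap approximation already meets the $O(W)$ budget: the Hessian-vector product can be estimated by central differences of two gradient evaluations, since $\mathbf{H}\mathbf{v} \approx [\nabla E(\mathbf{w} + \epsilon\mathbf{v}) - \nabla E(\mathbf{w} - \epsilon\mathbf{v})]/(2\epsilon)$. A minimal NumPy sketch (function and variable names here are illustrative, not from any library):

```python
import numpy as np

# Central-difference approximation to the Hessian-vector product:
#   H v ~= (grad(w + eps*v) - grad(w - eps*v)) / (2*eps)
# costing two gradient evaluations, i.e. O(W), without ever forming H.
def hessian_vector_product(grad, w, v, eps=1e-6):
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

# Check on a quadratic E(w) = 0.5 * w^T A w, whose Hessian is exactly A,
# so the central difference is exact up to floating-point roundoff.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda w: A @ w
w = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
np.allclose(hessian_vector_product(grad, w, v), A @ v)  # True
```

The differencing approach suffers the usual trade-off between truncation and roundoff error in the choice of $\epsilon$; the $\mathcal{R}\{\cdot\}$ technique developed next gives the product exactly at comparable cost.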

To do this, we first note that

$$
\mathbf{v}^{\mathrm{T}}\mathbf{H} = \mathbf{v}^{\mathrm{T}}\nabla(\nabla E)
\tag{5.96}
$$

where $\nabla$ denotes the gradient operator in weight space. We can then write down the standard forward-propagation and backpropagation equations for the evaluation of $\nabla E$ and apply (5.96) to these equations to give a set of forward-propagation and backpropagation equations for the evaluation of $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ (Møller, 1993; Pearlmutter, 1994). This corresponds to acting on the original forward-propagation and backpropagation equations with a differential operator $\mathbf{v}^{\mathrm{T}}\nabla$. Pearlmutter (1994) used the notation $\mathcal{R}\{\cdot\}$ to denote the operator $\mathbf{v}^{\mathrm{T}}\nabla$, and we shall follow this convention. The analysis is straightforward and makes use of the usual rules of differential calculus, together with the result

$$
\mathcal{R}\{\mathbf{w}\} = \mathbf{v}.
\tag{5.97}
$$
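For a two-layer network with linear outputs and a sum-of-squares error, applying $\mathcal{R}\{\cdot\}$ to the forward and backward passes yields an exact $O(W)$ Hessian-vector product. The following NumPy sketch assumes tanh hidden units for concreteness (any differentiable $h$ with known $h'$ and $h''$ would do) and verifies the result against central differences of the gradient; since $\mathbf{H}$ is symmetric, the computed $\mathbf{H}\mathbf{v}$ also gives $\mathbf{v}^{\mathrm{T}}\mathbf{H}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer network: D inputs -> M tanh hidden units -> K linear outputs,
# with sum-of-squares error E = 0.5 * ||y - t||^2 for one pattern (x, t).
D, M, K = 3, 4, 2
W1 = rng.standard_normal((M, D))   # first-layer weights w_ji
W2 = rng.standard_normal((K, M))   # second-layer weights w_kj
x = rng.standard_normal(D)
t = rng.standard_normal(K)

def gradients(W1, W2):
    """Standard forward propagation and backpropagation."""
    a = W1 @ x                      # hidden pre-activations a_j
    z = np.tanh(a)                  # hidden outputs z_j = h(a_j)
    y = W2 @ z                      # linear outputs y_k
    dk = y - t                      # output deltas delta_k
    dj = (1 - z**2) * (W2.T @ dk)   # hidden deltas; h'(a) = 1 - tanh(a)^2
    return np.outer(dj, x), np.outer(dk, z)   # dE/dW1, dE/dW2

def hvp(W1, W2, V1, V2):
    """R{.}-operator pass (Pearlmutter, 1994): exact H v in O(W) operations."""
    a = W1 @ x
    z = np.tanh(a)
    y = W2 @ z
    h1 = 1 - z**2                   # h'(a)
    h2 = -2 * z * h1                # h''(a)
    Ra = V1 @ x                     # R{a_j}, using R{w} = v
    Rz = h1 * Ra                    # R{z_j}
    Ry = W2 @ Rz + V2 @ z           # R{y_k}
    dk = y - t
    Rdk = Ry                        # R{delta_k}: linear outputs, fixed target
    Rdj = (h2 * Ra * (W2.T @ dk)    # product rule applied to delta_j
           + h1 * (V2.T @ dk)
           + h1 * (W2.T @ Rdk))
    # R{dE/dW1} and R{dE/dW2}: the two blocks of the Hessian-vector product
    return np.outer(Rdj, x), np.outer(Rdk, z) + np.outer(dk, Rz)

# Verify against central differences of the gradient.
V1 = rng.standard_normal(W1.shape)
V2 = rng.standard_normal(W2.shape)
Hv1, Hv2 = hvp(W1, W2, V1, V2)
eps = 1e-6
g1p, g2p = gradients(W1 + eps * V1, W2 + eps * V2)
g1m, g2m = gradients(W1 - eps * V1, W2 - eps * V2)
assert np.allclose(Hv1, (g1p - g1m) / (2 * eps), atol=1e-5)
assert np.allclose(Hv2, (g2p - g2m) / (2 * eps), atol=1e-5)
```

Note that the R-pass reuses the quantities of the ordinary forward and backward passes and simply propagates their directional derivatives alongside them, which is why the cost stays proportional to a single gradient evaluation.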

The technique is best illustrated with a simple example, and again we choose a

two-layer network of the form shown in Figure 5.1, with linear output units and a

sum-of-squares error function. As before, we consider the contribution to the error

function from one pattern in the data set. The required vector is then obtained as