Pattern Recognition and Machine Learning

2. Both weights in the first layer:

$$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{j'i'}^{(1)}} = x_i x_{i'}\, h''(a_{j'})\, I_{jj'} \sum_k w_{kj'}^{(2)} \delta_k + x_i x_{i'}\, h'(a_{j'})\, h'(a_j) \sum_k \sum_{k'} w_{k'j'}^{(2)} w_{kj}^{(2)} M_{kk'}. \tag{5.94}$$


3. One weight in each layer:

$$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{kj'}^{(2)}} = x_i \left\{ h'(a_{j'})\,\delta_k I_{jj'} + z_{j'}\, h'(a_j) \sum_{k'} w_{k'j}^{(2)} M_{kk'} \right\}. \tag{5.95}$$


Here $I_{jj'}$ is the $j, j'$ element of the identity matrix. If one or both of the weights is a bias term, then the corresponding expressions are obtained simply by setting the appropriate activation(s) to 1 (Exercise 5.23). Inclusion of skip-layer connections is straightforward.
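To make the index bookkeeping in (5.94) and (5.95) concrete, here is a minimal NumPy sketch that assembles these two Hessian blocks for a tiny two-layer network, specialized to tanh hidden units, linear outputs, and a sum-of-squares error, for which $\delta_k = y_k - t_k$ and $M_{kk'}$ reduces to the identity. The array names (`W1`, `W2`, `x`, `t`) are illustrative rather than taken from the text, and the explicit loops mirror the index notation rather than aiming for efficiency:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                      # inputs, hidden units, outputs
W1 = rng.standard_normal((M, D))       # first-layer weights  w(1)_ji
W2 = rng.standard_normal((K, M))       # second-layer weights w(2)_kj
x = rng.standard_normal(D)
t = rng.standard_normal(K)

# Forward propagation.
a = W1 @ x                             # a_j
z = np.tanh(a)                         # z_j = h(a_j)
y = W2 @ z                             # linear outputs y_k

# Sum-of-squares error with linear outputs:
# delta_k = y_k - t_k, and M_kk' is the identity matrix.
delta = y - t
Mkk = np.eye(K)

h1 = 1.0 - z**2                        # h'(a_j)  for tanh
h2 = -2.0 * z * h1                     # h''(a_j) for tanh

# Block (5.94): both weights in the first layer.
H11 = np.zeros((M, D, M, D))
for j in range(M):
    for i in range(D):
        for jp in range(M):            # jp, ip play the roles of j', i'
            for ip in range(D):
                term1 = x[i] * x[ip] * h2[jp] * (j == jp) * (W2[:, jp] @ delta)
                term2 = (x[i] * x[ip] * h1[jp] * h1[j]
                         * (W2[:, jp] @ Mkk @ W2[:, j]))
                H11[j, i, jp, ip] = term1 + term2

# Block (5.95): one weight in each layer.
H12 = np.zeros((M, D, K, M))
for j in range(M):
    for i in range(D):
        for k in range(K):
            for jp in range(M):
                H12[j, i, k, jp] = x[i] * (h1[jp] * delta[k] * (j == jp)
                                           + z[jp] * h1[j] * (Mkk[k] @ W2[:, j]))
```

Because the blocks are exact, they can be checked element by element against finite differences of the backpropagated gradient.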


5.4.6 Fast multiplication by the Hessian


For many applications of the Hessian, the quantity of interest is not the Hessian matrix $\mathbf{H}$ itself but the product of $\mathbf{H}$ with some vector $\mathbf{v}$. We have seen that the evaluation of the Hessian takes $O(W^2)$ operations, and it also requires storage that is $O(W^2)$. The vector $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ that we wish to calculate, however, has only $W$ elements, so instead of computing the Hessian as an intermediate step, we can try to find an efficient approach to evaluating $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ directly in a way that requires only $O(W)$ operations.
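Before developing the exact procedure, note that $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ can already be approximated at $O(W)$ cost from two gradient evaluations by central differences, $\mathbf{v}^{\mathrm{T}}\mathbf{H} \simeq \frac{1}{2\epsilon}\left[\nabla E(\mathbf{w}+\epsilon\mathbf{v}) - \nabla E(\mathbf{w}-\epsilon\mathbf{v})\right]$. The sketch below assumes only a generic gradient routine `grad_E` (a hypothetical callable, not anything defined in the text) and is useful mainly as a numerical check on the exact method:

```python
import numpy as np

def hessian_vector_fd(grad_E, w, v, eps=1e-5):
    """Approximate H v = nabla(nabla E) v by central differences.

    Costs two gradient evaluations, i.e. O(W) work, at the price of
    the usual truncation/round-off trade-off in the choice of eps.
    """
    return (grad_E(w + eps * v) - grad_E(w - eps * v)) / (2.0 * eps)

# Toy check with a quadratic E(w) = 0.5 w^T A w, whose Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_E = lambda w: A @ w
w0 = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
print(hessian_vector_fd(grad_E, w0, v))   # approx A @ v = [5., 5.]
```

Since $\mathbf{H}$ is symmetric, the column vector $\mathbf{H}\mathbf{v}$ computed here carries the same information as the row vector $\mathbf{v}^{\mathrm{T}}\mathbf{H}$.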
To do this, we first note that

$$\mathbf{v}^{\mathrm{T}}\mathbf{H} = \mathbf{v}^{\mathrm{T}} \nabla(\nabla E) \tag{5.96}$$

where $\nabla$ denotes the gradient operator in weight space. We can then write down the standard forward-propagation and backpropagation equations for the evaluation of $\nabla E$ and apply (5.96) to these equations to give a set of forward-propagation and backpropagation equations for the evaluation of $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ (Møller, 1993; Pearlmutter, 1994). This corresponds to acting on the original forward-propagation and backpropagation equations with a differential operator $\mathbf{v}^{\mathrm{T}}\nabla$. Pearlmutter (1994) used the notation $R\{\cdot\}$ to denote the operator $\mathbf{v}^{\mathrm{T}}\nabla$, and we shall follow this convention. The analysis is straightforward and makes use of the usual rules of differential calculus, together with the result

$$R\{\mathbf{w}\} = \mathbf{v}. \tag{5.97}$$
The technique is best illustrated with a simple example, and again we choose a
two-layer network of the form shown in Figure 5.1, with linear output units and a
sum-of-squares error function. As before, we consider the contribution to the error
function from one pattern in the data set. The required vector is then obtained as usual by summing the contributions from each of the patterns separately.
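For this network, a minimal NumPy sketch of the resulting procedure is given below: it acts with $R\{\cdot\}$ on the forward-propagation and backpropagation passes and returns the blocks of $\mathbf{v}^{\mathrm{T}}\mathbf{H}$ for a single pattern in $O(W)$ operations. The explicit $R\{\cdot\}$ equations are written out in the text that follows this excerpt, so treat this as a sketch under the stated assumptions (tanh hidden units, linear outputs, sum-of-squares error); `V1` and `V2`, the components of $\mathbf{v}$ corresponding to the two weight matrices, are illustrative names:

```python
import numpy as np

def rop_hessian_vector(W1, W2, V1, V2, x, t):
    """Single-pattern v^T H for a two-layer network (tanh hidden
    units, linear outputs, sum-of-squares error), obtained by acting
    with R{.} = v^T nabla on forward propagation and backpropagation.
    Returns the blocks of H v laid out like (W1, W2)."""
    # Standard forward propagation.
    a = W1 @ x                  # hidden pre-activations a_j
    z = np.tanh(a)              # z_j = h(a_j)
    y = W2 @ z                  # linear outputs y_k

    h1 = 1.0 - z**2             # h'(a)
    h2 = -2.0 * z * h1          # h''(a)

    # Standard backpropagation.
    d_out = y - t               # delta_k for sum-of-squares error
    d_hid = h1 * (W2.T @ d_out)

    # R-forward pass, using R{W1} = V1, R{W2} = V2, R{x} = 0.
    Ra = V1 @ x
    Rz = h1 * Ra
    Ry = W2 @ Rz + V2 @ z

    # R-backward pass.
    Rd_out = Ry                 # linear outputs: R{delta_k} = R{y_k}
    Rd_hid = (h2 * Ra * (W2.T @ d_out)
              + h1 * (V2.T @ d_out)
              + h1 * (W2.T @ Rd_out))

    # Applying R{.} to the gradient gives the blocks of H v.
    Hv_W1 = np.outer(Rd_hid, x)                        # R{dE/dW1}
    Hv_W2 = np.outer(Rd_out, z) + np.outer(d_out, Rz)  # R{dE/dW2}
    return Hv_W1, Hv_W2
```

The result can be validated against the central-difference check shown earlier, which should agree to within the finite-difference error.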