5.2. Network Training

Figure 5.6  In the neighbourhood of a minimum w⋆, the error function can be approximated by a quadratic. Contours of constant error are then ellipses whose axes are aligned with the eigenvectors u_i of the Hessian matrix, with lengths that are inversely proportional to the square roots of the corresponding eigenvalues λ_i. [Figure: contour ellipse in the (w_1, w_2) plane, centred on w⋆, with axes along u_1 and u_2 of lengths proportional to λ_1^{-1/2} and λ_2^{-1/2}.]

Because the eigenvectors {u_i} form a complete set, an arbitrary vector v can be written in the form

    v = \sum_i c_i u_i .                                        (5.38)

From (5.33) and (5.34), using the eigenvalue equation H u_i = λ_i u_i and the orthonormality of the eigenvectors, we then have

    v^{\mathrm{T}} H v = \sum_i c_i^2 \lambda_i                 (5.39)
and so H will be positive definite if, and only if, all of its eigenvalues are positive (Exercise 5.10). In the new coordinate system, whose basis vectors are given by the eigenvectors {u_i}, the contours of constant E are ellipses centred on the origin, as illustrated in Figure 5.6 (Exercise 5.11). For a one-dimensional weight space, a stationary point w⋆ will be a minimum if

    \left. \frac{\partial^2 E}{\partial w^2} \right|_{w^\star} > 0 .    (5.40)

The corresponding result in D dimensions is that the Hessian matrix, evaluated at w⋆, should be positive definite (Exercise 5.12).
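As a purely illustrative numerical check of (5.38)–(5.40) (a minimal NumPy sketch, not part of the text), the following constructs a symmetric matrix H, expands a random vector in its eigenbasis, and confirms that v^T H v equals the sum of c_i^2 λ_i, so that positive definiteness is equivalent to all eigenvalues being positive.

import numpy as np

rng = np.random.default_rng(0)

W = 4                                    # dimensionality of weight space
A = rng.normal(size=(W, W))
H = A.T @ A + 0.1 * np.eye(W)            # symmetric, positive-definite stand-in for the Hessian

lam, U = np.linalg.eigh(H)               # eigenvalues lambda_i, eigenvectors u_i (columns of U)

v = rng.normal(size=W)                   # an arbitrary vector
c = U.T @ v                              # coefficients c_i in v = sum_i c_i u_i, cf. (5.38)

print(np.allclose(v @ H @ v, np.sum(c**2 * lam)))   # v^T H v = sum_i c_i^2 lambda_i, cf. (5.39)
print(np.all(lam > 0))                               # positive definite iff every lambda_i > 0, cf. (5.40)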


5.2.3 Use of gradient information


As we shall see in Section 5.3, it is possible to evaluate the gradient of an error function efficiently by means of the backpropagation procedure. The use of this gradient information can lead to significant improvements in the speed with which the minima of the error function can be located. We can see why this is so, as follows.

In the quadratic approximation to the error function, given in (5.28), the error surface is specified by the quantities b and H, which contain a total of W(W+3)/2 independent elements, because the matrix H is symmetric (Exercise 5.13), where W is the dimensionality of w (i.e., the total number of adaptive parameters in the network). The location of the minimum of this quadratic approximation therefore depends on O(W^2) parameters, and we should not expect to be able to locate the minimum until we have gathered O(W^2) independent pieces of information. If we do not make use of gradient information, we would expect to have to perform O(W^2) function evaluations.
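To make the counting concrete (a small sketch of my own, not from the text): the quadratic approximation is fixed by the W components of b plus the W(W+1)/2 independent elements of the symmetric matrix H, i.e. W(W+3)/2 numbers in total, and while each evaluation of E supplies a single number, each evaluation of the gradient supplies W.

def quadratic_parameter_count(W):
    # W elements of b plus W(W+1)/2 independent elements of symmetric H
    return W + W * (W + 1) // 2          # equals W(W+3)/2

for W in (10, 100, 1000):
    n = quadratic_parameter_count(W)
    print(f"W = {W:5d}: {n} independent quantities; "
          f"roughly {n} function values or {(n + W - 1) // W} gradient evaluations")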
