# Pattern Recognition and Machine Learning

##### 238 5. NEURAL NETWORKS

where cubic and higher terms have been omitted. Here $\mathbf{b}$ is defined to be the gradient of $E$ evaluated at $\widehat{\mathbf{w}}$

$$
\mathbf{b} \equiv \nabla E|_{\mathbf{w}=\widehat{\mathbf{w}}} \tag{5.29}
$$

and the Hessian matrix $\mathbf{H} = \nabla\nabla E$ has elements

$$
(\mathbf{H})_{ij} \equiv \left. \frac{\partial E}{\partial w_i \partial w_j} \right|_{\mathbf{w}=\widehat{\mathbf{w}}}. \tag{5.30}
$$

From (5.28), the corresponding local approximation to the gradient is given by

``∇Eb+H(w−ŵ). (5.31)``

For points $\mathbf{w}$ that are sufficiently close to $\widehat{\mathbf{w}}$, these expressions will give reasonable approximations for the error and its gradient.

Consider the particular case of a local quadratic approximation around a point $\mathbf{w}^\star$ that is a minimum of the error function. In this case there is no linear term, because $\nabla E = 0$ at $\mathbf{w}^\star$, and (5.28) becomes

``E(w)=E(w)+``

##### 2

``(w−w)TH(w−w) (5.32)``

where the Hessian $\mathbf{H}$ is evaluated at $\mathbf{w}^\star$. In order to interpret this geometrically, consider the eigenvalue equation for the Hessian matrix

$$
\mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{5.33}
$$

where the eigenvectors $\mathbf{u}_i$ form a complete orthonormal set (Appendix C) so that

$$
\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}. \tag{5.34}
$$

We now expand $(\mathbf{w} - \mathbf{w}^\star)$ as a linear combination of the eigenvectors in the form

``w−w=``

``∑``

``i``

``αiui. (5.35)``

This can be regarded as a transformation of the coordinate system in which the origin is translated to the point $\mathbf{w}^\star$, and the axes are rotated to align with the eigenvectors (through the orthogonal matrix whose columns are the $\mathbf{u}_i$), as discussed in more detail in Appendix C. Substituting (5.35) into (5.32), and using (5.33) and (5.34), allows the error function to be written in the form

``E(w)=E(w)+``

##### 2

``∑``

``i``

``λiα^2 i. (5.36)``
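A short numerical sketch (the positive definite Hessian `H`, minimum `w_star`, and offset `E_star` below are hand-picked assumptions) confirms that the change of variables (5.35) turns the quadratic form (5.32) into the diagonal form (5.36):

```python
import numpy as np

# Hand-picked symmetric Hessian and minimum point (assumptions for illustration)
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
w_star = np.array([1.0, -1.0])
E_star = 0.5

def E(w):
    # Quadratic error of the form (5.32)
    d = w - w_star
    return E_star + 0.5 * d @ H @ d

# Eigenvalue equation (5.33); eigh returns orthonormal eigenvectors,
# so the columns of U satisfy (5.34)
lam, U = np.linalg.eigh(H)

w = np.array([1.7, 0.2])            # an arbitrary test point
alpha = U.T @ (w - w_star)          # coefficients alpha_i in (5.35)
E_diag = E_star + 0.5 * np.sum(lam * alpha**2)   # right-hand side of (5.36)

print(np.isclose(E(w), E_diag))     # True: (5.32) and (5.36) agree
```

The agreement is exact (up to rounding) because the $\mathbf{u}_i$ are orthonormal, so the cross terms $\alpha_i \alpha_j \mathbf{u}_i^{\mathrm{T}} \mathbf{H} \mathbf{u}_j$ with $i \neq j$ vanish.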

A matrix $\mathbf{H}$ is said to be *positive definite* if, and only if,

$$
\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v}. \tag{5.37}
$$
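For a symmetric $\mathbf{H}$, expanding $\mathbf{v} = \sum_i \alpha_i \mathbf{u}_i$ in (5.37) gives $\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} = \sum_i \lambda_i \alpha_i^2$, so positive definiteness is equivalent to all eigenvalues $\lambda_i$ being positive. A small sketch of this test, using hand-picked example matrices (assumptions, not from the text):

```python
import numpy as np

def is_positive_definite(H):
    # For symmetric H, (5.37) holds iff every eigenvalue is > 0:
    # with v = sum_i alpha_i u_i, v^T H v = sum_i lambda_i alpha_i^2.
    return bool(np.all(np.linalg.eigvalsh(H) > 0))

H_pd = np.array([[2.0, 0.5],
                 [0.5, 1.0]])      # both eigenvalues positive
H_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])   # eigenvalues 3 and -1

print(is_positive_definite(H_pd))     # True
print(is_positive_definite(H_indef))  # False
```

`eigvalsh` is used rather than `eigvals` because it is specialized to symmetric (Hermitian) matrices and returns real eigenvalues directly.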