# Pattern Recognition and Machine Learning

##### 238 5. NEURAL NETWORKS

where cubic and higher terms have been omitted. Here $\mathbf{b}$ is defined to be the gradient of $E$ evaluated at $\widehat{\mathbf{w}}$

$$
\mathbf{b} \equiv \nabla E|_{\mathbf{w}=\widehat{\mathbf{w}}} \tag{5.29}
$$

and the Hessian matrix $\mathbf{H} = \nabla\nabla E$ has elements

$$
(\mathbf{H})_{ij} \equiv \left. \frac{\partial E}{\partial w_i \partial w_j} \right|_{\mathbf{w}=\widehat{\mathbf{w}}}. \tag{5.30}
$$

From (5.28), the corresponding local approximation to the gradient is given by

``∇Eb+H(w−ŵ). (5.31)``

For points $\mathbf{w}$ that are sufficiently close to $\widehat{\mathbf{w}}$, these expressions will give reasonable approximations for the error and its gradient.

Consider the particular case of a local quadratic approximation around a point $\mathbf{w}^\star$ that is a minimum of the error function. In this case there is no linear term, because $\nabla E = 0$ at $\mathbf{w}^\star$, and (5.28) becomes

``E(w)=E(w)+``

##### 2

``(w−w)TH(w−w) (5.32)``

where the Hessian $\mathbf{H}$ is evaluated at $\mathbf{w}^\star$. In order to interpret this geometrically, consider the eigenvalue equation for the Hessian matrix

$$
\mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{5.33}
$$

where the eigenvectors $\mathbf{u}_i$ form a complete orthonormal set (Appendix C) so that

$$
\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}. \tag{5.34}
$$

We now expand $(\mathbf{w} - \mathbf{w}^\star)$ as a linear combination of the eigenvectors in the form

``w−w=``

``∑``

``i``

``αiui. (5.35)``

This can be regarded as a transformation of the coordinate system in which the origin is translated to the point $\mathbf{w}^\star$, and the axes are rotated to align with the eigenvectors (through the orthogonal matrix whose columns are the $\mathbf{u}_i$), as discussed in more detail in Appendix C. Substituting (5.35) into (5.32), and using (5.33) and (5.34), allows the error function to be written in the form

``E(w)=E(w)+``

##### 2

``∑``

``i``

``λiα^2 i. (5.36)``
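A short numerical sketch (the positive definite Hessian `H`, minimum `w_star`, and offset `E_star` below are hand-picked assumptions) confirms that the change of variables (5.35) turns the quadratic form (5.32) into the diagonal form (5.36):

```python
import numpy as np

# Hand-picked symmetric Hessian and minimum point (assumptions for illustration)
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
w_star = np.array([1.0, -1.0])
E_star = 0.5

def E(w):
    # Quadratic error of the form (5.32)
    d = w - w_star
    return E_star + 0.5 * d @ H @ d

# Eigenvalue equation (5.33); eigh returns orthonormal eigenvectors,
# so the columns of U satisfy (5.34)
lam, U = np.linalg.eigh(H)

w = np.array([1.7, 0.2])            # an arbitrary test point
alpha = U.T @ (w - w_star)          # coefficients alpha_i in (5.35)
E_diag = E_star + 0.5 * np.sum(lam * alpha**2)   # right-hand side of (5.36)

print(np.isclose(E(w), E_diag))     # True: (5.32) and (5.36) agree
```

The agreement is exact (up to rounding) because the $\mathbf{u}_i$ are orthonormal, so the cross terms $\alpha_i \alpha_j \mathbf{u}_i^{\mathrm{T}} \mathbf{H} \mathbf{u}_j$ with $i \neq j$ vanish.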

A matrix $\mathbf{H}$ is said to be *positive definite* if, and only if,

$$
\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v}. \tag{5.37}
$$
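For a symmetric $\mathbf{H}$, expanding $\mathbf{v} = \sum_i \alpha_i \mathbf{u}_i$ in (5.37) gives $\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} = \sum_i \lambda_i \alpha_i^2$, so positive definiteness is equivalent to all eigenvalues $\lambda_i$ being positive. A small sketch of this test, using hand-picked example matrices (assumptions, not from the text):

```python
import numpy as np

def is_positive_definite(H):
    # For symmetric H, (5.37) holds iff every eigenvalue is > 0:
    # with v = sum_i alpha_i u_i, v^T H v = sum_i lambda_i alpha_i^2.
    return bool(np.all(np.linalg.eigvalsh(H) > 0))

H_pd = np.array([[2.0, 0.5],
                 [0.5, 1.0]])      # both eigenvalues positive
H_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])   # eigenvalues 3 and -1

print(is_positive_definite(H_pd))     # True
print(is_positive_definite(H_indef))  # False
```

`eigvalsh` is used rather than `eigvals` because it is specialized to symmetric (Hermitian) matrices and returns real eigenvalues directly.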