# Pattern Recognition and Machine Learning

##### 238 5. NEURAL NETWORKS

where cubic and higher terms have been omitted. Here $\mathbf{b}$ is defined to be the gradient of $E$ evaluated at $\widehat{\mathbf{w}}$

$$
\mathbf{b} \equiv \nabla E|_{\mathbf{w}=\widehat{\mathbf{w}}} \tag{5.29}
$$

and the Hessian matrix $\mathbf{H} = \nabla\nabla E$ has elements

$$
(\mathbf{H})_{ij} \equiv \left. \frac{\partial E}{\partial w_i \partial w_j} \right|_{\mathbf{w}=\widehat{\mathbf{w}}}. \tag{5.30}
$$

From (5.28), the corresponding local approximation to the gradient is given by

``∇Eb+H(w−ŵ). (5.31)``

For points $\mathbf{w}$ that are sufficiently close to $\widehat{\mathbf{w}}$, these expressions will give reasonable approximations for the error and its gradient.

Consider the particular case of a local quadratic approximation around a point $\mathbf{w}^\star$ that is a minimum of the error function. In this case there is no linear term, because $\nabla E = 0$ at $\mathbf{w}^\star$, and (5.28) becomes

``E(w)=E(w)+``

##### 2

``(w−w)TH(w−w) (5.32)``

where the Hessian $\mathbf{H}$ is evaluated at $\mathbf{w}^\star$. In order to interpret this geometrically, consider the eigenvalue equation for the Hessian matrix

$$
\mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{5.33}
$$

where the eigenvectors $\mathbf{u}_i$ form a complete orthonormal set (Appendix C) so that

$$
\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}. \tag{5.34}
$$

We now expand $(\mathbf{w} - \mathbf{w}^\star)$ as a linear combination of the eigenvectors in the form

``w−w=``

``∑``

``i``

``αiui. (5.35)``

This can be regarded as a transformation of the coordinate system in which the origin is translated to the point $\mathbf{w}^\star$, and the axes are rotated to align with the eigenvectors (through the orthogonal matrix whose columns are the $\mathbf{u}_i$), as discussed in more detail in Appendix C. Substituting (5.35) into (5.32), and using (5.33) and (5.34), allows the error function to be written in the form

``E(w)=E(w)+``

##### 2

``∑``

``i``

``λiα^2 i. (5.36)``
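A short numerical sketch (the positive definite Hessian `H`, minimum `w_star`, and offset `E_star` below are hand-picked assumptions) confirms that the change of variables (5.35) turns the quadratic form (5.32) into the diagonal form (5.36):

```python
import numpy as np

# Hand-picked symmetric Hessian and minimum point (assumptions for illustration)
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
w_star = np.array([1.0, -1.0])
E_star = 0.5

def E(w):
    # Quadratic error of the form (5.32)
    d = w - w_star
    return E_star + 0.5 * d @ H @ d

# Eigenvalue equation (5.33); eigh returns orthonormal eigenvectors,
# so the columns of U satisfy (5.34)
lam, U = np.linalg.eigh(H)

w = np.array([1.7, 0.2])            # an arbitrary test point
alpha = U.T @ (w - w_star)          # coefficients alpha_i in (5.35)
E_diag = E_star + 0.5 * np.sum(lam * alpha**2)   # right-hand side of (5.36)

print(np.isclose(E(w), E_diag))     # True: (5.32) and (5.36) agree
```

The agreement is exact (up to rounding) because the $\mathbf{u}_i$ are orthonormal, so the cross terms $\alpha_i \alpha_j \mathbf{u}_i^{\mathrm{T}} \mathbf{H} \mathbf{u}_j$ with $i \neq j$ vanish.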

A matrix $\mathbf{H}$ is said to be *positive definite* if, and only if,

$$
\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v}. \tag{5.37}
$$
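For a symmetric $\mathbf{H}$, expanding $\mathbf{v} = \sum_i \alpha_i \mathbf{u}_i$ in (5.37) gives $\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} = \sum_i \lambda_i \alpha_i^2$, so positive definiteness is equivalent to all eigenvalues $\lambda_i$ being positive. A small sketch of this test, using hand-picked example matrices (assumptions, not from the text):

```python
import numpy as np

def is_positive_definite(H):
    # For symmetric H, (5.37) holds iff every eigenvalue is > 0:
    # with v = sum_i alpha_i u_i, v^T H v = sum_i lambda_i alpha_i^2.
    return bool(np.all(np.linalg.eigvalsh(H) > 0))

H_pd = np.array([[2.0, 0.5],
                 [0.5, 1.0]])      # both eigenvalues positive
H_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])   # eigenvalues 3 and -1

print(is_positive_definite(H_pd))     # True
print(is_positive_definite(H_indef))  # False
```

`eigvalsh` is used rather than `eigvals` because it is specialized to symmetric (Hermitian) matrices and returns real eigenvalues directly.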