where we have made use of (4.88). Also, we have introduced the $N \times N$ diagonal matrix $\mathbf{R}$ with elements

$$R_{nn} = y_n(1 - y_n). \qquad (4.98)$$
We see that the Hessian is no longer constant but depends on $\mathbf{w}$ through the weighting matrix $\mathbf{R}$, corresponding to the fact that the error function is no longer quadratic. Using the property $0 < y_n < 1$, which follows from the form of the logistic sigmoid function, we see that $\mathbf{u}^{\mathrm{T}}\mathbf{H}\mathbf{u} > 0$ for an arbitrary vector $\mathbf{u}$, and so the Hessian matrix $\mathbf{H}$ is positive definite. It follows that the error function is a convex function of $\mathbf{w}$ and hence has a unique minimum (Exercise 4.15).
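As a quick numerical illustration (not part of the text), the following sketch builds the Hessian $\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi}$ for a randomly generated design matrix and checks that its eigenvalues are positive, consistent with the positive-definiteness argument above; the variable names and the use of NumPy are assumptions of this example, and positive definiteness also presupposes a full-column-rank design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 4
Phi = rng.normal(size=(N, M))            # design matrix (assumed full column rank)
w = rng.normal(size=M)                   # an arbitrary current weight vector

y = 1.0 / (1.0 + np.exp(-(Phi @ w)))     # y_n = sigma(w^T phi_n), so 0 < y_n < 1
R = np.diag(y * (1.0 - y))               # diagonal weighting matrix, eq. (4.98)
H = Phi.T @ R @ Phi                      # Hessian of the cross-entropy error

print(np.all(np.linalg.eigvalsh(H) > 0)) # True: H is positive definite
```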
The Newton-Raphson update formula for the logistic regression model then becomes

$$
\begin{aligned}
\mathbf{w}^{(\text{new})} &= \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{y} - \mathbf{t}) \\
&= (\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi})^{-1}\left\{ \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi}\,\mathbf{w}^{(\text{old})} - \boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{y} - \mathbf{t}) \right\} \\
&= (\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\mathbf{z} \qquad (4.99)
\end{aligned}
$$

where $\mathbf{z}$ is an $N$-dimensional vector with elements

$$\mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{(\text{old})} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t}). \qquad (4.100)$$
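As a minimal sketch (the function name and the use of NumPy are illustrative assumptions, not from the text), a single update of the form (4.99)-(4.100) can be written as one weighted least-squares solve:

```python
import numpy as np

def newton_step(Phi, t, w_old):
    """One Newton-Raphson step for logistic regression, eqs. (4.99)-(4.100)."""
    y = 1.0 / (1.0 + np.exp(-(Phi @ w_old)))   # y_n = sigma(w_old^T phi_n)
    r = y * (1.0 - y)                          # diagonal of R, eq. (4.98)
    z = Phi @ w_old - (y - t) / r              # effective targets, eq. (4.100)
    A = Phi.T @ (r[:, None] * Phi)             # Phi^T R Phi
    b = Phi.T @ (r * z)                        # Phi^T R z
    return np.linalg.solve(A, b)               # solve the weighted normal equations
```

Writing the step this way makes explicit the structure discussed next: it is the same solve as ordinary least squares, but with weights $r_n = y_n(1 - y_n)$ and targets $z_n$.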

We see that the update formula (4.99) takes the form of a set of normal equations for a weighted least-squares problem. Because the weighting matrix $\mathbf{R}$ is not constant but depends on the parameter vector $\mathbf{w}$, we must apply the normal equations iteratively, each time using the new weight vector $\mathbf{w}$ to compute a revised weighting matrix $\mathbf{R}$. For this reason, the algorithm is known as iterative reweighted least squares, or IRLS (Rubin, 1983). As in the weighted least-squares problem, the elements of the diagonal weighting matrix $\mathbf{R}$ can be interpreted as variances because the mean and variance of $t$ in the logistic regression model are given by

$$\mathbb{E}[t] = \sigma(\mathbf{x}) = y \qquad (4.101)$$
$$\operatorname{var}[t] = \mathbb{E}[t^2] - \mathbb{E}[t]^2 = \sigma(\mathbf{x}) - \sigma(\mathbf{x})^2 = y(1 - y) \qquad (4.102)$$

where we have used the property $t^2 = t$ for $t \in \{0, 1\}$. In fact, we can interpret IRLS as the solution to a linearized problem in the space of the variable $a = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}$. The quantity $z_n$, which corresponds to the $n$th element of $\mathbf{z}$, can then be given a simple interpretation as an effective target value in this space obtained by making a local linear approximation to the logistic sigmoid function around the current operating point $\mathbf{w}^{(\text{old})}$

$$
\begin{aligned}
a_n(\mathbf{w}) &\simeq a_n(\mathbf{w}^{(\text{old})}) + \left.\frac{\mathrm{d}a_n}{\mathrm{d}y_n}\right|_{\mathbf{w}^{(\text{old})}} (t_n - y_n) \\
&= \boldsymbol{\phi}_n^{\mathrm{T}}\mathbf{w}^{(\text{old})} - \frac{(y_n - t_n)}{y_n(1 - y_n)} = z_n. \qquad (4.103)
\end{aligned}
$$
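The full IRLS procedure then simply repeats the weighted least-squares solve, recomputing $\mathbf{R}$ and $\mathbf{z}$ from the current weights at each pass. The following self-contained sketch is one way this might look (illustrative names, NumPy assumed; the small floor on $r_n$ guards against $y_n$ reaching 0 or 1 numerically, and no regularization is included), applied to a toy two-class problem:

```python
import numpy as np

def irls(Phi, t, max_iter=100, tol=1e-8):
    """Iterative reweighted least squares for logistic regression."""
    w = np.zeros(Phi.shape[1])                      # start from w = 0, i.e. all y_n = 0.5
    for _ in range(max_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))        # current predictions
        r = np.clip(y * (1.0 - y), 1e-12, None)     # R_nn, eq. (4.98), floored for safety
        z = Phi @ w - (y - t) / r                   # effective targets, eq. (4.100)
        w_new = np.linalg.solve(Phi.T @ (r[:, None] * Phi),
                                Phi.T @ (r * z))    # weighted normal equations, eq. (4.99)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Toy usage: two overlapping Gaussian classes with a bias basis function.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])             # phi(x) = (1, x1, x2)
print(irls(Phi, t))
```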