4.3.4 Multiclass logistic regression
In our discussion of generative models for multiclass classification (Section 4.2), we have seen that for a large class of distributions, the posterior probabilities are given by a softmax transformation of linear functions of the feature variables, so that

p(\mathcal{C}_k|\boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}    (4.104)

where the 'activations' a_k are given by

a_k = \mathbf{w}_k^{\mathrm{T}} \boldsymbol{\phi}.    (4.105)

There we used maximum likelihood to determine separately the class-conditional
densities and the class priors and then found the corresponding posterior probabilities
using Bayes' theorem, thereby implicitly determining the parameters \{\mathbf{w}_k\}. Here we consider the use of maximum likelihood to determine the parameters \{\mathbf{w}_k\} of this model directly. To do this, we will require the derivatives of y_k with respect to all of the activations a_j (Exercise 4.17). These are given by

\frac{\partial y_k}{\partial a_j} = y_k(I_{kj} - y_j)    (4.106)

where I_{kj} are the elements of the identity matrix.
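As a concrete check on (4.104) and (4.106), the following minimal sketch (the function name softmax and the test activations are our own choices, not from the text) builds the Jacobian y_k(I_{kj} - y_j) and verifies it against finite differences:

```python
import numpy as np

def softmax(a):
    # Softmax of (4.104); subtracting max(a) is a standard stability
    # trick and leaves the result unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([1.0, -0.5, 2.0])        # arbitrary test activations
y = softmax(a)

# Jacobian from (4.106): dy_k/da_j = y_k (I_kj - y_j).
jac = np.diag(y) - np.outer(y, y)

# Central finite-difference check, one column of the Jacobian at a time.
eps = 1e-6
num = np.empty((3, 3))
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    num[:, j] = (softmax(a + d) - softmax(a - d)) / (2 * eps)
print(np.allclose(jac, num, atol=1e-8))   # True
```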
Next we write down the likelihood function. This is most easily done using the 1-of-K coding scheme, in which the target vector \mathbf{t}_n for a feature vector \boldsymbol{\phi}_n belonging to class \mathcal{C}_k is a binary vector with all elements zero except for element k, which equals one. The likelihood function is then given by

p(\mathbf{T}|\mathbf{w}_1, \ldots, \mathbf{w}_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(\mathcal{C}_k|\boldsymbol{\phi}_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}    (4.107)

where y_{nk} = y_k(\boldsymbol{\phi}_n), and \mathbf{T} is an N \times K matrix of target variables with elements t_{nk}. Taking the negative logarithm then gives

E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\ln p(\mathbf{T}|\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}    (4.108)

which is known as the cross-entropy error function for the multiclass classification problem.
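For a sense of how (4.107) and (4.108) are evaluated in practice, here is a small sketch on assumed synthetic data (all sizes and variable names, such as Phi and W, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 5, 3, 4                       # data points, features, classes
Phi = rng.normal(size=(N, M))           # rows are feature vectors phi_n
W = rng.normal(size=(K, M))             # rows are parameter vectors w_k

# 1-of-K coding: row n of T has a single 1 in the true class column.
T = np.eye(K)[rng.integers(K, size=N)]

A = Phi @ W.T                           # activations a_nk = w_k^T phi_n
A -= A.max(axis=1, keepdims=True)       # stabilise the softmax
Y = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)   # y_nk, eq. (4.104)

E = -np.sum(T * np.log(Y))              # cross-entropy error, eq. (4.108)
print(E)
```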
We now take the gradient of the error function with respect to one of the parameter vectors \mathbf{w}_j. Making use of the result (4.106) for the derivatives of the softmax function, we obtain (Exercise 4.18)

\nabla_{\mathbf{w}_j} E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \boldsymbol{\phi}_n    (4.109)

where we have made use of \sum_k t_{nk} = 1.
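Since (4.109) is precisely the gradient needed for gradient-based training, a minimal sketch of batch gradient descent on the cross-entropy error might look as follows; the setup mirrors the previous sketch, and the step size eta and iteration count are arbitrary illustrative choices, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 5, 3, 4
Phi = rng.normal(size=(N, M))           # feature vectors phi_n as rows
T = np.eye(K)[rng.integers(K, size=N)]  # 1-of-K targets t_nk
W = np.zeros((K, M))                    # one row per parameter vector w_j

def error_and_grad(W):
    A = Phi @ W.T
    A -= A.max(axis=1, keepdims=True)
    Y = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
    E = -np.sum(T * np.log(Y))          # eq. (4.108)
    G = (Y - T).T @ Phi                 # eq. (4.109), row j is the gradient for w_j
    return E, G

eta = 0.1                               # illustrative step size
for _ in range(100):
    E, G = error_and_grad(W)
    W -= eta * G
print(E)                                # error has decreased from N ln K at W = 0
```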