$$\sum_{i=1}^{n}\left(1 - x^{(i)}\right)\log\left(1 - \Pr\left[1 \mid a_1^{(i)}, a_2^{(i)}, \ldots, a_k^{(i)}\right]\right) + x^{(i)}\log \Pr\left[1 \mid a_1^{(i)}, a_2^{(i)}, \ldots, a_k^{(i)}\right]$$

where the x^(i) are either zero or one.
The weights w_i need to be chosen to maximize the log-likelihood. There are
several methods for solving this maximization problem. A simple one is to
iteratively solve a sequence of weighted least-squares regression problems until
the log-likelihood converges to a maximum, which usually happens in a few
iterations.
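To make this concrete, here is a minimal NumPy sketch of one such scheme, iteratively reweighted least squares; the function and variable names are our own, and the sketch illustrates the idea rather than any particular package's implementation.

```python
import numpy as np

def fit_logistic_irls(A, x, n_iter=25, tol=1e-8):
    """Maximize the two-class logistic log-likelihood by repeatedly
    solving weighted least-squares problems (a sketch, names our own)."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])  # prepend 1s for the bias w0
    w = np.zeros(A1.shape[1])
    prev_ll = -np.inf
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A1 @ w)))        # Pr[1 | a] for each instance
        ll = np.sum(x * np.log(p + 1e-12) + (1 - x) * np.log(1 - p + 1e-12))
        if abs(ll - prev_ll) < tol:                # usually converges in a few iterations
            break
        prev_ll = ll
        W = np.maximum(p * (1.0 - p), 1e-12)       # weights of the least-squares problem
        z = A1 @ w + (x - p) / W                   # working response
        AW = A1 * W[:, None]                       # rows of A1 scaled by W
        w = np.linalg.solve(A1.T @ AW, AW.T @ z)   # one weighted least-squares step
    return w
```

Here A is the matrix of attribute values and x the vector of zero/one class labels, matching the notation above.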
To generalize logistic regression to several classes, one possibility is to proceed in the way described previously for multiresponse linear regression by performing logistic regression independently for each class. Unfortunately, the resulting probability estimates will not sum to one. To obtain proper probabilities it is necessary to couple the individual models for each class. This yields a joint optimization problem, and there are efficient solution methods for this.
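As a sketch of the independent per-class approach, the code below fits one two-class model per class and reports the raw membership probabilities; fit can be any two-class trainer, such as the fit_logistic_irls sketch above, and nothing here forces the probabilities to sum to one.

```python
import numpy as np

def fit_one_per_class(A, y, classes, fit):
    """One two-class model per class: that class versus all the others."""
    return {c: fit(A, (y == c).astype(float)) for c in classes}

def membership_probabilities(a, models):
    """Raw per-class probabilities for one instance; they need not sum to one."""
    a1 = np.concatenate([[1.0], a])                # prepend 1 for the bias w0
    return {c: 1.0 / (1.0 + np.exp(-(a1 @ w))) for c, w in models.items()}
```

A common pragmatic workaround is simply to renormalize these estimates so they sum to one, but the coupled joint optimization described above is the principled route to proper probabilities.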
A conceptually simpler, and very general, way to address multiclass problems is known as pairwise classification. Here a classifier is built for every pair of classes, using only the instances from these two classes. The output on an unknown test example is based on which class receives the most votes. This method generally yields accurate results in terms of classification error. It can also be used to produce probability estimates by applying a method called pairwise coupling, which calibrates the individual probability estimates from the different classifiers.
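The voting scheme is straightforward to sketch; pairwise coupling itself is more involved and is omitted here. Again fit is any two-class trainer, such as the fit_logistic_irls sketch above, and the names are our own.

```python
from itertools import combinations
import numpy as np

def fit_pairwise(A, y, classes, fit):
    """Build one classifier for each of the k(k-1)/2 pairs of classes,
    trained only on the instances of those two classes."""
    models = {}
    for c1, c2 in combinations(classes, 2):
        mask = (y == c1) | (y == c2)
        models[(c1, c2)] = fit(A[mask], (y[mask] == c2).astype(float))
    return models

def predict_by_votes(a, models, classes):
    """Each pairwise model votes for one of its two classes;
    the class collecting the most votes is predicted."""
    a1 = np.concatenate([[1.0], a])                # prepend 1 for the bias w0
    votes = {c: 0 for c in classes}
    for (c1, c2), w in models.items():
        p = 1.0 / (1.0 + np.exp(-(a1 @ w)))        # Pr[c2 | a] under this pair's model
        votes[c2 if p >= 0.5 else c1] += 1
    return max(votes, key=votes.get)
```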
If there are k classes, pairwise classification builds a total of k(k - 1)/2 classifiers. Although this sounds unnecessarily computationally intensive, it is not. In fact, if the classes are evenly populated, pairwise classification is at least as fast as any other multiclass method. The reason is that each pairwise learning problem involves only the instances pertaining to the two classes under consideration. If n instances are divided evenly among k classes, this amounts to 2n/k instances per problem. Suppose the learning algorithm for a two-class problem with n instances takes time proportional to n seconds to execute. Then the run time for pairwise classification is proportional to k(k - 1)/2 × 2n/k seconds, which is (k - 1)n. In other words, the method scales linearly with the number of classes. If the learning algorithm takes more time, say proportional to n^2, the advantage of the pairwise approach becomes even more pronounced.
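Spelling out the arithmetic makes both claims concrete; the comparison with a one-model-per-class scheme is our own illustration. With n instances split evenly over k classes, each pairwise problem sees 2n/k instances, so for a linear-time learner the total cost is

$$\frac{k(k-1)}{2} \times \frac{2n}{k} = (k-1)\,n,$$

slightly less than the kn that a one-model-per-class scheme, training each model on all n instances, would need. For a learner taking time proportional to n^2, each pair costs (2n/k)^2, giving

$$\frac{k(k-1)}{2} \times \left(\frac{2n}{k}\right)^2 = \frac{2(k-1)}{k}\,n^2 < 2n^2,$$

whereas one model per class trained on all n instances would cost kn^2.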
The use of linear functions for classification can easily be visualized in
instance space. The decision boundary for two-class logistic regression lies
where the prediction probability is 0.5, that is:
$$\Pr[1 \mid a_1, a_2, \ldots, a_k] = \frac{1}{1 + \exp(-w_0 - w_1 a_1 - \cdots - w_k a_k)} = 0.5$$

This occurs when

$$-w_0 - w_1 a_1 - \cdots - w_k a_k = 0$$

Because this is a linear equality in the attribute values, the boundary is a plane, or hyperplane, in instance space.