in the two-attribute case, where $a_1$ and $a_2$ are the attribute values, and there are
three weights $w_i$ to be learned. However, the equation defining the maximum
margin hyperplane can be written in another form, in terms of the support
vectors. Write the class value $y$ of a training instance as either 1 (for yes, it is
in this class) or $-1$ (for no, it is not). Then the maximum margin hyperplane
is

$$x = b + \sum_{i \text{ is support vector}} \alpha_i\, y_i\, \mathbf{a}(i) \cdot \mathbf{a}.$$
Here, $y_i$ is the class value of training instance $\mathbf{a}(i)$, while $b$ and $\alpha_i$ are numeric
parameters that have to be determined by the learning algorithm. Note that $\mathbf{a}(i)$
and $\mathbf{a}$ are vectors. The vector $\mathbf{a}$ represents a test instance, just as the vector
$[a_1, a_2]$ represented a test instance in the earlier formulation. The vectors $\mathbf{a}(i)$
are the support vectors, those circled in Figure 6.8; they are selected members
of the training set. The term $\mathbf{a}(i) \cdot \mathbf{a}$ represents the dot product of the test
instance with one of the support vectors. If you are not familiar with dot product
notation, you should still be able to understand the gist of what follows: just think
of $\mathbf{a}(i)$ as the whole set of attribute values for the $i$th support vector. Finally, $b$
and $\alpha_i$ are parameters that determine the hyperplane, just as the weights $w_0$, $w_1$,
and $w_2$ are parameters that determine the hyperplane in the earlier formulation.
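To make the formula concrete, here is a minimal sketch in Python with NumPy that evaluates $x$ for a test instance and turns its sign into a class prediction. The support vectors, class values $y_i$, multipliers $\alpha_i$, and constant $b$ below are invented purely for illustration; in practice they would come out of the learning algorithm.

```python
import numpy as np

# Hypothetical learned quantities, purely for illustration: three support
# vectors a(i) in a two-attribute space, their class values y_i (1 or -1),
# the multipliers alpha_i, and the constant b.
support_vectors = np.array([[1.0, 2.0],
                            [2.0, 1.5],
                            [3.0, 3.5]])
y = np.array([1, -1, 1])            # class value of each support vector
alpha = np.array([0.4, 0.7, 0.3])   # numeric parameters set by the learner
b = 0.2                             # the other learned parameter

def hyperplane_value(a):
    """x = b + sum over support vectors of alpha_i * y_i * (a(i) . a)."""
    return b + np.sum(alpha * y * (support_vectors @ a))

test_instance = np.array([2.5, 2.0])
x = hyperplane_value(test_instance)
print("x =", x, "-> predicted class:", 1 if x >= 0 else -1)
```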
It turns out that finding the support vectors for the instance sets and determining
the parameters $b$ and $\alpha_i$ belong to a standard class of optimization
problems known as constrained quadratic optimization. There are off-the-shelf
software packages for solving these problems (see Fletcher 1987 for a comprehensive
and practical account of solution methods). However, the computational
complexity can be reduced, and learning can be accelerated, if
special-purpose algorithms for training support vector machines are applied,
but the details of these algorithms lie beyond the scope of this book (Platt 1998).
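As one concrete illustration of the off-the-shelf route, the sketch below fits a linear support vector machine on a made-up, linearly separable toy problem and reads back exactly the quantities that appear in the formula above: the support vectors $\mathbf{a}(i)$, the products $\alpha_i y_i$, and $b$. It assumes scikit-learn is available (its SVC class wraps the LIBSVM library, one of the special-purpose trainers alluded to here); the data and settings are invented for the example.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# A tiny, linearly separable toy problem (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C: (almost) hard margin

# The fitted model exposes the ingredients of the support-vector formula:
print(clf.support_vectors_)   # the a(i): selected members of the training set
print(clf.dual_coef_)         # the products alpha_i * y_i
print(clf.intercept_)         # the parameter b
```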
Nonlinear class boundaries
We motivated the introduction of support vector machines by claiming that
they can be used to model nonlinear class boundaries. However, so far we have
only described the linear case. Consider what happens when an attribute
transformation, as described previously, is applied to the training data before
determining the maximum margin hyperplane. Recall that there are two problems
with the straightforward application of such transformations to linear models:
infeasible computational complexity on the one hand and overfitting on the
other.
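To make the transformation idea concrete, here is a simplified sketch: each two-attribute instance is mapped to the three products of pairs of attribute values, and a linear maximum margin hyperplane is then found in that transformed space, giving a nonlinear (here, elliptical) boundary in the original space. The feature map and the synthetic data are invented for the example, and scikit-learn's linear SVC is used only as a stand-in for any maximum margin learner.

```python
import numpy as np
from sklearn.svm import SVC

def quadratic_map(a):
    """Map [a1, a2] to the products of pairs of attribute values."""
    a1, a2 = a
    return [a1 * a1, a1 * a2, a2 * a2]

# Synthetic data: class 1 inside a circle, -1 outside, so no straight line
# separates the classes in the original two-attribute space.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 0.5, 1, -1)

# Transform the attributes first, then fit the linear hyperplane.
X_transformed = np.array([quadratic_map(a) for a in X])
clf = SVC(kernel="linear", C=1e3).fit(X_transformed, y)
print("training accuracy:", clf.score(X_transformed, y))
```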
With support vectors, overfitting is unlikely to occur. The reason is that it is
inevitably associated with instability: changing one or two instance vectors will
make sweeping changes to large sections of the decision boundary. But the