constructed in the new space can represent a nonlinear decision boundary in
the original space.
Imagine applying this idea directly to the ordinary linear models in Section
4.6. For example, the original set of attributes could be replaced by one giving
all products of n factors that can be constructed from these attributes. An
example for two attributes, including all products with three factors, is

x = w_1 a_1^3 + w_2 a_1^2 a_2 + w_3 a_1 a_2^2 + w_4 a_2^3
Here, x is the outcome, a_1 and a_2 are the two attribute values, and there are
four weights w_i to be learned. As described in Section 4.6, the result can be used for
classification by training one linear system for each class and assigning an
unknown instance to the class that gives the greatest output x—the standard
technique of multiresponse linear regression. Then, a_1 and a_2 will be the
attribute values for the test instance. To generate a linear model in the space spanned
by these products, each training instance is mapped into the new space by
computing all possible three-factor products of its two attribute values. The
learning algorithm is then applied to the transformed instances. To classify an
instance, it is processed by the same transformation prior to classification. There
is nothing to stop us from adding in more synthetic attributes. For example, if
a constant term were included, the original attributes and all two-factor products
of them would yield a total of 10 weights to be learned: one for the constant, two
for the original attributes, three for the two-factor products, and four for the
three-factor products. (Alternatively,
adding an additional attribute whose value was always a constant would have
the same effect.) Indeed, polynomials of sufficiently high degree can approxi-
mate arbitrary decision boundaries to any required accuracy.
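To make the procedure concrete, here is a small illustrative sketch (not from the book, and using NumPy as a stand-in for any linear learning scheme) that maps two-attribute instances into the space of all three-factor products and fits the four weights by least squares; for classification, one such system would be trained per class and an unknown instance assigned to the class giving the greatest output. The particular weight values and data are invented for the example.

    import numpy as np

    def three_factor_products(a):
        # Map an instance (a1, a2) into the space of all three-factor products.
        a1, a2 = a
        return np.array([a1**3, a1**2 * a2, a1 * a2**2, a2**3])

    # Invented "true" weights w1..w4, used only to generate example outcomes.
    true_w = np.array([2.0, -1.0, 0.5, 3.0])

    rng = np.random.default_rng(0)
    A = rng.uniform(-1, 1, size=(100, 2))                 # 100 instances, 2 attributes
    Z = np.array([three_factor_products(a) for a in A])   # transformed instances
    x = Z @ true_w                                        # outcomes of the cubic model

    # An ordinary linear learner applied in the transformed space recovers the weights.
    w, *_ = np.linalg.lstsq(Z, x, rcond=None)
    print(np.round(w, 3))                                 # approximately [ 2. -1.  0.5  3.]

    # A test instance is transformed in the same way before its output is computed.
    print(three_factor_products((0.5, -0.2)) @ w)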
It seems too good to be true—and it is. As you will probably have guessed,
problems arise with this procedure because of the large number of coefficients
introduced by the transformation in any realistic setting. The first snag is com-
putational complexity. With 10 attributes in the original dataset, suppose we
want to include all products with five factors: then the learning algorithm will
have to determine more than 2000 coefficients. If its run time is cubic in the
number of attributes, as it is for linear regression, training will be infeasible.
That is a problem of practicality. The second problem is one of principle: over-
fitting. If the number of coefficients is large relative to the number of training
instances, the resulting model will be “too nonlinear”—it will overfit the train-
ing data. There are just too many parameters in the model.
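The scale of the blow-up is easy to check: the number of distinct products of d factors drawn, with repetition, from n attributes is the multiset coefficient C(n + d - 1, d). The short check below (illustrative code, Python standard library only) reproduces both counts used in this section: the four weights of the two-attribute cubic example and the more than 2000 coefficients for 10 attributes and five factors.

    from math import comb

    def num_products(n_attributes, n_factors):
        # Distinct products of n_factors factors drawn with repetition
        # from n_attributes attributes: C(n + d - 1, d).
        return comb(n_attributes + n_factors - 1, n_factors)

    print(num_products(2, 3))    # 4    -- the two-attribute, three-factor example
    print(num_products(10, 5))   # 2002 -- "more than 2000 coefficients"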
The maximum margin hyperplane
Support vector machines solve both problems. They are based on an algorithm
that finds a special kind of linear model: the maximum margin hyperplane. We
already know what a hyperplane is—it’s just another word for a linear model.
To visualize a maximum margin hyperplane, imagine a two-class dataset whose