maximum margin hyperplane is relatively stable: it moves only if training
instances that are support vectors are added or deleted—and this is true even
in the high-dimensional space spanned by the nonlinear transformation.
Overfitting is caused by too much flexibility in the decision boundary. The support
vectors are global representatives of the whole set of training points, and there
are usually few of them, which gives little flexibility. Thus overfitting is unlikely
to occur.
What about computational complexity? This is still a problem. Suppose that
the transformed space is a high-dimensional one so that the transformed
support vectors and test instance have many components. According to the
preceding equation, every time an instance is classified its dot product with all
support vectors must be calculated. In the high-dimensional space produced by
the nonlinear mapping this is rather expensive. Obtaining the dot product
involves one multiplication and one addition for each attribute, and the number
of attributes in the new space can be huge. This problem occurs not only during
classification but also during training, because the optimization algorithms have
to calculate the same dot products very frequently.
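To get a rough sense of the scale (the numbers here are ours, purely for illustration, and are not from the example in the text): with k original attributes, including all products of n factors, repeated factors allowed, yields a transformed space with "k + n - 1 choose n" components. For k = 100 and n = 3 that is (102 × 101 × 100)/6 = 171,700 components, each of which contributes a multiplication and an addition to every dot product.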
Fortunately, it turns out that it is possible to calculate the dot product before
the nonlinear mapping is performed, on the original attribute set. A
high-dimensional version of the preceding equation is simply

x = b + Σ_{i is a support vector} α_i y_i (a(i) · a)^n

where n is chosen as the number of factors in the transformation (three in the
example we used earlier). If you expand the term (a(i) · a)^n, you will find that it
contains all the high-dimensional terms that would have been involved if the
test and training vectors were first transformed by including all products of n
factors and the dot product was taken of the result. (If you actually do the
calculation, you will notice that some constant factors—binomial coefficients—
are introduced. However, these do not matter: it is the dimensionality of the
space that concerns us; the constants merely scale the axes.) Because of this
mathematical equivalence, the dot products can be computed in the original
low-dimensional space, and the problem becomes feasible. In implementation
terms, you take a software package for constrained quadratic optimization and
every time a(i) · a is evaluated you evaluate (a(i) · a)^n instead. It’s as simple as that,
because in both the optimization and the classification algorithms these vectors
are only ever used in this dot product form. The training vectors, including the
support vectors, and the test instance all remain in the original low-dimensional
space throughout the calculations.
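To make this concrete, here is a small illustrative sketch in Python (ours, not part of the text; the vectors, coefficients, and function names such as poly_kernel and transform are made up for the example). It checks numerically that raising the ordinary dot product to the power n gives the same value as explicitly mapping two vectors into the space of all products of n factors, with square roots of the appropriate constant factors (the binomial coefficients mentioned above, or more generally multinomial coefficients) supplying the scaling, and then shows the classification sum written in kernelized form.

# Illustrative sketch only: the vectors and coefficients below are made up.
import itertools
import math

def poly_kernel(x, y, n):
    """Polynomial kernel: the ordinary dot product raised to the power n."""
    return sum(xi * yi for xi, yi in zip(x, y)) ** n

def transform(x, n):
    """Explicit map into the space of all products of n factors.

    Each component corresponds to a multiset of n attribute indices; the
    square root of the multinomial coefficient provides the constant factor
    that makes transform(x) . transform(y) equal to (x . y) ** n exactly.
    """
    k = len(x)
    features = []
    for combo in itertools.combinations_with_replacement(range(k), n):
        coeff = math.factorial(n)
        for idx in set(combo):
            coeff //= math.factorial(combo.count(idx))
        prod = 1.0
        for idx in combo:
            prod *= x[idx]
        features.append(math.sqrt(coeff) * prod)
    return features

n = 3
a_i = [1.0, 2.0, 3.0]    # a "support vector" in the original space
a = [0.5, -1.0, 2.0]     # a "test instance" in the original space

explicit = sum(u * v for u, v in zip(transform(a_i, n), transform(a, n)))
via_kernel = poly_kernel(a_i, a, n)
print(explicit, via_kernel)   # the two values agree, up to rounding error

# The classification sum x = b + sum over support vectors of
# alpha_i * y_i * (a(i) . a)^n, with made-up parameters:
support_vectors = [[1.0, 2.0, 3.0], [0.0, 1.0, -1.0]]
alphas = [0.4, 0.7]
class_values = [1, -1]
b = 0.1
x = b + sum(alpha * y * poly_kernel(sv, a, n)
            for alpha, y, sv in zip(alphas, class_values, support_vectors))
print(x)

Of course, the whole point is that transform is never actually called during training or classification; it appears here only to verify the equivalence.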
The function (x · y)^n, which computes the dot product of two vectors x and
y and raises the result to the power n, is called a polynomial kernel. A good