maximum margin hyperplane is relatively stable: it only moves if training
instances are added or deleted that are support vectors—and this is true even
in the high-dimensional space spanned by the nonlinear transformation. Over-
fitting is caused by too much flexibility in the decision boundary. The support
vectors are global representatives of the whole set of training points, and there
are usually few of them, which gives little flexibility. Thus overfitting is unlikely
to occur.
What about computational complexity? This is still a problem. Suppose that
the transformed space is a high-dimensional one so that the transformed
support vectors and test instance have many components. According to the pre-
ceding equation, every time an instance is classified its dot product with all
support vectors must be calculated. In the high-dimensional space produced by
the nonlinear mapping this is rather expensive. Obtaining the dot product
involves one multiplication and one addition for each attribute, and the number
of attributes in the new space can be huge. This problem occurs not only during
classification but also during training, because the optimization algorithms have
to calculate the same dot products very frequently.
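To make the cost concrete, here is a minimal sketch in Java of the classification step (illustrative code, not code from the book; the names supportVectors, alpha, classValue, and b are assumptions made for this example). Each prediction takes one dot product per support vector, and each dot product costs one multiplication and one addition per attribute, so it becomes expensive when the vectors live in the huge space produced by the nonlinear mapping.

// Illustrative sketch (not from the book): classifying a test instance with a
// learned support vector machine.
public class SvmClassification {

    // Decision value x = b + sum over support vectors of alpha_i * y_i * (a(i) . a);
    // the sign of x determines the predicted class.
    static double decisionValue(double[][] supportVectors, double[] alpha,
                                double[] classValue, double b, double[] test) {
        double x = b;
        for (int i = 0; i < supportVectors.length; i++) {
            x += alpha[i] * classValue[i] * dot(supportVectors[i], test);
        }
        return x;
    }

    // One multiplication and one addition per attribute: costly when the
    // (transformed) vectors have a very large number of components.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }
}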
Fortunately, it turns out that it is possible to calculate the dot product before
the nonlinear mapping is performed, on the original attribute set. A high-
dimensional version of the preceding equation is simply

x = b + \sum_{i \text{ is a support vector}} \alpha_i y_i \, (\mathbf{a}(i) \cdot \mathbf{a})^n,

where n is chosen as the number of factors in the transformation (three in the
example we used earlier). If you expand the term (a(i) · a)^n, you will find that it
contains all the high-dimensional terms that would have been involved if the
test and training vectors were first transformed by including all products of n
factors and the dot product was taken of the result. (If you actually do the cal-
culation, you will notice that some constant factors—binomial coefficients—
are introduced. However, these do not matter: it is the dimensionality of the
space that concerns us; the constants merely scale the axes.) Because of this
mathematical equivalence, the dot products can be computed in the original
low-dimensional space, and the problem becomes feasible. In implementation
terms, you take a software package for constrained quadratic optimization and
every time a(i) · a is evaluated you evaluate (a(i) · a)^n instead. It’s as simple as that,
because in both the optimization and the classification algorithms these vectors
are only ever used in this dot product form. The training vectors, including the
support vectors, and the test instance all remain in the original low-dimensional
space throughout the calculations.
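The following sketch (an illustrative example, not the book's code) checks this equivalence for a small case: two attributes and n = 2. Evaluating (a · b)^2 in the original space gives the same number as explicitly transforming both vectors into the space of all products of two factors and taking an ordinary dot product there; the factor of the square root of 2 in the transformation is the binomial coefficient that merely scales one axis.

// Illustrative sketch (assumed example): the polynomial kernel of degree 2 on
// two attributes equals the dot product of the explicitly transformed vectors
// (x1^2, sqrt(2)*x1*x2, x2^2).
public class PolynomialKernelDemo {

    // Polynomial kernel of degree n, evaluated in the original low-dimensional space.
    static double polyKernel(double[] a, double[] b, int n) {
        double dot = 0.0;
        for (int j = 0; j < a.length; j++) {
            dot += a[j] * b[j];
        }
        return Math.pow(dot, n);
    }

    // Explicit transformation into the space of all products of two factors;
    // sqrt(2) is the binomial coefficient that scales the cross-product axis.
    static double[] transform(double[] x) {
        return new double[] { x[0] * x[0], Math.sqrt(2) * x[0] * x[1], x[1] * x[1] };
    }

    public static void main(String[] args) {
        double[] a = { 1.0, 2.0 };
        double[] b = { 3.0, 0.5 };

        double kernelValue = polyKernel(a, b, 2);   // (1*3 + 2*0.5)^2 = 16
        double[] phiA = transform(a);
        double[] phiB = transform(b);
        double explicit = 0.0;
        for (int j = 0; j < phiA.length; j++) {
            explicit += phiA[j] * phiB[j];          // also 16, up to floating-point rounding
        }
        System.out.println(kernelValue + " = " + explicit);
    }
}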
The function (x · y)^n, which computes the dot product of two vectors x and
y and raises the result to the power n, is called a polynomial kernel. A good


