tions’ absolute error instead of the squared error used in linear regression.
(There are, however, versions of the algorithm that use the squared error
instead.)
A user-specified parameter ε defines a tube around the regression function
in which errors are ignored: for linear support vector regression, the tube is a
cylinder. If all training points can fit within a tube of width 2ε, the algorithm
outputs the function in the middle of the flattest tube that encloses them. In
this case the total perceived error is zero. Figure 6.9(a) shows a regression
problem with one attribute, a numeric class, and eight instances. In this case ε
was set to 1, so the width of the tube around the regression function (indicated
by dotted lines) is 2. Figure 6.9(b) shows the outcome of the learning process
when ε is set to 2. As you can see, the wider tube makes it possible to learn a
flatter function.
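
This behaviour can be reproduced with any support vector regression implementation. Below is a minimal sketch, assuming scikit-learn's SVR class and a made-up eight-instance dataset (neither comes from the book, whose own software is Weka): it fits a linear ε-tube for two values of ε and compares the slope of the resulting function.

import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression problem: one numeric attribute, eight instances
# (illustrative values only, not the data behind Figure 6.9).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.5, 2.0, 4.0, 3.5, 5.5, 5.0, 7.0, 6.5])

for eps in (1.0, 2.0):
    # Linear kernel: the epsilon-tube is a band around a straight line.
    # A very large C approximates the hard-tube case where all points fit inside.
    model = SVR(kernel="linear", epsilon=eps, C=1e6).fit(X, y)
    slope = model.coef_[0][0]            # flatness of the fitted function
    n_sv = len(model.support_vectors_)   # points on or outside the tube
    print(f"epsilon={eps}: slope={slope:.3f}, support vectors={n_sv}")

On data like this, the wider tube (larger ε) allows a smaller slope, i.e., a flatter function.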
The value of ε controls how closely the function will fit the training data. Too
large a value will produce a meaningless predictor—in the extreme case, when
2ε exceeds the range of class values in the training data, the regression line is
horizontal and the algorithm just predicts the mean class value. On the other
hand, for small values of ε there may be no tube that encloses all the data. In
that case some training points will have nonzero error, and there will be a
tradeoff between the prediction error and the tube's flatness. In Figure 6.9(c), ε was
set to 0.5 and there is no tube of width 1 that encloses all the data.
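
Both extremes can be seen in code. The sketch below again assumes scikit-learn's SVR and the same invented data as above: a very large ε makes the flattest enclosing tube horizontal, while a small ε leaves some training points outside the tube with nonzero error.

import numpy as np
from sklearn.svm import SVR

# Same invented eight-instance dataset as in the previous sketch.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.5, 2.0, 4.0, 3.5, 5.5, 5.0, 7.0, 6.5])

# 2*epsilon exceeds the range of class values: a horizontal tube encloses
# everything, so the flattest function has slope (close to) zero.
wide = SVR(kernel="linear", epsilon=4.0, C=1e6).fit(X, y)
print("slope with epsilon=4:", wide.coef_[0][0])

# Small epsilon: no tube of width 2*epsilon encloses all the data, so some
# training points end up outside the tube and contribute nonzero error.
narrow = SVR(kernel="linear", epsilon=0.25, C=1.0).fit(X, y)
outside = np.abs(narrow.predict(X) - y) > 0.25
print("points outside the tube:", int(outside.sum()))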
For the linear case, the support vector regression function can be written

x = b + Σ αᵢ a(i)·a,    where the sum runs over the support vectors.

As with classification, the dot product can be replaced by a kernel function for
nonlinear problems. The support vectors are all those points that do not fall
strictly within the tube—that is, the points outside the tube and on its border.
As with classification, all other points have coefficient 0 and can be deleted from
the training data without changing the outcome of the learning process. In
contrast to the classification case, the αᵢ may be negative.
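
The dual form above can be checked directly against a fitted model. The sketch below (again assuming scikit-learn's SVR and the toy data from earlier) rebuilds a prediction from the intercept b, the coefficients αᵢ, and the support vectors, and checks whether some of the αᵢ come out negative.

import numpy as np
from sklearn.svm import SVR

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.5, 2.0, 4.0, 3.5, 5.5, 5.0, 7.0, 6.5])

model = SVR(kernel="linear", epsilon=0.5, C=1.0).fit(X, y)

alphas = model.dual_coef_[0]      # the coefficients alpha_i (may be negative)
svs = model.support_vectors_      # points on or outside the tube
b = model.intercept_[0]

# Rebuild the prediction for a new instance a from the dual form:
#   x = b + sum over support vectors of alpha_i * (a(i) . a)
a = np.array([4.5])
manual = b + np.sum(alphas * (svs @ a))
print(manual, model.predict([[4.5]])[0])          # the two values should agree
print("any negative alpha_i:", bool((alphas < 0).any()))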
We have mentioned that as well as minimizing the error, the algorithm simul-
taneously tries to maximize the flatness of the regression function. In Figure
6.9(a) and (b), where there is a tube that encloses all the training data, the algo-
rithm simply outputs the flattest tube that does so. However, in Figure 6.9(c)
there is no tube with error 0, and a tradeoff is struck between the prediction
error and the tube’s flatness. This tradeoff is controlled by enforcing an upper
limit C on the absolute value of the coefficients αᵢ. The upper limit restricts the
influence of the support vectors on the shape of the regression function and is
a parameter that the user must specify in addition to ε. The larger C is, the more
closely the function can fit the data. In the degenerate case ε = 0 the algorithm
simply performs least-absolute-error regression under the coefficient size constraint.
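
The role of C can be illustrated with the same toy setup as before. In the sketch below (again assuming scikit-learn's SVR, not the book's software), ε is held at a small fixed value and C is varied; larger values of C allow the coefficients to grow, so the fitted function follows the data more closely at the expense of flatness.

import numpy as np
from sklearn.svm import SVR

# Same invented eight-instance dataset as in the earlier sketches.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.5, 2.0, 4.0, 3.5, 5.5, 5.0, 7.0, 6.5])

# With epsilon this small no tube encloses all the data, so C governs the
# trade-off between prediction error and flatness.
for C in (0.01, 1.0, 100.0):
    model = SVR(kernel="linear", epsilon=0.25, C=C).fit(X, y)
    err = np.mean(np.abs(model.predict(X) - y))
    print(f"C={C}: slope={model.coef_[0][0]:.3f}, mean absolute error={err:.3f}")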

