tions’ absolute error instead of the squared error used in linear regression.
(There are, however, versions of the algorithm that use the squared error
instead.)
A user-specified parameter ε defines a tube around the regression function
in which errors are ignored: for linear support vector regression, the tube is a
cylinder. If all training points can fit within a tube of width 2ε, the algorithm
outputs the function in the middle of the flattest tube that encloses them. In
this case the total perceived error is zero. Figure 6.9(a) shows a regression
problem with one attribute, a numeric class, and eight instances. In this case ε
was set to 1, so the width of the tube around the regression function (indicated
by dotted lines) is 2. Figure 6.9(b) shows the outcome of the learning process
when ε is set to 2. As you can see, the wider tube makes it possible to learn a
flatter function.
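
This behaviour can be reproduced with any support vector regression implementation. Below is a minimal sketch, assuming scikit-learn's SVR class and a made-up eight-instance dataset (neither comes from the book, whose own software is Weka): it fits a linear ε-tube for two values of ε and compares the slope of the resulting function.

import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression problem: one numeric attribute, eight instances
# (illustrative values only, not the data behind Figure 6.9).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.5, 2.0, 4.0, 3.5, 5.5, 5.0, 7.0, 6.5])

for eps in (1.0, 2.0):
    # Linear kernel: the epsilon-tube is a band around a straight line.
    # A very large C approximates the hard-tube case where all points fit inside.
    model = SVR(kernel="linear", epsilon=eps, C=1e6).fit(X, y)
    slope = model.coef_[0][0]            # flatness of the fitted function
    n_sv = len(model.support_vectors_)   # points on or outside the tube
    print(f"epsilon={eps}: slope={slope:.3f}, support vectors={n_sv}")

On data like this, the wider tube (larger ε) allows a smaller slope, i.e., a flatter function.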
The value of ε controls how closely the function will fit the training data. Too
large a value will produce a meaningless predictor—in the extreme case, when
2ε exceeds the range of class values in the training data, the regression line is
horizontal and the algorithm just predicts the mean class value. On the other
hand, for small values of ε there may be no tube that encloses all the data. In
that case some training points will have nonzero error, and there will be a
tradeoff between the prediction error and the tube's flatness. In Figure 6.9(c), ε was
set to 0.5 and there is no tube of width 1 that encloses all the data.
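
Both extremes can be seen in code. The sketch below again assumes scikit-learn's SVR and the same invented data as above: a very large ε makes the flattest enclosing tube horizontal, while a small ε leaves some training points outside the tube with nonzero error.

import numpy as np
from sklearn.svm import SVR

# Same invented eight-instance dataset as in the previous sketch.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.5, 2.0, 4.0, 3.5, 5.5, 5.0, 7.0, 6.5])

# 2*epsilon exceeds the range of class values: a horizontal tube encloses
# everything, so the flattest function has slope (close to) zero.
wide = SVR(kernel="linear", epsilon=4.0, C=1e6).fit(X, y)
print("slope with epsilon=4:", wide.coef_[0][0])

# Small epsilon: no tube of width 2*epsilon encloses all the data, so some
# training points end up outside the tube and contribute nonzero error.
narrow = SVR(kernel="linear", epsilon=0.25, C=1.0).fit(X, y)
outside = np.abs(narrow.predict(X) - y) > 0.25
print("points outside the tube:", int(outside.sum()))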
For the linear case, the support vector regression function can be written

x = b + Σ αᵢ a(i)·a,    where the sum runs over the support vectors.

As with classification, the dot product can be replaced by a kernel function for
nonlinear problems. The support vectors are all those points that do not fall
strictly within the tube—that is, the points outside the tube and on its border.
As with classification, all other points have coefficient 0 and can be deleted from
the training data without changing the outcome of the learning process. In
contrast to the classification case, the αᵢ may be negative.
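
The dual form above can be checked directly against a fitted model. The sketch below (again assuming scikit-learn's SVR and the toy data from earlier) rebuilds a prediction from the intercept b, the coefficients αᵢ, and the support vectors, and checks whether some of the αᵢ come out negative.

import numpy as np
from sklearn.svm import SVR

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.5, 2.0, 4.0, 3.5, 5.5, 5.0, 7.0, 6.5])

model = SVR(kernel="linear", epsilon=0.5, C=1.0).fit(X, y)

alphas = model.dual_coef_[0]      # the coefficients alpha_i (may be negative)
svs = model.support_vectors_      # points on or outside the tube
b = model.intercept_[0]

# Rebuild the prediction for a new instance a from the dual form:
#   x = b + sum over support vectors of alpha_i * (a(i) . a)
a = np.array([4.5])
manual = b + np.sum(alphas * (svs @ a))
print(manual, model.predict([[4.5]])[0])          # the two values should agree
print("any negative alpha_i:", bool((alphas < 0).any()))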
We have mentioned that as well as minimizing the error, the algorithm simul-
taneously tries to maximize the flatness of the regression function. In Figure
6.9(a) and (b), where there is a tube that encloses all the training data, the algo-
rithm simply outputs the flattest tube that does so. However, in Figure 6.9(c)
there is no tube with error 0, and a tradeoff is struck between the prediction
error and the tube’s flatness. This tradeoff is controlled by enforcing an upper
limit C on the absolute value of the coefficients αᵢ. The upper limit restricts the
influence of the support vectors on the shape of the regression function and is
a parameter that the user must specify in addition to ε. The larger C is, the more
closely the function can fit the data. In the degenerate case ε = 0 the algorithm
simply performs least-absolute-error regression under the coefficient size constraint.
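
The role of C can be illustrated with the same toy setup as before. In the sketch below (again assuming scikit-learn's SVR, not the book's software), ε is held at a small fixed value and C is varied; larger values of C allow the coefficients to grow, so the fitted function follows the data more closely at the expense of flatness.

import numpy as np
from sklearn.svm import SVR

# Same invented eight-instance dataset as in the earlier sketches.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.5, 2.0, 4.0, 3.5, 5.5, 5.0, 7.0, 6.5])

# With epsilon this small no tube encloses all the data, so C governs the
# trade-off between prediction error and flatness.
for C in (0.01, 1.0, 100.0):
    model = SVR(kernel="linear", epsilon=0.25, C=C).fit(X, y)
    err = np.mean(np.abs(model.predict(X) - y))
    print(f"C={C}: slope={model.coef_[0][0]:.3f}, mean absolute error={err:.3f}")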

