way of choosing the value of n is to start with 1 (a linear model) and increment it until the estimated error ceases to improve. Usually, quite small values suffice.
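To illustrate, here is a minimal sketch of this degree-selection loop, written with the scikit-learn library purely for illustration (it is not the software described in this book); the synthetic dataset and the cutoff of 10 degrees are arbitrary stand-ins for a real application.

# Increase the polynomial degree n from 1 until the cross-validated
# error estimate ceases to improve.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

best_degree, best_error = 1, float("inf")
for n in range(1, 11):
    clf = SVC(kernel="poly", degree=n, coef0=1)    # kernel (gamma * x.y + 1)^n
    error = 1.0 - cross_val_score(clf, X, y, cv=5).mean()
    if error >= best_error:                        # error has ceased to improve
        break
    best_degree, best_error = n, error
print("chosen degree:", best_degree)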
Other kernel functions can be used instead to implement different nonlinear
mappings. Two that are often suggested are the radial basis function (RBF) kernel
and the sigmoid kernel. Which one produces the best results depends on the
application, although the differences are rarely large in practice. It is interesting
to note that a support vector machine with the RBF kernel is simply a type of
neural network called an RBF network (which we describe later), and one with
the sigmoid kernel implements another type of neural network, a multilayer
perceptron with one hidden layer (also described later).
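In most support vector machine implementations the kernel is simply a parameter of the learner, so trying the alternatives on a given application is straightforward. The following sketch, again using scikit-learn for illustration with an arbitrary synthetic dataset, estimates the cross-validated accuracy of the polynomial, RBF, and sigmoid kernels.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
for kernel in ("poly", "rbf", "sigmoid"):
    # Same learner, different nonlinear mapping; compare by cross-validation.
    accuracy = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:8s} accuracy: {accuracy:.3f}")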
Throughout this section, we have assumed that the training data is linearly
separable—either in the instance space or in the new space spanned by the non-
linear mapping. It turns out that support vector machines can be generalized to
the case where the training data is not separable. This is accomplished by placing
an upper bound on the preceding coefficients α_i. Unfortunately, this parameter
must be chosen by the user, and the best setting can only be determined by
experimentation. Also, in all but trivial cases, it is not possible to determine a
priori whether the data is linearly separable or not.
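In most libraries this upper bound appears as a parameter conventionally called C, and the experimentation can take the form of a grid search over a cross-validated error estimate. The following sketch shows one way to do this with scikit-learn; the candidate values of C and the synthetic dataset are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each coefficient alpha_i is bounded above by C; search for a good bound
# using a cross-validated error estimate.
search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"])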
Finally, we should mention that compared with other methods such as deci-
sion tree learners, even the fastest training algorithms for support vector
machines are slow when applied in the nonlinear setting. On the other hand,
they often produce very accurate classifiers because subtle and complex deci-
sion boundaries can be obtained.
Support vector regression
The concept of a maximum margin hyperplane only applies to classification.
However, support vector machine algorithms have been developed for numeric
prediction that share many of the properties encountered in the classification
case: they produce a model that can usually be expressed in terms of a few
support vectors and can be applied to nonlinear problems using kernel func-
tions. As with regular support vector machines, we will describe the concepts
involved but do not attempt to describe the algorithms that actually perform the
work.
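As a concrete illustration, the sketch below fits a kernel-based support vector regression model with scikit-learn (again, not the book's own software). The epsilon parameter plays the role of the user-specified deviation discussed next, C trades prediction error against flatness, and the noisy sine data is an arbitrary stand-in for a real regression problem; the model's predictions depend only on the support vectors reported at the end.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Deviations smaller than epsilon are ignored when fitting.
reg = SVR(kernel="rbf", epsilon=0.1, C=1.0)
reg.fit(X, y)
print("support vectors used:", len(reg.support_))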
As with linear regression, covered in Section 4.6, the basic idea is to find a
function that approximates the training points well by minimizing the predic-
tion error. The crucial difference is that all deviations up to a user-specified
parameter ε are simply discarded. Also, when minimizing the error, the risk of
overfitting is reduced by simultaneously trying to maximize the flatness of the
function. Another difference is that what is minimized is normally the predic-