Discussion
Support vector machines originated from research in statistical learning theory
(Vapnik 1999), and a good starting point for exploration is a tutorial by Burges
(1998). A general description, including generalization to the case in which the
data is not linearly separable, has been published by Cortes and Vapnik (1995).
We have introduced the standard version of support vector regression:
Schölkopf et al. (1999) present a different version that has one parameter instead
of two. Smola and Schölkopf (2004) provide an extensive tutorial on support
vector regression.
The (voted) kernel perceptron is due to Freund and Schapire (1999). Cristianini
and Shawe-Taylor (2000) provide a nice introduction to support vector
machines and other kernel-based methods, including the optimization theory
underlying the support vector learning algorithms. We have barely skimmed the
surface of these learning schemes, mainly because advanced mathematics lies
just beneath. The idea of using kernels to solve nonlinear problems has been
applied to many algorithms, for example, principal components analysis
(described in Section 7.3). A kernel is essentially a similarity function with
certain mathematical properties, and it is possible to define kernel functions
over all sorts of structures—for example, sets, strings, trees, and probability
distributions. Shawe-Taylor and Cristianini (2004) cover kernel-based learning
in detail.
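To make the idea concrete, here is a minimal sketch (ours, not from the text) of two kernel functions: the familiar radial basis function kernel over numeric vectors, and a simple intersection kernel over sets, illustrating that a kernel is just a similarity function defined on whatever structures the instances happen to be.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel over numeric vectors: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def set_kernel(s, t):
    """A simple kernel over sets: the number of shared elements.
    This is a valid kernel because it equals the dot product of the
    sets' indicator vectors."""
    return len(set(s) & set(t))

# Similarity between numeric instances
print(rbf_kernel([1.0, 2.0], [1.5, 1.8]))

# Similarity between set-valued instances (e.g., the words in two documents)
print(set_kernel({"data", "mining", "kernel"}, {"kernel", "methods"}))
```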
There is extensive literature on neural networks, and Bishop (1995) provides
an excellent introduction to both multilayer perceptrons and RBF networks.
Interest in neural networks appears to have declined since the arrival of support
vector machines, perhaps because the latter generally require fewer parameters
to be tuned to achieve the same (or greater) accuracy. However, multilayer
perceptrons have the advantage that they can learn to ignore irrelevant attributes,
and RBF networks trained using k-means can be viewed as a quick-and-dirty
method for finding a nonlinear classifier.
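As a rough illustration of this last point, the following sketch (our own construction, using scikit-learn rather than anything described in this chapter) builds such an RBF network by hand: k-means supplies the centers of the basis functions, each instance is re-expressed as its Gaussian similarity to those centers, and an ordinary linear classifier is trained on the transformed data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def rbf_features(X, centers, gamma=1.0):
    """Map each instance to its Gaussian similarity to every center."""
    # Squared Euclidean distance from every instance to every center
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# A toy nonlinear problem: the class depends on distance from the origin
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# Step 1: k-means chooses the centers of the radial basis functions
centers = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_

# Step 2: a linear classifier is trained on the RBF-transformed data
clf = LogisticRegression(max_iter=1000).fit(rbf_features(X, centers), y)
print("training accuracy:", clf.score(rbf_features(X, centers), y))
```

The k-means step is unsupervised, which is precisely what makes the construction quick and dirty: the centers are chosen without any reference to the class labels.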
6.4 Instance-based learning
In Section 4.7 we saw how the nearest-neighbor rule can be used to implement
a basic form of instance-based learning. There are several practical problems
with this simple method. First, it tends to be slow for large training sets, because
the entire set must be searched for each test instance—unless sophisticated data
structures such as kD-trees or ball trees are used. Second, it performs badly with
noisy data, because the class of a test instance is determined by its single nearest
neighbor without any “averaging” to help to eliminate noise. Third, it performs
badly when different attributes affect the outcome to different extents—in the