a linear threshold unit as a binary test of whether a linear function is greater or
less than zero and a linear machine as a set of linear functions, one for each class,
whose values for an unknown example are compared and the largest chosen as its
predicted class. In the distant past, perceptrons fell out of favor on publication
of an influential book that showed they had fundamental limitations (Minsky
and Papert 1969); however, more complex systems of linear functions have
enjoyed a resurgence in recent years in the form of neural networks, described
in Section 6.3. The Winnow algorithms were introduced by Nick Littlestone in
his PhD thesis in 1989 (Littlestone 1988, 1989). Multiresponse linear classifiers
have found a new application recently for an operation called stacking that
combines the output of other learning algorithms, described in Chapter 7 (see
Wolpert 1992). Friedman (1996) describes the technique of pairwise classification,
Fürnkranz (2002) further analyzes it, and Hastie and Tibshirani (1998)
extend it to estimate probabilities using pairwise coupling.
Fix and Hodges (1951) performed the first analysis of the nearest-neighbor
method, and Johns (1961) pioneered its use in classification problems. Cover
and Hart (1967) obtained the classic theoretical result that, for large enough
datasets, its probability of error never exceeds twice the theoretical minimum;
Devroye et al. (1996) showed that k-nearest neighbor is asymptotically optimal
for large k and n with k/n → 0. Nearest-neighbor methods gained popularity in
machine learning through the work of Aha (1992), who showed that instance-
based learning can be combined with noisy exemplar pruning and attribute
weighting and that the resulting methods perform well in comparison with
other learning methods. We take this up again in Chapter 6.
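For reference, the two asymptotic results cited above can be stated compactly. This is only a sketch in our own notation, where \(R^*\) denotes the Bayes error rate (the minimum achievable error) and \(R_{k\mathrm{NN}}\) the asymptotic error of the k-nearest-neighbor rule; the original papers give the precise conditions:
\[ R^* \;\le\; R_{1\mathrm{NN}} \;\le\; 2R^*(1 - R^*) \;\le\; 2R^* \qquad \text{(Cover and Hart 1967, two-class case)}, \]
\[ R_{k\mathrm{NN}} \to R^* \quad \text{as } n \to \infty,\; k \to \infty,\; k/n \to 0 \qquad \text{(Devroye et al. 1996)}. \]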
The kD-tree data structure was developed by Friedman et al. (1977). Our
description closely follows the explanation given by Andrew Moore in his PhD
thesis (Moore 1991); Moore, along with Omohundro (1987), pioneered the use of
kD-trees in machine learning. Moore (2000) describes sophisticated ways of constructing
ball trees that perform well even with thousands of attributes. We took our ball
tree example from lecture notes by Alexander Gray of Carnegie-Mellon University.
The voting feature intervals method mentioned in the Discussion subsection
at the end of Section 4.7 is described by Demiroz and Guvenir (1997).
The k-means algorithm is a classic technique, and many descriptions and
variations are available (e.g., see Hartigan 1975). The clever use of kD-trees to
speed up k-means clustering, which we chose to illustrate using ball trees
instead, was pioneered by Moore and Pelleg (2000) in their X-means clustering
algorithm. That algorithm also contains some other innovations, described in
Section 6.6.