
extreme case, when some attributes are completely irrelevant—because all
attributes contribute equally to the distance formula. Fourth, it does not
perform explicit generalization, although we intimated in Section 3.8 (and illustrated in Figure 3.8) that some instance-based learning systems do indeed perform explicit generalization.

Reducing the number of exemplars


The plain nearest-neighbor rule stores a lot of redundant exemplars: it is almost
always completely unnecessary to save all the examples seen so far. A simple
variant is to classify each example with respect to the examples already seen and
to save only ones that are misclassified. We use the term exemplars to refer to the already-seen instances that are used for classification. Discarding correctly classified instances reduces the number of exemplars and proves to be an effective way to prune the exemplar database. Ideally, only a single exemplar is stored
for each important region of the instance space. However, early in the learning
process examples may be discarded that later turn out to be important, possibly leading to some decrease in predictive accuracy. As the number of stored
instances increases, the accuracy of the model improves, and so the system
makes fewer mistakes.
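
To make the idea concrete, here is a minimal sketch of this incremental strategy in Python (the function and variable names are illustrative, not taken from any particular system): each training example is classified using the exemplars stored so far, and is added to the store only if that classification is wrong.

    import math

    def distance(a, b):
        # Euclidean distance between two numeric attribute vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def nearest_class(exemplars, x):
        # exemplars is a list of (attribute_vector, class_label) pairs
        return min(exemplars, key=lambda e: distance(e[0], x))[1]

    def reduce_exemplars(training_stream):
        exemplars = []
        for x, y in training_stream:
            # keep an example only if the stored exemplars misclassify it
            if not exemplars or nearest_class(exemplars, x) != y:
                exemplars.append((x, y))
        return exemplars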
Unfortunately, the strategy of only storing misclassified instances does not
work well in the face of noise. Noisy examples are very likely to be misclassified,
and so the set of stored exemplars tends to accumulate those that are least useful.
This effect is easily observed experimentally. Thus this strategy is only a stepping-stone on the way toward more effective instance-based learners.

Pruning noisy exemplars


Noisy exemplars inevitably lower the performance of any nearest-neighbor
scheme that does not suppress them because they have the effect of repeatedly
misclassifying new instances. There are two ways of dealing with this. One is to
locate, instead of the single nearest neighbor, the k nearest neighbors for some predetermined constant k and assign the majority class to the unknown instance. The only problem here is determining a suitable value of k. Plain nearest-neighbor learning corresponds to k = 1. The more noise, the greater the optimal value of k. One way to proceed is to perform cross-validation tests with
different values and choose the best. Although this is expensive in computation
time, it often yields excellent predictive performance.
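
The following sketch shows one way such a search might be carried out, assuming scikit-learn is available; the candidate values of k and the number of folds are arbitrary illustrative choices.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def choose_k(X, y, candidates=(1, 3, 5, 7, 9), folds=10):
        # estimate the accuracy of k-nearest-neighbor classification by
        # cross-validation for each candidate and return the best k
        scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                     X, y, cv=folds).mean()
                  for k in candidates}
        return max(scores, key=scores.get)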
A second solution is to monitor the performance of each exemplar that is
stored and discard ones that do not perform well. This can be done by keeping
a record of the number of correct and incorrect classification decisions that each
exemplar makes. Two predetermined thresholds are set on the success ratio.
When an exemplar’s performance drops below the lower one, it is deleted from the exemplar set.
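
As a rough sketch of this bookkeeping (the threshold values, and the assumption that the upper threshold decides whether an exemplar takes part in classification, are illustrative simplifications rather than the exact mechanism of any particular scheme):

    LOWER_THRESHOLD = 0.5    # hypothetical value: below this, delete the exemplar
    UPPER_THRESHOLD = 0.75   # hypothetical value: above this, trust the exemplar

    class Exemplar:
        def __init__(self, attributes, label):
            self.attributes, self.label = attributes, label
            self.correct, self.incorrect = 0, 0

        def success_ratio(self):
            total = self.correct + self.incorrect
            return self.correct / total if total else 0.0

    def record_decision(exemplars, exemplar, was_correct):
        # update the exemplar's record and prune it if it performs poorly
        if was_correct:
            exemplar.correct += 1
        else:
            exemplar.incorrect += 1
        if exemplar.success_ratio() < LOWER_THRESHOLD:
            exemplars.remove(exemplar)

    def is_reliable(exemplar):
        # assumption: only exemplars above the upper threshold are used
        # when classifying new instances
        return exemplar.success_ratio() > UPPER_THRESHOLD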
