
Choose the point in the ball that is farthest from its center, and then a second
point that is farthest from the first one. Assign all data points in the ball to the
closest one of these two cluster centers, then compute the centroid of each
cluster and the minimum radius required for it to enclose all the data points it
represents. This method has the merit that the cost of splitting a ball containing
n points is only linear in n. There are more elaborate algorithms that
produce tighter balls, but they require more computation. We will not describe
sophisticated algorithms for constructing ball trees or updating them
incrementally as new training instances are encountered.
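To make the split concrete, here is a minimal Python sketch of the linear-time
procedure just described. The names (split_ball, the list-of-tuples point
representation) are our own illustrative choices; the book itself gives no code.

import math

def split_ball(points):
    """Split a ball of points into two child balls in time linear in n.

    A minimal sketch of the procedure described above; the function
    and variable names are illustrative, not from the book.
    Assumes at least two distinct points.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def centroid(pts):
        return tuple(sum(coord) / len(pts) for coord in zip(*pts))

    center = centroid(points)
    p1 = max(points, key=lambda p: dist(p, center))  # farthest from the center
    p2 = max(points, key=lambda p: dist(p, p1))      # farthest from the first point

    # Assign every point to the closer of the two chosen centers.
    left = [p for p in points if dist(p, p1) <= dist(p, p2)]
    right = [p for p in points if dist(p, p1) > dist(p, p2)]

    # For each child, recompute the centroid and the minimum radius
    # needed to enclose all the points it represents.
    children = []
    for pts in (left, right):
        c = centroid(pts)
        r = max(dist(p, c) for p in pts)
        children.append((c, r, pts))
    return children

Each step is a single pass over the points, which is why the cost of the split
is only linear in n.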


Discussion


Nearest-neighbor instance-based learning is simple and often works very
well. One problem is that, in the method described previously, each attribute
has exactly the same influence on the decision, just as it does in the Naïve
Bayes method. Another problem is that the database can easily become
corrupted by noisy exemplars. One solution is to adopt the k-nearest-neighbor
strategy, where some fixed, small number k of nearest neighbors (say, five) are
located and used together to determine the class of the test instance through a
simple majority vote. (Note that we used k to denote the number of attributes
earlier; this is a different, independent usage.) Another way of proofing the
database against noise is to choose the exemplars that are added to it
selectively and judiciously; improved procedures, described in Chapter 6,
address these shortcomings.
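As an illustration of the k-nearest-neighbor majority vote, the following
Python sketch classifies a test instance by polling its k closest training
instances. It uses a naive linear scan for clarity, and names such as
knn_classify are hypothetical; a real implementation would find the neighbors
with a kD-tree or ball tree as described earlier in this section.

import math
from collections import Counter

def knn_classify(train, test_point, k=5):
    """Classify test_point by a majority vote among its k nearest
    training instances; train is a list of (point, label) pairs.

    A naive linear-scan sketch for clarity; in practice the neighbors
    would be found with a kD-tree or ball tree as described earlier.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbors = sorted(train, key=lambda pair: dist(pair[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]  # majority class; ties broken arbitrarily

For example:

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b")]
print(knn_classify(train, (0.5, 0.5), k=3))  # -> "a"

With k = 3 the three closest points all have class "a", so a single noisy
exemplar near the test instance cannot flip the decision on its own.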


Figure 4.15 Ruling out an entire ball (gray) based on a target point (star) and its current
nearest neighbor.
