minimum, people often run the algorithm several times with different initial
choices and choose the best final result—the one with the smallest total squared
distance.
It is easy to imagine situations in which k-means fails to find a good cluster-
ing. Consider four instances arranged at the vertices of a rectangle in two-
dimensional space. There are two natural clusters, formed by grouping together
the two vertices at either end of a short side. But suppose that the two initial
cluster centers happen to fall at the midpoints of the long sides. This forms a
stable configuration. The two clusters each contain the two instances at either
end of a long side—no matter how great the difference between the long and
the short sides.
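
To make the restart strategy concrete, here is a minimal sketch (in Java; the class KMeansRestarts and all names are invented for illustration, and this is not Weka's implementation) that runs plain k-means ten times on the four corners of a 4-by-1 rectangle, starting each run from two randomly chosen data points. Unlucky initializations converge to the poorer clustering that pairs the endpoints of each long side, so keeping the run with the smallest total squared distance matters:

    import java.util.Random;

    /** Minimal k-means with random restarts on the rectangle example (k = 2). */
    public class KMeansRestarts {

        // Four instances at the corners of a 4 x 1 rectangle; the natural clusters
        // pair the two points at either end of each *short* side.
        static final double[][] DATA = { {0, 0}, {0, 1}, {4, 0}, {4, 1} };

        public static void main(String[] args) {
            Random rnd = new Random(42);
            double best = Double.POSITIVE_INFINITY;
            for (int run = 0; run < 10; run++) {
                // Initialize the two centers by picking distinct data points at random.
                int i = rnd.nextInt(DATA.length), j;
                do { j = rnd.nextInt(DATA.length); } while (j == i);
                double[][] centers = { DATA[i].clone(), DATA[j].clone() };
                double cost = kMeans(DATA, centers);
                best = Math.min(best, cost);
                System.out.printf("run %d: total squared distance = %.3f%n", run, cost);
            }
            System.out.printf("best of all runs: %.3f%n", best);
        }

        /** Iterate assignment and centroid update until the cost stops improving. */
        static double kMeans(double[][] data, double[][] centers) {
            double prev = Double.POSITIVE_INFINITY;
            while (true) {
                int[] assign = new int[data.length];
                double cost = 0;
                for (int p = 0; p < data.length; p++) {
                    int bestC = 0;
                    double bestD = sqDist(data[p], centers[0]);
                    for (int c = 1; c < centers.length; c++) {
                        double d = sqDist(data[p], centers[c]);
                        if (d < bestD) { bestD = d; bestC = c; }
                    }
                    assign[p] = bestC;
                    cost += bestD;
                }
                if (cost >= prev) return cost;   // converged: no further improvement
                prev = cost;
                // Move each center to the centroid of the points assigned to it.
                for (int c = 0; c < centers.length; c++) {
                    double[] sum = new double[centers[c].length];
                    int count = 0;
                    for (int p = 0; p < data.length; p++) {
                        if (assign[p] == c) {
                            for (int d = 0; d < sum.length; d++) sum[d] += data[p][d];
                            count++;
                        }
                    }
                    if (count > 0)
                        for (int d = 0; d < sum.length; d++) centers[c][d] = sum[d] / count;
                }
            }
        }

        static double sqDist(double[] a, double[] b) {
            double s = 0;
            for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
            return s;
        }
    }

For this rectangle the good clustering has total squared distance 1 (each point lies 0.5 from its center), whereas the stable long-side configuration scores 16, so even a handful of restarts is enough to tell them apart.
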

Faster distance calculations


The k-means clustering algorithm usually requires several iterations, each
involving finding the distance of k cluster centers from every instance to deter-
mine its cluster. There are simple approximations that speed this up consider-
ably. For example, you can project the dataset and make cuts along selected axes,
instead of using the arbitrary hyperplane divisions that are implied by choos-
ing the nearest cluster center. But this inevitably compromises the quality of the
resulting clusters.
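
As one concrete, deliberately crude reading of this idea, the sketch below (in Java; the names are invented and other ways of choosing the cuts are possible) assigns points to centers using a single projection axis, with cuts placed halfway between consecutive center projections. Assignment becomes a binary search over the cuts instead of k distance calculations, but a point near a cut can easily receive a center that is not its true nearest, which is the quality compromise just mentioned:

    import java.util.Arrays;

    /** Crude "axis cut" assignment: divide one axis into slabs at the midpoints
     *  between consecutive center projections and assign each point by its slab. */
    public class AxisCutAssignment {

        /** Assign each point to a center using only the coordinate along 'axis'. */
        static int[] assignByAxisCuts(double[][] points, double[][] centers, int axis) {
            // Order the centers along the chosen axis.
            Integer[] order = new Integer[centers.length];
            for (int c = 0; c < centers.length; c++) order[c] = c;
            Arrays.sort(order, (a, b) -> Double.compare(centers[a][axis], centers[b][axis]));

            // Cut points halfway between consecutive center projections.
            double[] cuts = new double[centers.length - 1];
            for (int c = 0; c + 1 < centers.length; c++)
                cuts[c] = (centers[order[c]][axis] + centers[order[c + 1]][axis]) / 2;

            int[] assignment = new int[points.length];
            for (int p = 0; p < points.length; p++) {
                int slab = Arrays.binarySearch(cuts, points[p][axis]);
                if (slab < 0) slab = -slab - 1;      // insertion point = slab index
                assignment[p] = order[slab];
            }
            return assignment;
        }
    }

However the cuts are chosen, the resulting regions are axis-parallel slabs or boxes rather than the true regions implied by nearest-center assignment, which is why the clusters suffer.
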
Here’s a better way of speeding things up. Finding the closest cluster center
is not so different from finding nearest neighbors in instance-based learning.
Can the same efficient solutions—kD-trees and ball trees—be used? Yes! Indeed
they can be applied in an even more efficient way, because in each iteration of
k-means all the data points are processed together, whereas in instance-based
learning test instances are processed individually.
First, construct a kD-tree or ball tree for all the data points, which will remain
static throughout the clustering procedure. Each iteration of k-means produces
a set of cluster centers, and all data points must be examined and assigned to
the nearest center. One way of processing the points is to descend the tree from
the root until reaching a leaf and check each individual point in the leaf to find
its closest cluster center. But it may be that the region represented by a higher
interior node falls entirely within the domain of a single cluster center. In that
case all the data points under that node can be processed in one blow!
The aim of the exercise, after all, is to find new positions for the cluster centers
by calculating the centroid of the points they contain. The centroid can be cal-
culated by keeping a running vector sum of the points in the cluster, and a count
of how many there are so far. At the end, just divide one by the other to find
the centroid. Suppose that with each node of the tree we store the vector sum
of the points within that node and a count of the number of points. If the whole
node falls within the ambit of a single cluster, the running totals for that cluster
can be updated using the vector sum and count stored at the node.
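
The following sketch shows how this bookkeeping might look for a kD-tree whose nodes store a bounding box, a vector sum, and a point count (the Node and TreeKMeansStep classes and the ownership test are invented for illustration; this is not Weka's code). A node is handed to a single cluster only when even its farthest corner is closer to that center than its nearest corner is to every other center; otherwise the search recurses, assigning points individually at leaves that straddle a boundary:

    /** Sketch of pruned center assignment over a kD-tree whose nodes cache
     *  a bounding box, the vector sum of their points, and a point count. */
    class Node {
        double[] lo, hi;        // bounding box of the points below this node
        double[] sum;           // vector sum of those points
        int count;              // how many points there are
        Node left, right;       // children; null for a leaf
        double[][] points;      // the points themselves, stored only at leaves
    }

    public class TreeKMeansStep {

        /** Add every point under 'node' to the running totals of its nearest
         *  center; clusterSum and clusterCount are zeroed at the start of each
         *  k-means iteration. */
        static void assign(Node node, double[][] centers,
                           double[][] clusterSum, int[] clusterCount) {
            int owner = soleOwner(node, centers);
            if (owner >= 0) {
                // Whole node lies within one center's domain: use the cached totals.
                for (int d = 0; d < node.sum.length; d++) clusterSum[owner][d] += node.sum[d];
                clusterCount[owner] += node.count;
            } else if (node.left != null) {
                assign(node.left, centers, clusterSum, clusterCount);
                assign(node.right, centers, clusterSum, clusterCount);
            } else {
                // Leaf that straddles a boundary: assign each point individually.
                for (double[] p : node.points) {
                    int best = 0;
                    for (int c = 1; c < centers.length; c++)
                        if (sqDist(p, centers[c]) < sqDist(p, centers[best])) best = c;
                    for (int d = 0; d < p.length; d++) clusterSum[best][d] += p[d];
                    clusterCount[best]++;
                }
            }
        }

        /** Index of the center that is nearest for every point in the node's box,
         *  or -1 if no single center can be shown to own the whole box. */
        static int soleOwner(Node node, double[][] centers) {
            for (int c = 0; c < centers.length; c++) {
                boolean owns = true;
                double worst = maxSqDistToBox(centers[c], node.lo, node.hi);
                for (int other = 0; other < centers.length && owns; other++)
                    if (other != c && minSqDistToBox(centers[other], node.lo, node.hi) < worst)
                        owns = false;
                if (owns) return c;
            }
            return -1;
        }

        /** Largest squared distance from 'center' to any point of the box [lo, hi]. */
        static double maxSqDistToBox(double[] center, double[] lo, double[] hi) {
            double s = 0;
            for (int d = 0; d < center.length; d++) {
                double far = Math.max(Math.abs(center[d] - lo[d]), Math.abs(center[d] - hi[d]));
                s += far * far;
            }
            return s;
        }

        /** Smallest squared distance from 'center' to any point of the box [lo, hi]. */
        static double minSqDistToBox(double[] center, double[] lo, double[] hi) {
            double s = 0;
            for (int d = 0; d < center.length; d++) {
                double clamped = Math.min(Math.max(center[d], lo[d]), hi[d]);
                double diff = center[d] - clamped;
                s += diff * diff;
            }
            return s;
        }

        static double sqDist(double[] a, double[] b) {
            double s = 0;
            for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
            return s;
        }
    }

Once the whole tree has been processed, the new position of each center is simply its accumulated vector sum divided by its count, exactly the running totals described above.
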

