minimum, people often run the algorithm several times with different initial
choices and choose the best final result—the one with the smallest total squared
distance.
It is easy to imagine situations in which k-means fails to find a good cluster-
ing. Consider four instances arranged at the vertices of a rectangle in two-
dimensional space. There are two natural clusters, formed by grouping together
the two vertices at either end of a short side. But suppose that the two initial
cluster centers happen to fall at the midpoints of the long sides. This forms a
stable configuration. The two clusters each contain the two instances at either
end of a long side—no matter how great the difference between the long and
the short sides.
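
To make the restart strategy concrete, here is a minimal sketch (in Java; the class KMeansRestarts and all names are invented for illustration, and this is not Weka's implementation) that runs plain k-means ten times on the four corners of a 4-by-1 rectangle, starting each run from two randomly chosen data points. Unlucky initializations converge to the poorer clustering that pairs the endpoints of each long side, so keeping the run with the smallest total squared distance matters:

    import java.util.Random;

    /** Minimal k-means with random restarts on the rectangle example (k = 2). */
    public class KMeansRestarts {

        // Four instances at the corners of a 4 x 1 rectangle; the natural clusters
        // pair the two points at either end of each *short* side.
        static final double[][] DATA = { {0, 0}, {0, 1}, {4, 0}, {4, 1} };

        public static void main(String[] args) {
            Random rnd = new Random(42);
            double best = Double.POSITIVE_INFINITY;
            for (int run = 0; run < 10; run++) {
                // Initialize the two centers by picking distinct data points at random.
                int i = rnd.nextInt(DATA.length), j;
                do { j = rnd.nextInt(DATA.length); } while (j == i);
                double[][] centers = { DATA[i].clone(), DATA[j].clone() };
                double cost = kMeans(DATA, centers);
                best = Math.min(best, cost);
                System.out.printf("run %d: total squared distance = %.3f%n", run, cost);
            }
            System.out.printf("best of all runs: %.3f%n", best);
        }

        /** Iterate assignment and centroid update until the cost stops improving. */
        static double kMeans(double[][] data, double[][] centers) {
            double prev = Double.POSITIVE_INFINITY;
            while (true) {
                int[] assign = new int[data.length];
                double cost = 0;
                for (int p = 0; p < data.length; p++) {
                    int bestC = 0;
                    double bestD = sqDist(data[p], centers[0]);
                    for (int c = 1; c < centers.length; c++) {
                        double d = sqDist(data[p], centers[c]);
                        if (d < bestD) { bestD = d; bestC = c; }
                    }
                    assign[p] = bestC;
                    cost += bestD;
                }
                if (cost >= prev) return cost;   // converged: no further improvement
                prev = cost;
                // Move each center to the centroid of the points assigned to it.
                for (int c = 0; c < centers.length; c++) {
                    double[] sum = new double[centers[c].length];
                    int count = 0;
                    for (int p = 0; p < data.length; p++) {
                        if (assign[p] == c) {
                            for (int d = 0; d < sum.length; d++) sum[d] += data[p][d];
                            count++;
                        }
                    }
                    if (count > 0)
                        for (int d = 0; d < sum.length; d++) centers[c][d] = sum[d] / count;
                }
            }
        }

        static double sqDist(double[] a, double[] b) {
            double s = 0;
            for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
            return s;
        }
    }

For this rectangle the good clustering has total squared distance 1 (each point lies 0.5 from its center), whereas the stable long-side configuration scores 16, so even a handful of restarts is enough to tell them apart.
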

Faster distance calculations


The k-means clustering algorithm usually requires several iterations, each
involving finding the distance of k cluster centers from every instance to deter-
mine its cluster. There are simple approximations that speed this up consider-
ably. For example, you can project the dataset and make cuts along selected axes,
instead of using the arbitrary hyperplane divisions that are implied by choos-
ing the nearest cluster center. But this inevitably compromises the quality of the
resulting clusters.
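
As one concrete, deliberately crude reading of this idea, the sketch below (in Java; the names are invented and other ways of choosing the cuts are possible) assigns points to centers using a single projection axis, with cuts placed halfway between consecutive center projections. Assignment becomes a binary search over the cuts instead of k distance calculations, but a point near a cut can easily receive a center that is not its true nearest, which is the quality compromise just mentioned:

    import java.util.Arrays;

    /** Crude "axis cut" assignment: divide one axis into slabs at the midpoints
     *  between consecutive center projections and assign each point by its slab. */
    public class AxisCutAssignment {

        /** Assign each point to a center using only the coordinate along 'axis'. */
        static int[] assignByAxisCuts(double[][] points, double[][] centers, int axis) {
            // Order the centers along the chosen axis.
            Integer[] order = new Integer[centers.length];
            for (int c = 0; c < centers.length; c++) order[c] = c;
            Arrays.sort(order, (a, b) -> Double.compare(centers[a][axis], centers[b][axis]));

            // Cut points halfway between consecutive center projections.
            double[] cuts = new double[centers.length - 1];
            for (int c = 0; c + 1 < centers.length; c++)
                cuts[c] = (centers[order[c]][axis] + centers[order[c + 1]][axis]) / 2;

            int[] assignment = new int[points.length];
            for (int p = 0; p < points.length; p++) {
                int slab = Arrays.binarySearch(cuts, points[p][axis]);
                if (slab < 0) slab = -slab - 1;      // insertion point = slab index
                assignment[p] = order[slab];
            }
            return assignment;
        }
    }

However the cuts are chosen, the resulting regions are axis-parallel slabs or boxes rather than the true regions implied by nearest-center assignment, which is why the clusters suffer.
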
Here’s a better way of speeding things up. Finding the closest cluster center
is not so different from finding nearest neighbors in instance-based learning.
Can the same efficient solutions—kD-trees and ball trees—be used? Yes! Indeed
they can be applied in an even more efficient way, because in each iteration of
k-means all the data points are processed together, whereas in instance-based
learning test instances are processed individually.
First, construct a kD-tree or ball tree for all the data points, which will remain
static throughout the clustering procedure. Each iteration of k-means produces
a set of cluster centers, and all data points must be examined and assigned to
the nearest center. One way of processing the points is to descend the tree from
the root until reaching a leaf and check each individual point in the leaf to find
its closest cluster center. But it may be that the region represented by a higher
interior node falls entirely within the domain of a single cluster center. In that
case all the data points under that node can be processed in one blow!
The aim of the exercise, after all, is to find new positions for the cluster centers
by calculating the centroid of the points they contain. The centroid can be cal-
culated by keeping a running vector sum of the points in the cluster, and a count
of how many there are so far. At the end, just divide one by the other to find
the centroid. Suppose that with each node of the tree we store the vector sum
of the points within that node and a count of the number of points. If the whole
node falls within the ambit of a single cluster, the running totals for that cluster
can be updated using the vector sum and count stored at the node.
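
The following sketch shows how this bookkeeping might look for a kD-tree whose nodes store a bounding box, a vector sum, and a point count (the Node and TreeKMeansStep classes and the ownership test are invented for illustration; this is not Weka's code). A node is handed to a single cluster only when even its farthest corner is closer to that center than its nearest corner is to every other center; otherwise the search recurses, assigning points individually at leaves that straddle a boundary:

    /** Sketch of pruned center assignment over a kD-tree whose nodes cache
     *  a bounding box, the vector sum of their points, and a point count. */
    class Node {
        double[] lo, hi;        // bounding box of the points below this node
        double[] sum;           // vector sum of those points
        int count;              // how many points there are
        Node left, right;       // children; null for a leaf
        double[][] points;      // the points themselves, stored only at leaves
    }

    public class TreeKMeansStep {

        /** Add every point under 'node' to the running totals of its nearest
         *  center; clusterSum and clusterCount are zeroed at the start of each
         *  k-means iteration. */
        static void assign(Node node, double[][] centers,
                           double[][] clusterSum, int[] clusterCount) {
            int owner = soleOwner(node, centers);
            if (owner >= 0) {
                // Whole node lies within one center's domain: use the cached totals.
                for (int d = 0; d < node.sum.length; d++) clusterSum[owner][d] += node.sum[d];
                clusterCount[owner] += node.count;
            } else if (node.left != null) {
                assign(node.left, centers, clusterSum, clusterCount);
                assign(node.right, centers, clusterSum, clusterCount);
            } else {
                // Leaf that straddles a boundary: assign each point individually.
                for (double[] p : node.points) {
                    int best = 0;
                    for (int c = 1; c < centers.length; c++)
                        if (sqDist(p, centers[c]) < sqDist(p, centers[best])) best = c;
                    for (int d = 0; d < p.length; d++) clusterSum[best][d] += p[d];
                    clusterCount[best]++;
                }
            }
        }

        /** Index of the center that is nearest for every point in the node's box,
         *  or -1 if no single center can be shown to own the whole box. */
        static int soleOwner(Node node, double[][] centers) {
            for (int c = 0; c < centers.length; c++) {
                boolean owns = true;
                double worst = maxSqDistToBox(centers[c], node.lo, node.hi);
                for (int other = 0; other < centers.length && owns; other++)
                    if (other != c && minSqDistToBox(centers[other], node.lo, node.hi) < worst)
                        owns = false;
                if (owns) return c;
            }
            return -1;
        }

        /** Largest squared distance from 'center' to any point of the box [lo, hi]. */
        static double maxSqDistToBox(double[] center, double[] lo, double[] hi) {
            double s = 0;
            for (int d = 0; d < center.length; d++) {
                double far = Math.max(Math.abs(center[d] - lo[d]), Math.abs(center[d] - hi[d]));
                s += far * far;
            }
            return s;
        }

        /** Smallest squared distance from 'center' to any point of the box [lo, hi]. */
        static double minSqDistToBox(double[] center, double[] lo, double[] hi) {
            double s = 0;
            for (int d = 0; d < center.length; d++) {
                double clamped = Math.min(Math.max(center[d], lo[d]), hi[d]);
                double diff = center[d] - clamped;
                s += diff * diff;
            }
            return s;
        }

        static double sqDist(double[] a, double[] b) {
            double s = 0;
            for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
            return s;
        }
    }

Once the whole tree has been processed, the new position of each center is simply its accumulated vector sum divided by its count, exactly the running totals described above.
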

