As we saw in Section 3.9, there are different ways in which the result of clustering can be expressed. The groups that are identified may be exclusive, so that any instance belongs in only one group. Or they may be overlapping, so that an instance may fall into several groups. Or they may be probabilistic, whereby an instance belongs to each group with a certain probability. Or they may be hierarchical, such that there is a crude division of instances into groups at the top level, and each of these groups is refined further, perhaps all the way down to individual instances. Really, the choice among these possibilities should be dictated by the nature of the mechanisms that are thought to underlie the particular clustering phenomenon. However, because these mechanisms are rarely known (the very existence of clusters is, after all, something that we're trying to discover), and for pragmatic reasons too, the choice is usually dictated by the clustering tools that are available.
We will examine an algorithm that forms clusters in numeric domains, partitioning instances into disjoint clusters. Like the basic nearest-neighbor method of instance-based learning, it is a simple and straightforward technique that has been used for several decades. In Chapter 6 we examine newer clustering methods that perform incremental and probabilistic clustering.
Iterative distance-based clustering
The classic clustering technique is called k-means. First, you specify in advance how many clusters are being sought: this is the parameter k. Then k points are chosen at random as cluster centers. All instances are assigned to their closest cluster center according to the ordinary Euclidean distance metric. Next the centroid, or mean, of the instances in each cluster is calculated: this is the "means" part. These centroids are taken to be new center values for their respective clusters. Finally, the whole process is repeated with the new cluster centers. Iteration continues until the same points are assigned to each cluster in consecutive rounds, at which stage the cluster centers have stabilized and will remain the same forever.
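To make the iteration concrete, here is a minimal sketch of k-means in Java for two-dimensional numeric instances, using squared Euclidean distance. The class and method names (KMeansSketch, kMeans) are illustrative only and not drawn from any particular library; real implementations add refinements such as smarter initialization and more careful handling of empty clusters.

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal k-means sketch: two-dimensional points, Euclidean distance, random initial centers. */
public class KMeansSketch {

    // Squared Euclidean distance between two 2-D points.
    static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Returns, for each instance, the index of the cluster it ends up in.
    static int[] kMeans(double[][] instances, int k, long seed) {
        Random random = new Random(seed);

        // Choose k instances at random as the initial cluster centers.
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            centers[j] = instances[random.nextInt(instances.length)].clone();
        }

        int[] assignment = new int[instances.length];
        Arrays.fill(assignment, -1);   // no instance assigned yet
        boolean changed = true;

        while (changed) {              // stop when no assignment changes between rounds
            changed = false;

            // Assign every instance to its closest cluster center.
            for (int i = 0; i < instances.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++) {
                    if (squaredDistance(instances[i], centers[j])
                            < squaredDistance(instances[i], centers[best])) {
                        best = j;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }

            // Recompute each center as the centroid (mean) of its cluster.
            for (int j = 0; j < k; j++) {
                double sumX = 0, sumY = 0;
                int count = 0;
                for (int i = 0; i < instances.length; i++) {
                    if (assignment[i] == j) {
                        sumX += instances[i][0];
                        sumY += instances[i][1];
                        count++;
                    }
                }
                if (count > 0) {       // if a cluster is empty, leave its center where it is
                    centers[j] = new double[] { sumX / count, sumY / count };
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 1}, {1.5, 2}, {0.5, 1.5}, {8, 8}, {8.5, 9}, {9, 8.5} };
        System.out.println(Arrays.toString(kMeans(data, 2, 1)));
    }
}
```

Running the sketch with different random seeds will, in general, produce different final clusterings, which is exactly the sensitivity to the initial choice of centers discussed next.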
This clustering method is simple and effective. It is easy to prove that choosing the cluster center to be the centroid minimizes the total squared distance from each of the cluster's points to its center. Once the iteration has stabilized, each point is assigned to its nearest cluster center, so the overall effect is to minimize the total squared distance from all points to their cluster centers. But the minimum is a local one; there is no guarantee that it is the global minimum. The final clusters are quite sensitive to the initial cluster centers: completely different arrangements can arise from small changes in the initial random choice. In fact, this is true of all practical clustering techniques: it is almost always infeasible to find globally optimal clusters. To increase the chance of finding a global minimum, the procedure is often repeated several times with different initial cluster centers, keeping the result with the smallest total squared distance.
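One way to see why the centroid is the distance-minimizing choice of center: for a single cluster containing points $x_1, \ldots, x_n$, differentiating the total squared distance with respect to the center $c$ and setting the result to zero gives the mean:
\[
\frac{\partial}{\partial c} \sum_{i=1}^{n} \lVert x_i - c \rVert^2
  = -2 \sum_{i=1}^{n} (x_i - c) = 0
  \quad\Longrightarrow\quad
  c = \frac{1}{n} \sum_{i=1}^{n} x_i .
\]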