independently. One way to split a cluster is to make a new seed, one standard
deviation away from the cluster’s center in the direction of its greatest variation,
and to make a second seed the same distance in the opposite direction.
(Alternatively, if this is too slow, choose a distance proportional to the size of the
cluster’s bounding box and a random direction.) Then apply k-means to the points
in the cluster with these two new seeds.
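To make the splitting step concrete, here is a small sketch in Python (the function name, and the use of NumPy to obtain the direction of greatest variation as the principal eigenvector of the covariance matrix, are our own choices rather than a prescribed implementation):

```python
import numpy as np

def split_cluster(points, n_iter=10):
    """Tentatively split one cluster in two: seed 2-means with points one
    standard deviation either side of the center, along the direction of
    greatest variation (the principal eigenvector of the covariance)."""
    center = points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(points, rowvar=False))
    offset = np.sqrt(eigvals[-1]) * eigvecs[:, -1]   # one std. deviation
    seeds = np.vstack([center + offset, center - offset])
    for _ in range(n_iter):                          # plain 2-means
        dists = ((points[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for k in range(2):
            if (assign == k).any():
                seeds[k] = points[assign == k].mean(axis=0)
    return seeds, assign
```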
Having tentatively split a cluster, is it worthwhile retaining the split or is the
original cluster equally plausible by itself? It’s no good looking at the total
squared distance of all points to their cluster center—this is bound to be smaller
for two subclusters. A penalty should be incurred for inventing an extra cluster,
and this is a job for the MDL criterion. That principle can be applied to see
whether the information required to specify the two new cluster centers, along
with the information required to specify each point with respect to them,
exceeds the information required to specify the original center and all the points
with respect to it. If so, the new clustering is unproductive and should be
abandoned.
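The coding scheme behind the MDL comparison is left unspecified here. The sketch below substitutes a deliberately crude one, a fixed number of bits per center coordinate plus squared distances treated as Gaussian log-likelihoods, purely to show the shape of the test; both the function names and the coding model are our assumptions.

```python
import numpy as np

def description_length(points, centers, bits_per_coord=32):
    """Bits to transmit the centers, plus the points relative to their
    nearest center (assumed coding model: unit-variance Gaussian, so a
    point's cost is proportional to its squared distance, in bits)."""
    center_bits = centers.size * bits_per_coord
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    point_bits = 0.5 * d2.min(axis=1).sum() / np.log(2)
    return center_bits + point_bits

def split_is_worthwhile(points, old_center, new_centers):
    """Retain the split only if it shortens the total description."""
    return (description_length(points, new_centers) <
            description_length(points, old_center.reshape(1, -1)))
```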
If the split is retained, try splitting each new cluster further. Continue the
process until no worthwhile splits remain.
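Combining the two sketches above gives the whole process as a short recursion (hypothetical names again):

```python
def split_recursively(points, center):
    """Keep splitting while the MDL check says it is worthwhile;
    return the list of surviving cluster centers."""
    if len(points) < 4:              # too few points to split sensibly
        return [center]
    seeds, assign = split_cluster(points)
    if not split_is_worthwhile(points, center, seeds):
        return [center]              # the original cluster stands
    return (split_recursively(points[assign == 0], seeds[0]) +
            split_recursively(points[assign == 1], seeds[1]))
```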
Additional implementation efficiency can be achieved by combining this
iterative clustering process with the kD-tree or ball tree data structure advocated
in Section 4.8. Then, the data points are reached by working down the tree
from the root. When considering whether to split a cluster, there is no need to
examine the whole tree, only those parts of it that are needed to cover
the cluster. For example, when deciding whether to split the lower left cluster in
Figure 4.16(a) on page 140 (below the thick line), it is only necessary to con-
sider nodes A and B of the tree in Figure 4.16(b), because node C is irrelevant
to that cluster.
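In code, that pruning might look like the following sketch; the Node class, with its cached bounding box, stands in for the kD-tree structures of Section 4.8 and is our own simplification.

```python
import numpy as np

class Node:
    """A kD-tree node carrying its bounding box; leaves carry points."""
    def __init__(self, box_min, box_max, left=None, right=None, points=None):
        self.box_min, self.box_max = np.asarray(box_min), np.asarray(box_max)
        self.left, self.right, self.points = left, right, points

def nodes_covering(node, region_min, region_max):
    """Collect just the subtrees needed to cover a cluster's region,
    skipping branches whose boxes lie entirely outside it (as node C
    is skipped for the lower left cluster of Figure 4.16)."""
    if (node.box_max < region_min).any() or (node.box_min > region_max).any():
        return []                    # disjoint from the region: irrelevant
    if ((node.box_min >= region_min).all() and
            (node.box_max <= region_max).all()):
        return [node]                # wholly inside: take the subtree as-is
    if node.left is None and node.right is None:
        return [node]                # a leaf overlapping the region
    covered = []
    for child in (node.left, node.right):
        if child is not None:
            covered += nodes_covering(child, region_min, region_max)
    return covered
```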
Incremental clustering
Whereas the k-means algorithm iterates over the whole dataset until
convergence is reached, the clustering methods that we examine next work
incrementally, instance by instance. At any stage the clustering forms a tree with
instances at the leaves and a root node that represents the entire dataset. In the
beginning the tree consists of the root alone. Instances are added one by one,
and the tree is updated appropriately at each stage. Updating may merely be a
case of finding the right place to put a leaf representing the new instance, or it
may involve radically restructuring the part of the tree that is affected by the
new instance. The key to deciding how and where to update is a quantity called
category utility, which measures the overall quality of a partition of instances
into clusters. We defer detailed consideration of how this is defined until the
next subsection and look first at how the clustering algorithm works.
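To fix ideas before category utility is defined, the skeleton below shows the simple case of the insertion step: at each level the new instance either joins the existing child whose enlarged partition scores highest, or founds a new leaf of its own. The rendering is ours; category_utility is passed in as a black box, and the radical restructuring mentioned above is omitted.

```python
class TreeNode:
    """Leaves hold one instance; an internal node represents the
    cluster of all instances beneath it (the root holds them all)."""
    def __init__(self, instance=None):
        self.children = []
        self.instances = [] if instance is None else [instance]

def add_instance(node, instance, category_utility):
    """Insert one instance, choosing at each level between the existing
    children and a brand-new leaf, whichever partition scores highest."""
    node.instances.append(instance)
    if not node.children:
        if len(node.instances) > 1:      # a leaf gains a sibling: split it
            node.children = [TreeNode(i) for i in node.instances]
        return
    # One candidate: put the instance in a new leaf of its own.
    best = category_utility([c.instances for c in node.children] + [[instance]])
    best_child = None
    # Other candidates: put it in each existing child in turn.
    for child in node.children:
        trial = [c.instances + ([instance] if c is child else [])
                 for c in node.children]
        score = category_utility(trial)
        if score > best:
            best, best_child = score, child
    if best_child is None:
        node.children.append(TreeNode(instance))
    else:
        add_instance(best_child, instance, category_utility)
```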