Data Mining: Practical Machine Learning Tools and Techniques, Second Edition


6.6 CLUSTERING


the first five instances, there is no such host: it is better, in terms of category
utility, to form a new leaf for each instance. With the sixth it finally becomes
beneficial to form a cluster, joining the new instance f with the old one—the
host—e. If you look back at Table 4.6 (page 103) you will see that the fifth and
sixth instances are indeed very similar, differing only in the windy attribute (and
play, which is being ignored here). The next example, g, is placed in the same
cluster (it differs from e only in outlook). This involves another call to the
clustering procedure. First, g is evaluated to see which of the five children of the
root makes the best host; it turns out to be the rightmost, the one that is already
a cluster. Then the clustering algorithm is invoked with this as the root, and its
two children are evaluated to see which would make the better host. In this case
it proves best, according to the category utility measure, to add the new instance
as a subcluster in its own right.
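All of these hosting decisions are driven by the category utility measure. As a reminder, the standard definition used for this measure (for nominal attributes a_i with values v_ij and clusters C_1, ..., C_k) is:

```latex
\mathrm{CU}(C_1, C_2, \ldots, C_k) =
  \frac{1}{k}\sum_{l=1}^{k}\Pr[C_l]
  \sum_{i}\sum_{j}\Bigl(\Pr[a_i = v_{ij}\mid C_l]^2 - \Pr[a_i = v_{ij}]^2\Bigr)
```

Each inner term measures how much better the value of an attribute can be predicted within cluster C_l than across the data as a whole, and the division by k penalizes partitions with many clusters. This trade-off is why a new leaf becomes worthwhile only once no existing host fits the instance well.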
If we were to continue in this vein, there would be no possibility of any radical
restructuring of the tree, and the final clustering would be excessively dependent
on the ordering of examples. To avoid this, there is provision for restructuring,
and you can see it come into play when instance h is added in the next step
shown in Figure 6.17. In this case two existing nodes are merged into a single
cluster: nodes a and d are merged before the new instance h is added. One way
of accomplishing this would be to consider all pairs of nodes for merging and
evaluate the category utility of each pair. However, that would be computationally
expensive and would involve a lot of repeated work if it were undertaken
whenever a new instance was added.
Instead, whenever the nodes at a particular level are scanned for a suitable
host, both the best-matching node—the one that produces the greatest category
utility for the split at that level—and the runner-up are noted. The best one will
form the host for the new instance (unless that new instance is better off in a
cluster of its own). However, before setting to work on putting the new instance
in with the host, consideration is given to merging the host and the runner-up.
In this case, a is the preferred host and d is the runner-up. When a merge of a
and d is evaluated, it turns out that it would improve the category utility
measure. Consequently, these two nodes are merged, yielding a version of the
fifth hierarchy of Figure 6.17 before h is added. Then, consideration is given to
the placement of h in the new, merged node; and it turns out to be best to make
it a subcluster in its own right, as shown.
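The host-selection step just described can be sketched as follows. This is an illustrative reconstruction, not the book's code: the instance representation (dicts of nominal attribute values), the flat-list cluster representation (rather than a tree), and all function names are assumptions, and the option of making the new instance a leaf of its own is omitted for brevity.

```python
def _predictability(insts):
    """Sum over attributes i and values j of Pr[a_i = v_ij]^2
    within a group of instances (dicts of nominal attribute values)."""
    m = len(insts)
    counts = {}
    for inst in insts:
        for attr_val in inst.items():
            counts[attr_val] = counts.get(attr_val, 0) + 1
    return sum((c / m) ** 2 for c in counts.values())

def category_utility(clusters):
    """Category utility of a partition: average per-cluster gain in
    attribute-value predictability over the data as a whole."""
    instances = [x for c in clusters for x in c]
    n, k = len(instances), len(clusters)
    base = _predictability(instances)
    return sum(len(c) / n * (_predictability(c) - base)
               for c in clusters) / k

def best_and_runner_up(children, instance):
    """Indices of the best host and the runner-up: tentatively place
    the new instance in each child and score the resulting partition."""
    scored = []
    for i in range(len(children)):
        trial = [c + [instance] if j == i else list(c)
                 for j, c in enumerate(children)]
        scored.append((category_utility(trial), i))
    scored.sort(reverse=True)
    return scored[0][1], scored[1][1]

def place_with_merge(children, instance):
    """Place the instance with the best host, unless merging the host
    and runner-up first yields higher category utility."""
    best, runner = best_and_runner_up(children, instance)
    plain = [c + [instance] if i == best else list(c)
             for i, c in enumerate(children)]
    merged = [list(c) for i, c in enumerate(children)
              if i not in (best, runner)]
    merged.append(children[best] + children[runner] + [instance])
    if category_utility(merged) > category_utility(plain):
        return merged
    return plain
```

For example, with clusters [[red, red], [blue, blue], [green]] and a new red instance, the red cluster is the preferred host and the singleton green cluster is the runner-up; merging them would lower category utility, so the new instance simply joins the red cluster.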
An operation converse to merging is also implemented, called splitting,
although it does not take place in this particular example. Whenever the best
host is identified, and merging has not proved beneficial, consideration is given
to splitting the host node. Splitting has exactly the opposite effect of merging,
taking a node and replacing it with its children. For example, splitting the
rightmost node in the fourth hierarchy of Figure 6.17 would raise the e, f, and g
leaves up a level, making them siblings of a, b, c, and d. Merging and splitting provide
