an incremental way of restructuring the tree to compensate for incorrect choices
caused by infelicitous ordering of examples.
The final hierarchy for all 14 examples is shown at the end of Figure 6.17.
There are two major clusters, each of which subdivides further into its own
subclusters. If the play/don’t play distinction really represented an inherent
feature of the data, a single cluster would be expected for each outcome. No
such clean structure is observed, although a (very) generous eye might discern
a slight tendency at lower levels for yes instances to group together, and likewise for no instances. Careful analysis of the clustering reveals some anomalies.
(Table 4.6 will help if you want to follow this analysis in detail.) For example,
instances a and b are actually very similar to each other, yet they end up in completely different parts of the tree. Instance b ends up with k, which is a worse match than a. Instance a ends up with d and h, and it is certainly not as similar to d as it is to b. The reason why a and b become separated is that a and d get merged, as described previously, because they form the best and second-best hosts for h. It was unlucky that a and b were the first two examples: if either had occurred later, it may well have ended up with the other. Subsequent splitting and remerging may be able to rectify this anomaly, but in this case they didn’t.
Exactly the same scheme works for numeric attributes. Category utility is
defined for these as well, based on an estimate of the mean and standard devi-
ation of the value of that attribute. Details are deferred to the next subsection.
However, there is just one problem that we must attend to here: when estimat-
ing the standard deviation of an attribute for a particular node, the result will
be zero if the node contains only one instance, as it does more often than not.
Unfortunately, zero variances produce infinite values in the category utility
formula. A simple heuristic solution is to impose a minimum variance on each
attribute. It can be argued that because no measurement is completely precise,
it is reasonable to impose such a minimum: it represents the measurement error
in a single sample. This parameter is called acuity.
Figure 6.18(a) shows, at the top, a hierarchical clustering produced by the
incremental algorithm for part of the Iris dataset (30 instances, 10 from each
class). At the top level there are two clusters (i.e., subclusters of the single node
representing the whole dataset). The first contains both Iris virginicas and Iris versicolors, and the second contains only Iris setosas. The Iris setosas themselves
split into two subclusters, one with four cultivars and the other with six. The
other top-level cluster splits into three subclusters, each with a fairly complex
structure. Both the first and second contain only Iris versicolors, with one exception, a stray Iris virginica, in each case; the third contains only Iris virginicas.
This represents a fairly satisfactory clustering of the Iris data: it shows that the
three species are not artificial at all but reflect genuine differences in the data.
This is, however, a slightly overoptimistic conclusion, because quite a bit of
