experimentation with the acuity parameter was necessary to obtain such a nice
division.
The clusterings produced by this scheme contain one leaf for every instance.
This produces an overwhelmingly large hierarchy for datasets of any reasonable
size, corresponding, in a sense, to overfitting the particular dataset. Consequently,
a second numeric parameter called cutoff is used to suppress growth.
Some instances are deemed to be sufficiently similar to others to not warrant
formation of their own child node, and this parameter governs the similarity
threshold. Cutoff is specified in terms of category utility: when the increase in
category utility from adding a new node is sufficiently small, that node is cut
off.
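In code, the cutoff amounts to a threshold on the gain in category utility (defined in the next section) from creating a new node. The sketch below is a minimal Python illustration, assuming a `category_utility` function that scores a candidate partition; all names are hypothetical and this is not Weka's actual Cobweb implementation.

```python
def place_with_cutoff(children, instance, cutoff, category_utility):
    """Decide whether `instance` deserves its own leaf under a node.

    `children` is a list of clusters (each a list of instances) and
    `category_utility` scores a candidate partition. Hypothetical
    names, for illustration only.
    """
    # Option 1: the instance forms a new singleton child node.
    with_new_leaf = children + [[instance]]
    if not children:
        return with_new_leaf

    # Option 2: the instance is absorbed into the best existing child.
    candidates = [
        [c + [instance] if j == i else c for j, c in enumerate(children)]
        for i in range(len(children))
    ]
    absorbed = max(candidates, key=category_utility)

    # The new node survives only if it raises category utility by at
    # least the cutoff; otherwise it is cut off and the instance is
    # merged into an existing child.
    if category_utility(with_new_leaf) - category_utility(absorbed) >= cutoff:
        return with_new_leaf
    return absorbed
```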
Figure 6.18(b) shows the same Iris data, clustered with cutoff in effect. Many
leaf nodes contain several instances: these are children of the parent node that
have been cut off. The division into the three types of iris is a little easier to see
from this hierarchy because some of the detail is suppressed. Again, however,
some experimentation with the cutoff parameter was necessary to get this result,
and in fact a sharper cutoff leads to much less satisfactory clusters.
Similar clusterings are obtained if the full Iris dataset of 150 instances is used.
However, the results depend on the ordering of examples: Figure 6.18 was
obtained by alternating the three varieties of iris in the input file. If all Iris setosas
are presented first, followed by all Iris versicolors and then all Iris virginicas, the
resulting clusters are quite unsatisfactory.

Category utility


Now we look at how the category utility, which measures the overall quality of
a partition of instances into clusters, is calculated. In Section 5.9 we learned how
the MDL measure could, in principle, be used to evaluate the quality of clustering.
Category utility is not MDL based but rather resembles a kind of quadratic loss
function defined on conditional probabilities.
The definition of category utility is rather formidable:

$$\mathrm{CU}(C_1, C_2, \ldots, C_k) = \frac{\sum_{\ell} \Pr[C_\ell] \sum_i \sum_j \bigl( \Pr[a_i = v_{ij} \mid C_\ell]^2 - \Pr[a_i = v_{ij}]^2 \bigr)}{k}$$

where $C_1, C_2, \ldots, C_k$ are the $k$ clusters; the outer summation is over these
clusters; the next inner one sums over the attributes; $a_i$ is the $i$th attribute, and it
takes on values $v_{i1}, v_{i2}, \ldots$, which are dealt with by the sum over $j$. Note that the
probabilities themselves are obtained by summing over all instances: thus there
is a further implied level of summation.
This expression makes a great deal of sense if you take the time to examine
it. The point of having a cluster is that it will give some advantage in predicting
the values of attributes of instances in that cluster.
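To make the calculation concrete, here is a minimal Python sketch (a hypothetical helper, not Weka code) that evaluates this formula for nominal attributes, estimating each probability from counts over the instances:

```python
from collections import Counter

def category_utility(clusters, attributes):
    """Category utility of a partition, for nominal attributes only.

    `clusters` is a list of clusters; each cluster is a list of
    instances, and each instance maps attribute name to value.
    (Numeric attributes need the normal-density variant the text
    goes on to describe.)
    """
    instances = [x for c in clusters for x in c]
    n, k = len(instances), len(clusters)

    def sum_sq_probs(insts):
        # sum over attributes i and values j of Pr[a_i = v_ij]^2,
        # estimated from the given set of instances
        total = 0.0
        for a in attributes:
            counts = Counter(x[a] for x in insts)
            total += sum((m / len(insts)) ** 2 for m in counts.values())
        return total

    baseline = sum_sq_probs(instances)  # unconditional Pr[a_i = v_ij]^2 terms
    return sum(
        (len(c) / n) * (sum_sq_probs(c) - baseline)  # weighted by Pr[C_l]
        for c in clusters
    ) / k
```

As a sanity check, for the trivial partition with a single cluster the conditional and unconditional probabilities coincide, so the category utility is zero: one all-embracing cluster confers no predictive advantage.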


