ing the values of attributes of instances in that cluster; that is, $\Pr[a_i = v_{ij} \mid C]$ is a better estimate of the probability that attribute $a_i$ has value $v_{ij}$, for an instance in cluster $C$, than $\Pr[a_i = v_{ij}]$ because it takes account of the cluster the instance is in. If that information doesn't help, the clusters aren't doing much good! So what the preceding measure calculates, inside the multiple summation, is the amount by which that information does help, in terms of the differences between squares of probabilities. This is not quite the standard squared-difference metric, because that sums the squares of the differences (which produces a symmetric result), whereas the present measure sums the difference of the squares (which, appropriately, does not produce a symmetric result). The differences between squares of probabilities are summed over all attributes, and all their possible values, in the inner double summation. Then it is summed over all clusters, weighted by their probabilities, in the outer summation.
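As a concrete illustration, here is a minimal Python sketch of this computation for nominal attributes. It assumes instances are represented as dictionaries mapping attribute names to values, and a clustering as a list of instance lists; the function names are ours for illustration, not Weka's.

    from collections import Counter

    def sum_sq_probs(instances, attributes):
        # Sum, over all attributes and their values, of squared value probabilities.
        total = 0.0
        for a in attributes:
            counts = Counter(inst[a] for inst in instances)
            total += sum((c / len(instances)) ** 2 for c in counts.values())
        return total

    def category_utility(clusters, attributes):
        # CU = (1/k) * sum_C Pr[C] * sum_i sum_j (Pr[a_i=v_ij|C]^2 - Pr[a_i=v_ij]^2)
        everything = [inst for cluster in clusters for inst in cluster]
        baseline = sum_sq_probs(everything, attributes)   # the Pr[a_i=v_ij]^2 terms
        k = len(clusters)
        cu = 0.0
        for cluster in clusters:
            weight = len(cluster) / len(everything)       # cluster weight Pr[C]
            cu += weight * (sum_sq_probs(cluster, attributes) - baseline)
        return cu / k                                     # the division by k discussed below

    # Two pure clusters on a single attribute: prints 0.25.
    clusters = [[{"colour": "red"}, {"colour": "red"}],
                [{"colour": "blue"}, {"colour": "blue"}]]
    print(category_utility(clusters, ["colour"]))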
The overall division by $k$ is a little hard to justify, because the squared differences have already been summed over the categories. It essentially provides a "per cluster" figure for the category utility that discourages overfitting. Otherwise, because the probabilities are derived by summing over the appropriate instances, the very best category utility would be obtained by placing each instance in its own cluster. Then $\Pr[a_i = v_{ij} \mid C]$ would be 1 for the value that attribute $a_i$ actually has for the single instance in category $C$ and 0 for all other values, and the numerator of the category utility formula will end up as

$$ n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2, $$

where $n$ is the total number of attributes. This is the greatest value that the numerator can have; and so if it were not for the additional division by $k$ in the category utility formula, there would never be any incentive to form clusters containing more than one member. This extra factor is best viewed as a rudimentary overfitting-avoidance heuristic.
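One way to see explicitly why this is the numerator's maximum: with singleton clusters, each attribute $a_i$ takes exactly one value in its cluster $C$ with conditional probability 1, so

$$ \sum_i \sum_j \Pr[a_i = v_{ij} \mid C]^2 = n \quad\text{for every } C, $$

and since the cluster weights $\Pr[C]$ sum to 1, the weighted outer sum of the differences of squares collapses to the expression above.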
This category utility formula applies only to nominal attributes. However, it can easily be extended to numeric attributes by assuming that their distribution is normal with a given (observed) mean $\mu$ and standard deviation $\sigma$. The probability density function for an attribute $a$ is

$$ f(a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(a - \mu)^2}{2\sigma^2} \right). $$

The analog of summing the squares of attribute–value probabilities is

$$ \sum_j \Pr[a_i = v_{ij}]^2 \;\Longleftrightarrow\; \int f(a_i)^2\, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}, $$

where $\sigma_i$ is the standard deviation of the attribute $a_i$.
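The closed form of the integral can be checked directly: squaring the density and applying the standard Gaussian integral $\int_{-\infty}^{\infty} \exp(-t^2/c^2)\, dt = c\sqrt{\pi}$ gives

$$ \int f(a)^2\, da = \frac{1}{2\pi\sigma^2} \int \exp\!\left( -\frac{(a - \mu)^2}{\sigma^2} \right) da = \frac{\sigma\sqrt{\pi}}{2\pi\sigma^2} = \frac{1}{2\sqrt{\pi}\,\sigma}. $$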
Thus for a numeric attribute, we estimate the standard deviation from the data, both within the cluster


