for the numeric attributes. This involves an additional, outer level of search. For
example, it initially evaluates the log-likelihood for 2, 3, 5, 7, 10, 15, and 25 clus-
ters: after that, it fits a log-normal distribution to the resulting data and ran-
domly selects from it more values to try. As you might imagine, the overall
algorithm is extremely computationally intensive. In fact, the actual implementa-
tion starts with a prespecified time bound and continues to iterate as long as
time allows. Give it longer and the results may be better!
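To make this outer search concrete, here is a minimal sketch in Python. It is not the AutoClass implementation: the scoring function score(k, data), the seed values, and the time budget are assumptions, and score(k, data) is taken to return the log-likelihood of the best clustering found with k clusters. The loop evaluates the seed values, fits a log-normal distribution over the values of k tried so far (weighted by how well they scored), and keeps sampling fresh candidates from it until the time bound expires.

```python
import time
import numpy as np

def choose_num_clusters(data, score, seed_ks=(2, 3, 5, 7, 10, 15, 25),
                        time_budget=60.0, rng=None):
    """Outer search over the number of clusters. `score(k, data)` is assumed
    to return the log-likelihood of the best clustering found with k clusters."""
    if rng is None:
        rng = np.random.default_rng()
    results = {k: score(k, data) for k in seed_ks}       # initial evaluations
    deadline = time.time() + time_budget
    while time.time() < deadline:
        # Fit a log-normal over k to the values tried so far, weighting each
        # value by its (normalised) likelihood, then sample a new candidate.
        ks = np.array(sorted(results), dtype=float)
        w = np.exp(np.array([results[k] for k in sorted(results)])
                   - max(results.values()))
        w /= w.sum()
        mu = float(np.sum(w * np.log(ks)))
        sigma = float(np.sqrt(np.sum(w * (np.log(ks) - mu) ** 2))) or 1.0
        k_new = max(2, int(round(rng.lognormal(mu, sigma))))
        if k_new not in results:                         # only evaluate new k
            results[k_new] = score(k_new, data)
    return max(results, key=results.get)                 # best k found so far
```

In practice each call to score(k, data) would itself run the inner iterative search to convergence for that candidate k, which is exactly why the overall procedure is so expensive and why a time bound is needed.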
Discussion
The clustering methods that have been described produce different kinds of
output. All are capable of taking new data in the form of a test set and classify-
ing it according to clusters that were discovered by analyzing a training set.
However, the incremental clustering method is the only one that generates an
explicit knowledge structure that describes the clustering in a way that can be
visualized and reasoned about. The other algorithms produce clusters that could
be visualized in instance space if the dimensionality were not too high.
If a clustering method were used to label the instances of the training set with
cluster numbers, that labeled set could then be used to train a rule or decision
tree learner. The resulting rules or tree would form an explicit description of
the classes. A probabilistic clustering scheme could be used for the same
purpose, except that each instance would have multiple weighted labels and the
rule or decision tree learner would have to be able to cope with weighted
instances—as many can.
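As a concrete illustration of this labeling idea, the sketch below uses scikit-learn (an assumption; any clusterer and any tree learner would serve): k-means assigns each training instance a cluster number, and a shallow decision tree is then trained on those labels, yielding an explicit, readable description of the clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.random.rand(200, 4)              # stand-in for a numeric training set

# Label every training instance with the cluster it was assigned to ...
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# ... then learn an explicit, human-readable description of those clusters.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=[f"a{i}" for i in range(X.shape[1])]))
```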
Another application of clustering is to fill in any values of the attributes that
may be missing. For example, a statistical estimate can be made of the values
of an instance's unknown attributes, based on the class distribution for the
instance itself and on the values that those attributes take in other examples.
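A minimal sketch of this idea, assuming a Gaussian mixture as the probabilistic clusterer and scikit-learn as the toolkit (neither is prescribed by the text): cluster membership is estimated from the attributes that are present, and the missing value is filled in as a membership-weighted average of the per-cluster means of that attribute.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def impute_with_clusters(X_train, x_new, missing_idx, n_clusters=3):
    """Fill in attribute `missing_idx` of `x_new` from the cluster structure:
    membership is judged on the observed attributes, and the estimate is a
    membership-weighted average of the per-cluster means of that attribute."""
    observed = [j for j in range(X_train.shape[1]) if j != missing_idx]
    gmm = GaussianMixture(n_components=n_clusters, random_state=0)
    gmm.fit(X_train[:, observed])
    # Responsibilities (weighted cluster labels) for the incomplete instance.
    resp = gmm.predict_proba(x_new[observed].reshape(1, -1))[0]
    # Per-cluster means of the missing attribute, taken over the training data.
    train_resp = gmm.predict_proba(X_train[:, observed])          # (n, k)
    cluster_means = (train_resp.T @ X_train[:, missing_idx]) / train_resp.sum(axis=0)
    return float(resp @ cluster_means)
```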
All the clustering methods we have examined make a basic assumption of
independence among the attributes. AutoClass does allow the user to specify
in advance that two or more attributes are dependent and should be modeled
with a joint probability distribution. (There are restrictions, however: nominal
attributes may vary jointly, as may numeric attributes, but not both together.
Moreover, missing values for jointly varying attributes are not catered for.) It
may be advantageous to preprocess a dataset to make the attributes more inde-
pendent, using a statistical technique such as the principal components trans-
form described in Section 7.3. Note that joint variation that is specific to
particular classes will not be removed by such techniques; they only remove
overall joint variation that runs across all classes.
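The following sketch shows the kind of preprocessing suggested here, using scikit-learn's PCA as a stand-in for the principal components transform of Section 7.3. As the text notes, this removes only overall joint variation across all classes, not joint variation that is specific to particular classes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(300, 6)              # stand-in for the raw numeric attributes

# Decorrelate the attributes with a principal components transform, then
# cluster in the transformed space, where the attributes are uncorrelated
# overall and the independence assumption is less badly violated.
X_pc = PCA().fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pc)
```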
Our description of how to modify k-means to find a good value of k by
repeatedly splitting clusters and seeing whether the split is worthwhile follows