weighted learning, primarily in the context of regression problems. Frank et al.
(2003) evaluated the use of locally weighted learning in conjunction with Naïve
Bayes.

6.6 Clustering


In Section 4.8 we examined the k-means clustering algorithm in which k initial
points are chosen to represent initial cluster centers, all data points are assigned
to the nearest one, the mean value of the points in each cluster is computed to
form its new cluster center, and iteration continues until there are no changes
in the clusters. This procedure only works when the number of clusters is known
in advance, and this section begins by describing what you can do if it is not.
Next we examine two techniques that do not partition instances into disjoint
clusters as k-means does. The first is an incremental clustering method that was
developed in the late 1980s and embodied in a pair of systems called Cobweb
(for nominal attributes) and Classit (for numeric attributes). Both come up with
a hierarchical grouping of instances and use a measure of cluster “quality” called
category utility. The second is a statistical clustering method based on a mixture
model of different probability distributions, one for each cluster. It assigns
instances to classes probabilistically, not deterministically. We explain the basic
technique and sketch the working of a comprehensive clustering scheme called
AutoClass.
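
As a refresher on the procedure just summarized, here is a minimal sketch of the k-means iteration in plain Python. It assumes instances are numeric tuples and uses squared Euclidean distance; the function name, the random choice of initial centers, and the tie-breaking behavior are illustrative choices, not taken from any particular implementation.

```python
import random

def kmeans(points, k, seed=0):
    """Plain k-means: returns (centers, assignments) for a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # k initial points act as cluster centers
    assignments = None
    while True:
        # Assign every point to its nearest center (squared Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((p[d] - centers[c][d]) ** 2
                                            for d in range(len(p))))
            for p in points
        ]
        if new_assignments == assignments:       # no change in the clusters: stop
            return centers, new_assignments
        assignments = new_assignments
        # Recompute each center as the mean of the points now assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                          # keep the old center if a cluster empties
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))

# Example: two obvious groups in the plane.
centers, labels = kmeans([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)], k=2)
```

The sketches later in this section build on this function.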

Choosing the number of clusters


Suppose you are using k-means but do not know the number of clusters in
advance. One solution is to try out different possibilities and see which is best—
that is, which one minimizes the total squared distance of all points to their
cluster center. A simple strategy is to start from a given minimum, perhaps
k = 1, and work up to a small fixed maximum, using cross-validation to find the
best value. Because k-means is slow, and cross-validation makes it even slower,
it will probably not be feasible to try many possible values for k. Note that on
the training data the “best” clustering according to the total squared distance
criterion will always be to choose as many clusters as there are data points! To
penalize solutions with many clusters you have to apply something like the MDL
criterion of Section 5.10, or use cross-validation.
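
One way to make the cross-validation strategy concrete is sketched below, reusing the kmeans() function from the earlier sketch: each candidate value of k is scored by the total squared distance of held-out points to the centers learned from the remaining data, and the value with the lowest cross-validated score wins. The fold scheme and the k_max bound are illustrative assumptions; applying an MDL-style penalty on the training data, as mentioned above, would be the alternative.

```python
import random

def total_squared_distance(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(sum((p[d] - c[d]) ** 2 for d in range(len(p))) for c in centers)
               for p in points)

def choose_k(points, k_max=10, folds=10, seed=0):
    """Pick k by cross-validation, assuming the kmeans() sketch given earlier."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    best_k, best_score = 1, float("inf")
    for k in range(1, k_max + 1):
        score = 0.0
        for f in range(folds):
            held_out = shuffled[f::folds]        # every folds-th point is held out
            train = [p for i, p in enumerate(shuffled) if i % folds != f]
            centers, _ = kmeans(train, k)        # cluster the training part only
            score += total_squared_distance(held_out, centers)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

Because the held-out points play no part in forming the centers, choosing many clusters no longer wins automatically, which is exactly why cross-validation (or a penalty term) is needed here.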
Another possibility is to begin by finding a few clusters and determining
whether it is worth splitting them. You could choose k = 2, perform k-means
clustering until it terminates, and then consider splitting each cluster.
Computation time will be reduced considerably if the initial two-way clustering
is considered irrevocable and splitting is investigated for each component
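
A sketch of that splitting strategy, again building on the kmeans() and total_squared_distance() functions above, is shown below. The test that decides whether a split is worthwhile is passed in as a parameter; the crude threshold used in the example is only an illustrative stand-in for a proper criterion such as MDL, not the method described in the text.

```python
def split_clusters(points, worth_splitting):
    """Recursive two-way splitting: a cluster is divided only while the test approves."""
    if len(points) < 2:
        return [points]
    centers, assignments = kmeans(points, 2)     # each two-way split is never revisited
    left = [p for p, a in zip(points, assignments) if a == 0]
    right = [p for p, a in zip(points, assignments) if a == 1]
    if not left or not right or not worth_splitting(points, left, right):
        return [points]                          # keep this cluster as it is
    # Each half is examined independently, which is what keeps the computation cheap.
    return split_clusters(left, worth_splitting) + split_clusters(right, worth_splitting)

# Illustrative test: accept a split only if it removes at least three-quarters
# of the cluster's total squared distance about its mean.
def crude_test(parent, left, right):
    mean = lambda group: [tuple(sum(xs) / len(group) for xs in zip(*group))]
    before = total_squared_distance(parent, mean(parent))
    after = (total_squared_distance(left, mean(left)) +
             total_squared_distance(right, mean(right)))
    return after < 0.25 * before

clusters = split_clusters([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)], crude_test)
```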
