weighted learning, primarily in the context of regression problems. Frank et al.
(2003) evaluated the use of locally weighted learning in conjunction with Naïve
Bayes.

6.6 Clustering


In Section 4.8 we examined the k-means clustering algorithm in which k initial
points are chosen to represent initial cluster centers, all data points are assigned
to the nearest one, the mean value of the points in each cluster is computed to
form its new cluster center, and iteration continues until there are no changes
in the clusters. This procedure only works when the number of clusters is known
in advance, and this section begins by describing what you can do if it is not.
Next we examine two techniques that do not partition instances into disjoint
clusters as k-means does. The first is an incremental clustering method that was
developed in the late 1980s and embodied in a pair of systems called Cobweb
(for nominal attributes) and Classit (for numeric attributes). Both come up with
a hierarchical grouping of instances and use a measure of cluster “quality” called
category utility. The second is a statistical clustering method based on a mixture
model of different probability distributions, one for each cluster. It assigns
instances to classes probabilistically, not deterministically. We explain the basic
technique and sketch the working of a comprehensive clustering scheme called
AutoClass.
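
As a refresher on the procedure just summarized, here is a minimal sketch of the k-means iteration in plain Python. It assumes instances are numeric tuples and uses squared Euclidean distance; the function name, the random choice of initial centers, and the tie-breaking behavior are illustrative choices, not taken from any particular implementation.

```python
import random

def kmeans(points, k, seed=0):
    """Plain k-means: returns (centers, assignments) for a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # k initial points act as cluster centers
    assignments = None
    while True:
        # Assign every point to its nearest center (squared Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((p[d] - centers[c][d]) ** 2
                                            for d in range(len(p))))
            for p in points
        ]
        if new_assignments == assignments:       # no change in the clusters: stop
            return centers, new_assignments
        assignments = new_assignments
        # Recompute each center as the mean of the points now assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                          # keep the old center if a cluster empties
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))

# Example: two obvious groups in the plane.
centers, labels = kmeans([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)], k=2)
```

The sketches later in this section build on this function.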

Choosing the number of clusters


Suppose you are using k-means but do not know the number of clusters in
advance. One solution is to try out different possibilities and see which is best—
that is, which one minimizes the total squared distance of all points to their
cluster center. A simple strategy is to start from a given minimum, perhaps
k = 1, and work up to a small fixed maximum, using cross-validation to find the
best value. Because k-means is slow, and cross-validation makes it even slower,
it will probably not be feasible to try many possible values for k. Note that on
the training data the “best” clustering according to the total squared distance
criterion will always be to choose as many clusters as there are data points! To
penalize solutions with many clusters you have to apply something like the MDL
criterion of Section 5.10, or use cross-validation.
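
One way to make the cross-validation strategy concrete is sketched below, reusing the kmeans() function from the earlier sketch: each candidate value of k is scored by the total squared distance of held-out points to the centers learned from the remaining data, and the value with the lowest cross-validated score wins. The fold scheme and the k_max bound are illustrative assumptions; applying an MDL-style penalty on the training data, as mentioned above, would be the alternative.

```python
import random

def total_squared_distance(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(sum((p[d] - c[d]) ** 2 for d in range(len(p))) for c in centers)
               for p in points)

def choose_k(points, k_max=10, folds=10, seed=0):
    """Pick k by cross-validation, assuming the kmeans() sketch given earlier."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    best_k, best_score = 1, float("inf")
    for k in range(1, k_max + 1):
        score = 0.0
        for f in range(folds):
            held_out = shuffled[f::folds]        # every folds-th point is held out
            train = [p for i, p in enumerate(shuffled) if i % folds != f]
            centers, _ = kmeans(train, k)        # cluster the training part only
            score += total_squared_distance(held_out, centers)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

Because the held-out points play no part in forming the centers, choosing many clusters no longer wins automatically, which is exactly why cross-validation (or a penalty term) is needed here.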
Another possibility is to begin by finding a few clusters and determining
whether it is worth splitting them. You could choose k = 2, perform k-means
clustering until it terminates, and then consider splitting each cluster.
Computation time will be reduced considerably if the initial two-way clustering
is considered irrevocable and splitting is investigated for each component
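
A sketch of that splitting strategy, again building on the kmeans() and total_squared_distance() functions above, is shown below. The test that decides whether a split is worthwhile is passed in as a parameter; the crude threshold used in the example is only an illustrative stand-in for a proper criterion such as MDL, not the method described in the text.

```python
def split_clusters(points, worth_splitting):
    """Recursive two-way splitting: a cluster is divided only while the test approves."""
    if len(points) < 2:
        return [points]
    centers, assignments = kmeans(points, 2)     # each two-way split is never revisited
    left = [p for p, a in zip(points, assignments) if a == 0]
    right = [p for p, a in zip(points, assignments) if a == 1]
    if not left or not right or not worth_splitting(points, left, right):
        return [points]                          # keep this cluster as it is
    # Each half is examined independently, which is what keeps the computation cheap.
    return split_clusters(left, worth_splitting) + split_clusters(right, worth_splitting)

# Illustrative test: accept a split only if it removes at least three-quarters
# of the cluster's total squared distance about its mean.
def crude_test(parent, left, right):
    mean = lambda group: [tuple(sum(xs) / len(group) for xs in zip(*group))]
    before = total_squared_distance(parent, mean(parent))
    after = (total_squared_distance(left, mean(left)) +
             total_squared_distance(right, mean(right)))
    return after < 0.25 * before

clusters = split_clusters([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)], crude_test)
```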
