Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

The third metalearner, ThresholdSelector, optimizes the F-measure (Section
5.7) by selecting a probability threshold on the classifier's output. Performance
can be measured on the training data, on a holdout set, or by cross-validation.
The probabilities returned by the base learner can be rescaled into the full range
[0,1], which is useful if the scheme's probabilities are restricted to a narrow
subrange. The metalearner can be applied to multiclass problems by specifying
the class value for which the optimization is performed as


  1. The first class value

  2. The second class value

  3. Whichever value is least frequent

  4. Whichever value is most frequent

  5. The first class named yes, pos(itive), or 1.
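The idea behind threshold selection can be sketched outside Weka. The following illustration (an invented sketch, not the ThresholdSelector implementation) rescales the base learner's scores into [0,1] and then picks the cutoff that maximizes the F-measure on a set of held-out predictions:

```python
# Conceptual sketch of F-measure-driven threshold selection;
# all function names are illustrative, not Weka's API.

def rescale(scores):
    """Stretch scores into the full [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def f_measure(labels, preds):
    """Harmonic mean of precision and recall for boolean predictions."""
    tp = sum(1 for l, p in zip(labels, preds) if l and p)
    fp = sum(1 for l, p in zip(labels, preds) if not l and p)
    fn = sum(1 for l, p in zip(labels, preds) if l and not p)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_threshold(labels, scores):
    """Try each candidate cutoff; keep the F-measure maximizer."""
    scores = rescale(scores)
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(scores)):
        f = f_measure(labels, [s >= t for s in scores])
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```

In practice the held-out labels and scores would come from a holdout set or cross-validation, as described above, rather than from the training data itself.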


Retargeting classifiers for different tasks


Four metalearners adapt learners designed for one kind of task to another.
ClassificationViaRegression performs classification using a regression method
by binarizing the class and building a regression model for each value.
RegressionByDiscretization is a regression scheme that discretizes the class
attribute into a specified number of bins using equal-width discretization and
then employs a classifier. The predictions are the weighted average of the mean
class value for each discretized interval, with weights based on the predicted
probabilities for the intervals. OrdinalClassClassifier applies standard
classification algorithms to ordinal-class problems (Frank and Hall 2001).
MultiClassClassifier handles multiclass problems with two-class classifiers
using any of these methods:


  1. One versus all the rest

  2. Pairwise classification using voting to predict

  3. Exhaustive error-correcting codes (Section 7.5, page 334)

  4. Randomly selected error-correcting codes
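Method 2, pairwise classification, can be sketched as follows: one two-class model is built for every pair of class values, and at prediction time each model casts one vote. This is an illustrative outline with invented names (including the toy nearest-mean learner standing in for an arbitrary two-class scheme), not Weka's MultiClassClassifier code:

```python
from collections import Counter
from itertools import combinations

def nearest_mean_learner(subset, a, b):
    """Toy stand-in for any two-class scheme: predict whichever of the
    two classes has the nearer training mean (invented helper)."""
    def mean(lab):
        vals = [x for x, l in subset if l == lab]
        return sum(vals) / len(vals)
    ma, mb = mean(a), mean(b)
    return lambda x: a if abs(x - ma) <= abs(x - mb) else b

def train_pairwise(X, y, train_binary=nearest_mean_learner):
    """Build one two-class model per pair of class values."""
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        subset = [(x, lab) for x, lab in zip(X, y) if lab in (a, b)]
        models[(a, b)] = train_binary(subset, a, b)
    return models

def predict_pairwise(models, x):
    """Each pairwise model votes for one of its two classes;
    the class with the most votes wins."""
    votes = Counter()
    for model in models.values():
        votes[model(x)] += 1
    return votes.most_common(1)[0][0]
```

For k classes this builds k(k-1)/2 models, each trained only on the instances of its two classes, which is what distinguishes pairwise classification from one-versus-rest.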


Random code vectors are known to have good error-correcting properties: a
parameter specifies the length of the code vector (in bits).
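The error-correcting-code methods can be illustrated with a small decoder: each class is assigned a bit vector, one two-class problem is learned per bit position, and a vector of bit predictions is decoded to the class whose code word is nearest in Hamming distance. A hedged sketch under those assumptions (the code assignment and names are placeholders, not Weka's implementation):

```python
import random

def random_code(classes, n_bits, seed=0):
    """Assign each class a random bit vector of length n_bits,
    as in the randomly-selected-codes method above."""
    rng = random.Random(seed)
    return {c: tuple(rng.randint(0, 1) for _ in range(n_bits))
            for c in classes}

def decode(codebook, received):
    """Return the class whose code word is closest in Hamming
    distance to the received vector of bit predictions."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codebook, key=lambda c: hamming(codebook[c], received))
```

Because decoding tolerates a few flipped bits, individual bit classifiers can make errors without changing the final class prediction, which is why longer random code vectors improve robustness.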

10.6 Clustering algorithms


Table 10.7 lists Weka's clustering algorithms; the first two and SimpleKMeans
are described in Section 6.6. For the EM implementation you can specify how
many clusters to generate, or the algorithm can decide using cross-validation,
in which case the number of folds is fixed at 10 (unless there are fewer than
10 training instances). You can specify the maximum number of iterations and
set the minimum allowable standard deviation for the normal density calculation.
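The role of the minimum-standard-deviation setting can be seen in a toy one-dimensional Gaussian-mixture EM loop: flooring each standard deviation stops a normal density from collapsing onto a single point. This is a conceptual sketch with invented names for one numeric attribute only, not Weka's EM code:

```python
import math

def em_1d(data, k=2, iters=50, min_std=1e-6):
    """Tiny 1-D Gaussian-mixture EM; min_std floors each standard
    deviation, mirroring the minimum-allowable-standard-deviation
    option described above (illustrative sketch only)."""
    # Crude initialization: spread means across the data range.
    lo, hi = min(data), max(data)
    means = [lo + (i + 1) * (hi - lo) / (k + 1) for i in range(k)]
    stds = [max((hi - lo) / k, min_std)] * k
    weights = [1.0 / k] * k

    def pdf(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

    for _ in range(iters):
        # E step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(p) or 1.0
            resp.append([pi / total for pi in p])
        # M step: re-estimate weights, means, and floored std devs.
        for j in range(k):
            rj = [r[j] for r in resp]
            nj = sum(rj) or 1e-12
            weights[j] = nj / len(data)
            means[j] = sum(r * x for r, x in zip(rj, data)) / nj
            var = sum(r * (x - means[j]) ** 2 for r, x in zip(rj, data)) / nj
            stds[j] = max(math.sqrt(var), min_std)
    return means, stds, weights
```

Weka's EM additionally handles multiple attributes, nominal values, and the cross-validation-based choice of cluster count; the loop above shows only the iteration structure that the maximum-iterations and minimum-standard-deviation options control.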

418 CHAPTER 10 | THE EXPLORER
