Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

The third metalearner, ThresholdSelector, optimizes the F-measure (Section
5.7) by selecting a probability threshold on the classifier's output. Performance
can be measured on the training data, on a holdout set, or by cross-validation.
The probabilities returned by the base learner can be rescaled into the full range
[0,1], which is useful if the scheme's probabilities are restricted to a narrow
subrange. The metalearner can be applied to multiclass problems by specifying
the class value for which the optimization is performed as


  1. The first class value

  2. The second class value

  3. Whichever value is least frequent

  4. Whichever value is most frequent

  5. The first class named yes, pos(itive), or 1.
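The idea behind threshold selection can be sketched outside Weka. The following illustration (an invented sketch, not the ThresholdSelector implementation) rescales the base learner's scores into [0,1] and then picks the cutoff that maximizes the F-measure on a set of held-out predictions:

```python
# Conceptual sketch of F-measure-driven threshold selection;
# all function names are illustrative, not Weka's API.

def rescale(scores):
    """Stretch scores into the full [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def f_measure(labels, preds):
    """Harmonic mean of precision and recall for boolean predictions."""
    tp = sum(1 for l, p in zip(labels, preds) if l and p)
    fp = sum(1 for l, p in zip(labels, preds) if not l and p)
    fn = sum(1 for l, p in zip(labels, preds) if l and not p)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_threshold(labels, scores):
    """Try each candidate cutoff; keep the F-measure maximizer."""
    scores = rescale(scores)
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(scores)):
        f = f_measure(labels, [s >= t for s in scores])
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```

In practice the held-out labels and scores would come from a holdout set or cross-validation, as described above, rather than from the training data itself.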


Retargeting classifiers for different tasks


Four metalearners adapt learners designed for one kind of task to another.
ClassificationViaRegression performs classification using a regression method
by binarizing the class and building a regression model for each value.
RegressionByDiscretization is a regression scheme that discretizes the class
attribute into a specified number of bins using equal-width discretization and
then employs a classifier. The predictions are the weighted average of the mean
class value for each discretized interval, with weights based on the predicted
probabilities for the intervals. OrdinalClassClassifier applies standard
classification algorithms to ordinal-class problems (Frank and Hall 2001).
MultiClassClassifier handles multiclass problems with two-class classifiers
using any of these methods:


  1. One versus all the rest

  2. Pairwise classification using voting to predict

  3. Exhaustive error-correcting codes (Section 7.5, page 334)

  4. Randomly selected error-correcting codes
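Method 2, pairwise classification, can be sketched as follows: one two-class model is built for every pair of class values, and at prediction time each model casts one vote. This is an illustrative outline with invented names (including the toy nearest-mean learner standing in for an arbitrary two-class scheme), not Weka's MultiClassClassifier code:

```python
from collections import Counter
from itertools import combinations

def nearest_mean_learner(subset, a, b):
    """Toy stand-in for any two-class scheme: predict whichever of the
    two classes has the nearer training mean (invented helper)."""
    def mean(lab):
        vals = [x for x, l in subset if l == lab]
        return sum(vals) / len(vals)
    ma, mb = mean(a), mean(b)
    return lambda x: a if abs(x - ma) <= abs(x - mb) else b

def train_pairwise(X, y, train_binary=nearest_mean_learner):
    """Build one two-class model per pair of class values."""
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        subset = [(x, lab) for x, lab in zip(X, y) if lab in (a, b)]
        models[(a, b)] = train_binary(subset, a, b)
    return models

def predict_pairwise(models, x):
    """Each pairwise model votes for one of its two classes;
    the class with the most votes wins."""
    votes = Counter()
    for model in models.values():
        votes[model(x)] += 1
    return votes.most_common(1)[0][0]
```

For k classes this builds k(k-1)/2 models, each trained only on the instances of its two classes, which is what distinguishes pairwise classification from one-versus-rest.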


Random code vectors are known to have good error-correcting properties: a
parameter specifies the length of the code vector (in bits).
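The error-correcting-code methods can be illustrated with a small decoder: each class is assigned a bit vector, one two-class problem is learned per bit position, and a vector of bit predictions is decoded to the class whose code word is nearest in Hamming distance. A hedged sketch under those assumptions (the code assignment and names are placeholders, not Weka's implementation):

```python
import random

def random_code(classes, n_bits, seed=0):
    """Assign each class a random bit vector of length n_bits,
    as in the randomly-selected-codes method above."""
    rng = random.Random(seed)
    return {c: tuple(rng.randint(0, 1) for _ in range(n_bits))
            for c in classes}

def decode(codebook, received):
    """Return the class whose code word is closest in Hamming
    distance to the received vector of bit predictions."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codebook, key=lambda c: hamming(codebook[c], received))
```

Because decoding tolerates a few flipped bits, individual bit classifiers can make errors without changing the final class prediction, which is why longer random code vectors improve robustness.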

10.6 Clustering algorithms


Table 10.7 lists Weka's clustering algorithms; the first two and SimpleKMeans
are described in Section 6.6. For the EM implementation you can specify how
many clusters to generate, or the algorithm can decide using cross-validation,
in which case the number of folds is fixed at 10 (unless there are fewer than
10 training instances). You can specify the maximum number of iterations and
set the minimum allowable standard deviation for the normal density calculation.
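The role of the minimum-standard-deviation setting can be seen in a toy one-dimensional Gaussian-mixture EM loop: flooring each standard deviation stops a normal density from collapsing onto a single point. This is a conceptual sketch with invented names for one numeric attribute only, not Weka's EM code:

```python
import math

def em_1d(data, k=2, iters=50, min_std=1e-6):
    """Tiny 1-D Gaussian-mixture EM; min_std floors each standard
    deviation, mirroring the minimum-allowable-standard-deviation
    option described above (illustrative sketch only)."""
    # Crude initialization: spread means across the data range.
    lo, hi = min(data), max(data)
    means = [lo + (i + 1) * (hi - lo) / (k + 1) for i in range(k)]
    stds = [max((hi - lo) / k, min_std)] * k
    weights = [1.0 / k] * k

    def pdf(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

    for _ in range(iters):
        # E step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(p) or 1.0
            resp.append([pi / total for pi in p])
        # M step: re-estimate weights, means, and floored std devs.
        for j in range(k):
            rj = [r[j] for r in resp]
            nj = sum(rj) or 1e-12
            weights[j] = nj / len(data)
            means[j] = sum(r * x for r, x in zip(rj, data)) / nj
            var = sum(r * (x - means[j]) ** 2 for r, x in zip(rj, data)) / nj
            stds[j] = max(math.sqrt(var), min_std)
    return means, stds, weights
```

Weka's EM additionally handles multiple attributes, nominal values, and the cross-validation-based choice of cluster count; the loop above shows only the iteration structure that the maximum-iterations and minimum-standard-deviation options control.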

418 CHAPTER 10 | THE EXPLORER
