Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

and this quantity, evaluated on the test set, has been used to evaluate the success
of a rule when using reduced-error pruning.
This measure is open to criticism because it treats noncoverage of negative
examples as equally important as coverage of positive ones, which is unrealistic
in a situation where what is being evaluated is one rule that will eventually serve
alongside many others. For example, a rule that gets p = 2000 instances right
out of a total coverage of 3000 (i.e., it gets n = 1000 wrong) is judged as more
successful than one that gets p = 1000 out of a total coverage of 1001 (i.e.,
n = 1 wrong), because [p + (N - n)]/T is [1000 + N]/T in the first case but only
[999 + N]/T in the second. This is counterintuitive: the first rule is clearly less
predictive than the second, because it has a 33.3% as opposed to only a 0.1%
chance of being incorrect.
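To make the arithmetic concrete, the comparison can be reproduced in a short script. Note that the totals N (all negative instances) and T (all instances) are hypothetical values chosen purely for illustration; the text fixes only p and n for each rule, and the comparison comes out the same for any N and T.

```python
# Sketch of the [p + (N - n)] / T worth measure on the two example rules.
# N and T are assumed values for illustration; the text specifies only
# p (positives covered) and n (negatives covered) for each rule.
N = 5000   # total negative instances (hypothetical)
T = 10000  # total instances (hypothetical)

def worth(p, n):
    """Positives covered plus negatives NOT covered, over all instances."""
    return (p + (N - n)) / T

rule1 = worth(p=2000, n=1000)  # covers 3000 instances, gets 1000 wrong
rule2 = worth(p=1000, n=1)     # covers 1001 instances, gets 1 wrong

# The measure prefers rule 1 even though rule 2 is far more predictive:
assert rule1 > rule2
print(1000 / 3000, 1 / 1001)  # error rates: roughly 33% versus 0.1%
```

Because N and n enter only through the difference N - n, the ranking depends just on p - n, which is why the first rule's extra 1000 correct instances outweigh its 999 extra mistakes under this measure.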
Using the success rate p/t as a measure, as in the original formulation of the
covering algorithm (Figure 4.8), is not the perfect solution either, because it
would prefer a rule that got a single instance right (p = 1) out of a total cover-
age of 1 (so n = 0) to the far more useful rule that got 1000 right out of 1001.
Another heuristic that has been used is (p - n)/t, but that suffers from exactly
the same problem because (p - n)/t = 2p/t - 1, and so the result, when compar-
ing one rule with another, is just the same as with the success rate. It seems hard
to find a simple measure of the worth of a rule that corresponds with intuition
in all cases.
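The algebraic identity is easy to check numerically, and the same sketch exposes the success rate's failure mode on the example rules (function names here are illustrative, not from the book):

```python
# The success rate p/t and the (p - n)/t heuristic rank rules identically,
# since with t = p + n we have (p - n)/t = 2*(p/t) - 1.
def success_rate(p, n):
    return p / (p + n)

def p_minus_n(p, n):
    return (p - n) / (p + n)

for p, n in [(1, 0), (1000, 1), (2000, 1000)]:
    assert abs(p_minus_n(p, n) - (2 * success_rate(p, n) - 1)) < 1e-12

# Both measures prefer the trivial single-instance rule to the far more
# useful rule that gets 1000 right out of 1001:
assert success_rate(1, 0) > success_rate(1000, 1)
assert p_minus_n(1, 0) > p_minus_n(1000, 1)
```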
Whatever heuristic is used to measure the worth of a rule, the incremental
reduced-error pruning algorithm is the same. A possible rule learning algorithm
based on this idea is given in Figure 6.3. It generates a decision list, creating rules
for each class in turn and choosing at each stage the best version of the rule
according to its worth on the pruning data. The basic covering algorithm for
rule generation (Figure 4.8) is used to come up with good rules for each class,
choosing conditions to add to the rule using the accuracy measure p/t that we
described earlier.
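The overall loop can be sketched as a simplified separate-and-conquer learner. This is not the algorithm of Figure 6.3: rule growing is reduced to choosing a single attribute-value test by p/t, and the pruning step is omitted, just to show the decision-list shape of the process. All names and the toy dataset are illustrative.

```python
# Simplified separate-and-conquer sketch of a decision-list learner.
# A "rule" here is a single attribute = value test chosen by the accuracy
# measure p/t; the real algorithm grows multi-condition rules and keeps
# the best pruned version according to held-out pruning data.

def best_rule(instances, cls):
    """Pick the (attribute, value) test with the highest p/t for class cls."""
    best, best_score = None, -1.0
    for a in instances[0][0]:
        for v in {x[a] for x, _ in instances}:
            covered = [(x, y) for x, y in instances if x[a] == v]
            p = sum(1 for _, y in covered if y == cls)
            if covered and p / len(covered) > best_score:
                best, best_score = (a, v), p / len(covered)
    return best

def learn_decision_list(instances, classes):
    rules, remaining = [], list(instances)
    for cls in classes:                      # one class at a time
        while any(y == cls for _, y in remaining):
            a, v = best_rule(remaining, cls)
            rules.append((a, v, cls))
            # separate-and-conquer: drop the instances the rule covers
            remaining = [(x, y) for x, y in remaining if x[a] != v]
    return rules

data = [  # toy dataset, attribute names chosen for illustration
    ({"outlook": "sunny", "windy": "false"}, "yes"),
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "rain", "windy": "false"}, "yes"),
]
print(learn_decision_list(data, ["no", "yes"]))
```

Because instances covered by each rule are removed before the next rule is grown, the rules only make sense as an ordered list: a later rule is implicitly conditioned on all earlier rules having failed to fire.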
This method has been used to produce rule-induction schemes that can
process vast amounts of data and operate very quickly. It can be accelerated by
generating rules for the classes in order rather than generating a rule for each
class at every stage and choosing the best. A suitable ordering processes the
classes in increasing order of their frequency in the training set, so that the
rarest class is handled first and the most common ones later. Another significant
speedup is obtained by stopping the whole process when a rule of sufficiently
low accuracy is generated, so as not to spend time generating a lot of rules at
the end with very small coverage. However, very simple terminating conditions
(such as stopping when the accuracy for a rule is lower than the default accu-
racy for the class it predicts) do not give the best performance, and the only
conditions that have been found that seem to perform well are rather compli-
cated ones based on the MDL principle.
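The class-ordering speedup amounts to a sort by training-set frequency. A minimal illustration (the class labels are hypothetical):

```python
from collections import Counter

# Process classes in increasing order of training-set frequency, so the
# rarest class gets its rules generated first and the most common last.
labels = ["spam", "ham", "ham", "ham", "spam", "promo"]  # hypothetical data
counts = Counter(labels)
class_order = sorted(counts, key=counts.get)
print(class_order)  # rarest class first
```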

204 CHAPTER 6 | IMPLEMENTATIONS: REAL MACHINE LEARNING SCHEMES
