Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

and this quantity, evaluated on the test set, has been used to evaluate the success
of a rule when using reduced-error pruning.
This measure is open to criticism because it treats noncoverage of negative
examples as equally important as coverage of positive ones, which is unrealistic
in a situation where what is being evaluated is one rule that will eventually serve
alongside many others. For example, a rule that gets p = 2000 instances right
out of a total coverage of 3000 (i.e., it gets n = 1000 wrong) is judged as more
successful than one that gets p = 1000 out of a total coverage of 1001 (i.e.,
n = 1 wrong), because [p + (N - n)]/T is [1000 + N]/T in the first case but only
[999 + N]/T in the second. This is counterintuitive: the first rule is clearly less
predictive than the second, because it has a 33.3% as opposed to only a 0.1%
chance of being incorrect.
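To make the arithmetic concrete, the comparison can be reproduced in a short script. Note that the totals N (all negative instances) and T (all instances) are hypothetical values chosen purely for illustration; the text fixes only p and n for each rule, and the comparison comes out the same for any N and T.

```python
# Sketch of the [p + (N - n)] / T worth measure on the two example rules.
# N and T are assumed values for illustration; the text specifies only
# p (positives covered) and n (negatives covered) for each rule.
N = 5000   # total negative instances (hypothetical)
T = 10000  # total instances (hypothetical)

def worth(p, n):
    """Positives covered plus negatives NOT covered, over all instances."""
    return (p + (N - n)) / T

rule1 = worth(p=2000, n=1000)  # covers 3000 instances, gets 1000 wrong
rule2 = worth(p=1000, n=1)     # covers 1001 instances, gets 1 wrong

# The measure prefers rule 1 even though rule 2 is far more predictive:
assert rule1 > rule2
print(1000 / 3000, 1 / 1001)  # error rates: roughly 33% versus 0.1%
```

Because N and n enter only through the difference N - n, the ranking depends just on p - n, which is why the first rule's extra 1000 correct instances outweigh its 999 extra mistakes under this measure.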
Using the success rate p/t as a measure, as in the original formulation of the
covering algorithm (Figure 4.8), is not the perfect solution either, because it
would prefer a rule that got a single instance right (p = 1) out of a total cover-
age of 1 (so n = 0) to the far more useful rule that got 1000 right out of 1001.
Another heuristic that has been used is (p - n)/t, but that suffers from exactly
the same problem because (p - n)/t = 2p/t - 1, and so the result, when compar-
ing one rule with another, is just the same as with the success rate. It seems hard
to find a simple measure of the worth of a rule that corresponds with intuition
in all cases.
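The algebraic identity is easy to check numerically, and the same sketch exposes the success rate's failure mode on the example rules (function names here are illustrative, not from the book):

```python
# The success rate p/t and the (p - n)/t heuristic rank rules identically,
# since with t = p + n we have (p - n)/t = 2*(p/t) - 1.
def success_rate(p, n):
    return p / (p + n)

def p_minus_n(p, n):
    return (p - n) / (p + n)

for p, n in [(1, 0), (1000, 1), (2000, 1000)]:
    assert abs(p_minus_n(p, n) - (2 * success_rate(p, n) - 1)) < 1e-12

# Both measures prefer the trivial single-instance rule to the far more
# useful rule that gets 1000 right out of 1001:
assert success_rate(1, 0) > success_rate(1000, 1)
assert p_minus_n(1, 0) > p_minus_n(1000, 1)
```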
Whatever heuristic is used to measure the worth of a rule, the incremental
reduced-error pruning algorithm is the same. A possible rule learning algorithm
based on this idea is given in Figure 6.3. It generates a decision list, creating rules
for each class in turn and choosing at each stage the best version of the rule
according to its worth on the pruning data. The basic covering algorithm for
rule generation (Figure 4.8) is used to come up with good rules for each class,
choosing conditions to add to the rule using the accuracy measure p/t that we
described earlier.
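The overall loop can be sketched as a simplified separate-and-conquer learner. This is not the algorithm of Figure 6.3: rule growing is reduced to choosing a single attribute-value test by p/t, and the pruning step is omitted, just to show the decision-list shape of the process. All names and the toy dataset are illustrative.

```python
# Simplified separate-and-conquer sketch of a decision-list learner.
# A "rule" here is a single attribute = value test chosen by the accuracy
# measure p/t; the real algorithm grows multi-condition rules and keeps
# the best pruned version according to held-out pruning data.

def best_rule(instances, cls):
    """Pick the (attribute, value) test with the highest p/t for class cls."""
    best, best_score = None, -1.0
    for a in instances[0][0]:
        for v in {x[a] for x, _ in instances}:
            covered = [(x, y) for x, y in instances if x[a] == v]
            p = sum(1 for _, y in covered if y == cls)
            if covered and p / len(covered) > best_score:
                best, best_score = (a, v), p / len(covered)
    return best

def learn_decision_list(instances, classes):
    rules, remaining = [], list(instances)
    for cls in classes:                      # one class at a time
        while any(y == cls for _, y in remaining):
            a, v = best_rule(remaining, cls)
            rules.append((a, v, cls))
            # separate-and-conquer: drop the instances the rule covers
            remaining = [(x, y) for x, y in remaining if x[a] != v]
    return rules

data = [  # toy dataset, attribute names chosen for illustration
    ({"outlook": "sunny", "windy": "false"}, "yes"),
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "rain", "windy": "false"}, "yes"),
]
print(learn_decision_list(data, ["no", "yes"]))
```

Because instances covered by each rule are removed before the next rule is grown, the rules only make sense as an ordered list: a later rule is implicitly conditioned on all earlier rules having failed to fire.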
This method has been used to produce rule-induction schemes that can
process vast amounts of data and operate very quickly. It can be accelerated by
generating rules for the classes in order rather than generating a rule for each
class at every stage and choosing the best. A suitable ordering processes the
classes in increasing order of their frequency in the training set, so that the
rarest class is handled first and the most common ones later. Another significant
speedup is obtained by stopping the whole process when a rule of sufficiently
low accuracy is generated, so as not to spend time generating a lot of rules at
the end with very small coverage. However, very simple terminating conditions
(such as stopping when the accuracy for a rule is lower than the default accu-
racy for the class it predicts) do not give the best performance, and the only
conditions that have been found that seem to perform well are rather compli-
cated ones based on the MDL principle.
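The class-ordering speedup amounts to a sort by training-set frequency. A minimal illustration (the class labels are hypothetical):

```python
from collections import Counter

# Process classes in increasing order of training-set frequency, so the
# rarest class gets its rules generated first and the most common last.
labels = ["spam", "ham", "ham", "ham", "spam", "promo"]  # hypothetical data
counts = Counter(labels)
class_order = sorted(counts, key=counts.get)
print(class_order)  # rarest class first
```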

204 CHAPTER 6 | IMPLEMENTATIONS: REAL MACHINE LEARNING SCHEMES
