method that builds a full decision tree each time. The gain decreases as more
pruning takes place. For datasets with numeric attributes, the asymptotic time
complexity of the algorithm is the same as building the full decision tree,
because in this case the complexity is dominated by the time required to sort
the attribute values in the first place.
Once a partial tree has been built, a single rule is extracted from it. Each leaf
corresponds to a possible rule, and we seek the “best” leaf of those subtrees
(typically a small minority) that have been expanded into leaves. Experiments
show that it is best to aim at the most general rule by choosing the leaf that
covers the greatest number of instances.
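As a rough sketch of this extraction step (the partial-tree node structure, with its kind, children, covered, and leaf_class fields, is assumed purely for illustration and is not the actual implementation), one can pick the expanded leaf that covers the most training instances and read the rule off the path leading down to it:

    # Sketch only: read one rule off a partial tree. The node structure
    # (kind, children, covered, leaf_class) is assumed for illustration.

    def best_leaf(node, path=()):
        # Return (coverage, conditions, predicted class) for the leaf covering
        # the most training instances; None for undeveloped subtrees.
        if node.kind == "unexpanded":
            return None
        if node.kind == "leaf":
            return (node.covered, path, node.leaf_class)
        best = None
        for condition, child in node.children:   # condition, e.g. ("outlook", "=", "sunny")
            cand = best_leaf(child, path + (condition,))
            if cand is not None and (best is None or cand[0] > best[0]):
                best = cand
        return best

    def extract_rule(partial_tree):
        # The most general leaf becomes the rule: "if all conditions hold, predict cls".
        coverage, conditions, cls = best_leaf(partial_tree)
        return list(conditions), cls

Choosing the leaf with the greatest coverage corresponds to aiming at the most general rule, which is what the experiments mentioned above favor.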
When a dataset contains missing values, they can be dealt with exactly as they
are when building decision trees. If an instance cannot be assigned to any given
branch because of a missing attribute value, it is assigned to each of the branches
with a weight proportional to the number of training instances going down that
branch, normalized by the total number of training instances with known values
at the node. During testing, the same procedure is applied separately to each
rule, thus associating a weight with the application of each rule to the test
instance. That weight is deducted from the instance’s total weight before it is
passed to the next rule in the list. Once the weight has reduced to zero, the
predicted class probabilities are combined into a final classification according to
the weights.
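A minimal sketch of this test-time procedure follows, assuming each rule exposes two hypothetical methods: coverage_weight, the fraction of the instance's weight the rule covers (accounting for missing values), and class_distribution, its class probability estimates. The interface and names are illustrative, not the book's code.

    # Sketch only: apply a decision list to a test instance that may have
    # missing values. Each rule is assumed to supply two hypothetical methods:
    #   rule.coverage_weight(instance) -> fraction of the instance's weight it covers
    #   rule.class_distribution()      -> mapping from class to probability
    from collections import defaultdict

    def classify(rules, instance):
        remaining = 1.0                        # the instance starts with weight 1
        votes = defaultdict(float)
        for rule in rules:
            w = remaining * rule.coverage_weight(instance)
            if w == 0.0:
                continue                       # rule does not apply at all
            for cls, p in rule.class_distribution().items():
                votes[cls] += w * p            # weight the rule's class probabilities
            remaining -= w                     # deduct before passing to the next rule
            if remaining <= 0.0:
                break                          # nothing left to classify
        # combine the weighted class probabilities into a final classification
        return max(votes, key=votes.get) if votes else None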
This yields a simple but surprisingly effective method for learning decision
lists for noisy data. Its main advantage over other schemes that generate
comprehensive rule sets is simplicity, because other methods require a complex
global optimization stage to achieve the same level of performance.
Rules with exceptions
In Section 3.5 we learned that a natural extension of rules is to allow them to
have exceptions, and exceptions to the exceptions, and so on—indeed the whole
rule set can be considered as exceptions to a default classification rule that is
used when no other rules apply. The method of generating a “good” rule, using
one of the measures described in the previous section, provides exactly the
mechanism needed to generate rules with exceptions.
First, a default class is selected for the top-level rule: it is natural to use the
class that occurs most frequently in the training data. Then, a rule is found
pertaining to any class other than the default one. Of all such rules it is natural to
seek the one with the most discriminatory power, for example, the one with the
best evaluation on a test set. Suppose this rule has the form
if <condition> then class = <new class>
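A rough sketch of these first two steps is given below; the helpers generate_candidate_rules and evaluate, along with the cls and predicted_class fields, are hypothetical placeholders rather than any particular implementation.

    # Sketch only: the first two steps described above. The helpers
    # generate_candidate_rules() and evaluate() are hypothetical placeholders.
    from collections import Counter

    def default_class(training_data):
        # Top-level rule: predict the most frequent class in the training data.
        return Counter(inst.cls for inst in training_data).most_common(1)[0][0]

    def best_exception_rule(training_data, test_data, default):
        # Among rules predicting any class other than the default, pick the one
        # with the most discriminatory power, judged by evaluation on a test set.
        candidates = [r for r in generate_candidate_rules(training_data)
                      if r.predicted_class != default]
        return max(candidates, key=lambda r: evaluate(r, test_data))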