Of the rules for these instances that do not predict the default class Iris setosa,
the best is


if petal length ≥ 3.35 then Iris virginica

It covers all 47 Iris virginicas that are in the example set (3 were removed by the
first rule, as explained previously). It also covers 1 Iris versicolor. This needs to
be taken care of as an exception, by the final rule:


if petal length < 4.85 and sepal length < 5.95 then Iris versicolor

Fortunately, the instances that do not satisfy its condition all belong to the
default class, Iris setosa. Thus the procedure is finished.
The rules that are produced have the property that most of the examples are
covered by the high-level rules and the lower-level ones really do represent
exceptions. For example, the last exception clause in the preceding rules and the
deeply nested else clause both cover a solitary example, and removing them
would have little effect. Even the remaining nested exception rule covers only
two examples. Thus one can get an excellent feeling for what the rules do by
ignoring all the deeper structure and looking only at the first level or two. That
is the attraction of rules with exceptions.
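
Seen as code, this structure is just nested conditionals. The following is a
minimal sketch in Python; the function name and signature are illustrative
only, and the earlier rules of the full example (not shown on this page) are
omitted:

def classify_iris(petal_length, sepal_length):
    # Rule: if petal length >= 3.35 then Iris virginica ...
    if petal_length >= 3.35:
        # ... except when the final exception rule fires
        if petal_length < 4.85 and sepal_length < 5.95:
            return "Iris versicolor"
        return "Iris virginica"
    # Instances covered by no rule fall through to the default class
    return "Iris setosa"

Reading only the outer level gives the high-level behavior; the nested test is
the exception.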


Discussion


All algorithms for producing classification rules that we have described use the
basic covering or separate-and-conquer approach. For the simple, noise-free
case this produces PRISM (Cendrowska 1987), an algorithm that is simple and
easy to understand. When applied to two-class problems with the closed world
assumption, it is only necessary to produce rules for one class: then the rules
are in disjunctive normal form and can be executed on test instances without
any ambiguity arising. When applied to multiclass problems, a separate rule set
is produced for each class: thus a test instance may be assigned to more than
one class, or to no class, and further heuristics are necessary if a unique
prediction is sought.
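
To make the covering idea concrete, here is a compact sketch in Python. It is
not Cendrowska's original formulation: the dict-based instance representation
and the function names are assumptions, and only nominal attributes are
handled. The selection criterion, however, is PRISM's: each added test
maximizes p/t, where t is the number of instances the growing rule covers and
p is the number of those that belong to the target class.

def learn_one_rule(instances, cls):
    # Grow one rule for cls by greedily adding the attribute-value test
    # that maximizes p/t, PRISM's accuracy criterion
    tests = []
    covered = list(instances)
    attrs = [a for a in instances[0] if a != "class"]
    while attrs and any(i["class"] != cls for i in covered):
        best, best_ratio = None, -1.0
        for attr in attrs:
            for val in {i[attr] for i in covered}:
                subset = [i for i in covered if i[attr] == val]
                p = sum(1 for i in subset if i["class"] == cls)
                if p / len(subset) > best_ratio:
                    best, best_ratio = (attr, val), p / len(subset)
        tests.append(best)
        covered = [i for i in covered if i[best[0]] == best[1]]
        attrs.remove(best[0])
    return tests

def covering(instances, cls):
    # Separate-and-conquer: learn a rule, remove the instances it
    # covers, and repeat until no instances of cls are left
    rules, remaining = [], list(instances)
    while any(i["class"] == cls for i in remaining):
        tests = learn_one_rule(remaining, cls)
        rules.append((tests, cls))
        remaining = [i for i in remaining
                     if not all(i[a] == v for a, v in tests)]
    return rules

For a multiclass problem this loop is run once per class against the full
training set, which is exactly why the resulting rule sets can overlap or
leave some test instances unclassified.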
To reduce overfitting in noisy situations, it is necessary to produce rules that
are not “perfect” even on the training set. To do this it is necessary to have a
measure for the “goodness,” or worth, of a rule. With such a measure it is then
possible to abandon the class-by-class approach of the basic covering algorithm
and start by generating the very best rule, regardless of which class it predicts,
and then remove all examples covered by this rule and continue the process.
This yields a method for producing a decision list rather than a set of
independent classification rules, and decision lists have the important advantage that
they do not generate ambiguities when interpreted.
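
In outline, the decision-list version of covering looks like the following
sketch, where rules are (tests, class) pairs as before and the rule generator,
the worth measure, and the coverage test are left as parameters; all three are
placeholders rather than anything this chapter prescribes:

def build_decision_list(instances, generate_rules, worth, covers):
    # Repeatedly take the single best rule by the worth measure,
    # regardless of which class it predicts, then discard the
    # training examples that rule covers
    decision_list, remaining = [], list(instances)
    while remaining:
        tests, cls = max(generate_rules(remaining),
                         key=lambda r: worth(r, remaining))
        still_uncovered = [i for i in remaining if not covers(tests, i)]
        if len(still_uncovered) == len(remaining):
            break  # best rule covers nothing; stop rather than loop forever
        decision_list.append((tests, cls))
        remaining = still_uncovered
    return decision_list

def classify_with_list(decision_list, instance, covers, default_cls):
    # Rules are tried strictly in order: the first one whose tests the
    # instance satisfies makes the prediction, so no ambiguity can arise
    for tests, cls in decision_list:
        if covers(tests, instance):
            return cls
    return default_cls

Because exactly one rule (or the default) fires for any instance, no further
tie-breaking heuristics are needed at classification time.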
