
Discussion


In a seminal paper titled “Very simple classification rules perform well on most commonly used datasets” (Holte 1993), a comprehensive study of the performance of the 1R procedure was reported on 16 datasets frequently used by machine learning researchers to evaluate their algorithms. Throughout, the study used cross-validation, an evaluation technique that we will explain in Chapter 5, to ensure that the results were representative of what independent test sets would yield. After some experimentation, the minimum number of examples in each partition of a numeric attribute was set at six, not three as used for the preceding illustration.
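
To make the procedure concrete, here is a minimal Python sketch of 1R. The data layout (parallel lists of attribute tuples and class labels), the function names, and especially the greedy discretization are illustrative assumptions of ours, not code from Holte's paper, whose partitioning procedure differs in detail.

from collections import Counter

def discretize(values, labels, min_bucket=6):
    # Greedy sketch: sweep the values in sorted order and close a
    # partition once its majority class has at least min_bucket
    # examples and the next value differs (so equal values are never
    # split apart). A simplified reading of Holte's procedure.
    pairs = sorted(zip(values, labels))
    cuts, tally = [], Counter()
    for i, (v, c) in enumerate(pairs):
        tally[c] += 1
        if (tally.most_common(1)[0][1] >= min_bucket
                and i + 1 < len(pairs) and pairs[i + 1][0] != v):
            cuts.append((v + pairs[i + 1][0]) / 2)
            tally = Counter()
    return cuts  # map a value to its bucket with bisect.bisect(cuts, value)

def one_r(rows, labels):
    # Try each attribute in turn; keep the one whose one-level rule
    # set makes the fewest errors on the training data. Numeric
    # attributes are assumed to have been discretized already.
    best = None
    for a in range(len(rows[0])):
        per_value = {}
        for row, cls in zip(rows, labels):
            per_value.setdefault(row[a], Counter())[cls] += 1
        rule = {v: t.most_common(1)[0][0] for v, t in per_value.items()}
        errors = sum(cls != rule[row[a]] for row, cls in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (a, rule, errors)
    return best  # (attribute index, value-to-class rule, training errors)

Setting min_bucket to 1 would allow the overfitted one-example-per-partition rules noted earlier; six was the value the study settled on.
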
Surprisingly, despite its simplicity, 1R did astonishingly, even embarrassingly, well in comparison with state-of-the-art learning methods, and the rules it produced turned out to be just a few percentage points less accurate, on almost all of the datasets, than the decision trees produced by a state-of-the-art decision tree induction scheme. These trees were, in general, considerably larger than 1R's rules. Rules that test a single attribute are often a viable alternative to more complex structures, and this strongly encourages a simplicity-first methodology in which the baseline performance is established using simple, rudimentary techniques before progressing to more sophisticated learning methods, which inevitably generate output that is harder for people to interpret.
The 1R procedure learns a one-level decision tree whose leaves represent the different classes. A slightly more expressive technique is to use a different rule for each class. Each rule is a conjunction of tests, one for each attribute. For numeric attributes the test checks whether the value lies within a given interval; for nominal ones it checks whether it is in a certain subset of that attribute's values. These two kinds of test, intervals and subsets, are learned from the training data pertaining to each class. For a numeric attribute, the endpoints of the interval are the minimum and maximum values that occur in the training data for that class. For a nominal one, the subset contains just those values that occur for that attribute in the training data for the class. Rules representing different classes usually overlap, and at prediction time the class whose rule has the most matching tests is predicted. This simple technique often gives a useful first impression of a dataset. It is extremely fast and can be applied to very large quantities of data.
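
Here is a minimal Python sketch of this per-class technique, under the same assumed data layout as before (numeric[i] flags whether attribute i is numeric; the function names are ours):

def learn_rules(rows, labels, numeric):
    # One rule per class: for a numeric attribute keep the [min, max]
    # interval of values seen in that class's training data; for a
    # nominal one keep the set of values seen.
    rules = {}
    for row, cls in zip(rows, labels):
        rule = rules.setdefault(cls, [None] * len(row))
        for i, v in enumerate(row):
            if numeric[i]:
                lo, hi = rule[i] if rule[i] else (v, v)
                rule[i] = (min(lo, v), max(hi, v))
            else:
                rule[i] = (rule[i] or set()) | {v}
    return rules

def predict(rules, row, numeric):
    # Rules for different classes usually overlap: predict the class
    # whose rule has the most matching tests.
    def hits(rule):
        return sum((t[0] <= v <= t[1]) if numeric[i] else (v in t)
                   for i, (t, v) in enumerate(zip(rule, row)))
    return max(rules, key=lambda c: hits(rules[c]))

Because each rule is built in a single pass over that class's training instances, learning is linear in the size of the data, which is what makes the method fast enough for very large datasets.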

4.2 Statistical modeling


The 1R method uses a single attribute as the basis for its decisions and chooses the one that works best. Another simple technique is to use all attributes and allow them to make contributions to the decision that are equally important and independent of one another, given the class. This is unrealistic, of course: what makes real-life datasets interesting is that their attributes are certainly not equally important or independent.
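
In symbols, writing a1, ..., ak for the attribute values of an instance and c for a class (our notation, anticipating the development below), the assumption is the standard one:

\Pr[C = c \mid a_1, \ldots, a_k] \propto \Pr[C = c] \prod_{i=1}^{k} \Pr[a_i \mid C = c]

Each attribute contributes exactly one factor, weighted no more heavily than any other, once the class is fixed.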
