The more recent version, C5.0, is available commercially. Its decision tree induc-
tion seems to be essentially the same as that used by C4.5, and tests show some
differences but negligible improvements. However, its rule generation is greatly
sped up and clearly uses a different technique, although this has not been
described in the open literature.
C4.5 works essentially as described in the preceding sections. The default con-
fidence value is set at 25% and works reasonably well in most cases; possibly it
should be altered to a lower value, which causes more drastic pruning, if the
actual error rate of pruned trees on test sets is found to be much higher than
the estimated error rate. There is one other important parameter whose effect
is to eliminate tests for which almost all of the training examples have the same
outcome. Such tests are often of little use. Consequently, tests are not incorpo-
rated into the decision tree unless they have at least two outcomes that have at
least a minimum number of instances. The default value for this minimum is
2, but it is controllable and should perhaps be increased for tasks that have a lot
of noisy data.
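As a concrete illustration, the following sketch sets these two parameters programmatically, assuming the reader is using Weka's J48 classifier, an open-source re-implementation of C4.5; the particular parameter values and the data file name are invented for illustration, not recommendations.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PruningDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (file name is hypothetical) and mark the last
        // attribute as the class.
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        // Confidence factor for error-based pruning: the default is 0.25;
        // a smaller value inflates the pessimistic error estimate and so
        // prunes more drastically.
        tree.setConfidenceFactor(0.10f);
        // Minimum number of instances required in at least two outcomes of a
        // test; raising it above the default of 2 suppresses tests supported
        // by only a handful of (possibly noisy) instances.
        tree.setMinNumObj(5);

        tree.buildClassifier(data);
        System.out.println(tree);
    }
}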
Discussion
Top-down induction of decision trees is probably the most extensively
researched method of machine learning used in data mining. Researchers have
investigated a panoply of variations for almost every conceivable aspect of the
learning process—for example, different criteria for attribute selection or
modified pruning methods. However, they are rarely rewarded by substantial
improvements in accuracy over a spectrum of diverse datasets. Sometimes the
size of the induced trees is significantly reduced when a different pruning strat-
egy is adopted, but often the same effect can be achieved by setting C4.5’s
pruning parameter to a smaller value.
In our description of decision trees, we have assumed that only one
attribute is used to split the data into subsets at each node of the tree.
However, it is possible to allow tests that involve several attributes at a time.
For example, with numeric attributes each test can be on a linear combination
of attribute values. Then the final tree consists of a hierarchy of linear models
of the kind we described in Section 4.6, and the splits are no longer restricted
to being axis-parallel. Trees with tests involving more than one attribute are
called multivariate decision trees, in contrast to the simple univariate trees
that we normally use. Multivariate tests were introduced with the classification
and regression trees (CART) system for learning decision trees (Breiman et al.
1984). They are often more accurate and smaller than univariate trees but take
much longer to generate and are also more difficult to interpret. We briefly
mention one way of generating them using principal components analysis in
Section 7.3 (page 309).
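To make the distinction concrete, here is a small sketch, not taken from CART or C4.5, contrasting a univariate node test on a single numeric attribute with a multivariate test on a linear combination of attributes; the attribute indices, weights, and thresholds are invented for illustration.

public class NodeTests {

    // Univariate test: compare one attribute value against a threshold,
    // giving an axis-parallel split.
    static boolean univariateTest(double[] x) {
        return x[2] > 3.5;                    // e.g. third attribute > 3.5
    }

    // Multivariate test: compare a linear combination of attribute values
    // against a threshold, giving an oblique (non-axis-parallel) split.
    static boolean multivariateTest(double[] x) {
        double[] w = {0.8, -0.3, 1.2, 0.0};   // weights chosen by the learner
        double sum = 0;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * x[i];
        }
        return sum > 2.0;
    }

    public static void main(String[] args) {
        double[] instance = {5.1, 3.5, 1.4, 0.2};
        System.out.println("univariate:   " + univariateTest(instance));
        System.out.println("multivariate: " + multivariateTest(instance));
    }
}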