Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

undefined subtrees. To generate such a tree, the construction and pruning oper-
ations are integrated in order to find a “stable” subtree that can be simplified no
further. Once this subtree has been found, tree building ceases and a single rule
is read off.
The tree-building algorithm is summarized in Figure 6.5: it splits a set of
instances recursively into a partial tree. The first step chooses a test and divides
the instances into subsets accordingly. The choice is made using the same infor-
mation-gain heuristic that is normally used for building decision trees (Section
4.3). Then the subsets are expanded in increasing order of their average entropy.
The reason for this is that the later subsets will most likely not end up being
expanded, and a subset with low average entropy is more likely to result in a
small subtree and therefore produce a more general rule. This proceeds recur-
sively until a subset is expanded into a leaf, and then continues further by back-
tracking. But as soon as an internal node appears that has all its children
expanded into leaves, the algorithm checks whether that node is better replaced
by a single leaf. This is just the standard subtree replacement operation of
decision tree pruning (Section 6.1). If replacement is performed the algorithm
backtracks in the standard way, exploring siblings of the newly replaced node.
However, if during backtracking a node is encountered all of whose children are
not leaves—and this will happen as soon as a potential subtree replacement is
not performed—then the remaining subsets are left unexplored and the corre-
sponding subtrees are left undefined. Because of the recursive structure of the
algorithm, this event automatically terminates tree generation.
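The subset-ordering heuristic can be sketched in a few lines of Python. The subsets below are invented for illustration, and for simplicity the plain entropy of each subset stands in for its average entropy:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Hypothetical subsets produced by a test with three outcomes.
subsets = [
    ["yes", "yes", "no"],          # mixed: entropy about 0.918 bits
    ["yes", "yes", "yes", "yes"],  # pure: entropy 0
    ["no", "no", "yes", "yes"],    # evenly split: entropy 1 bit
]

# Expand subsets in increasing order of entropy, so the low-entropy
# subsets--the ones most likely to yield small subtrees and hence
# general rules--are built first.
ordered = sorted(subsets, key=entropy)
```

After sorting, `ordered` begins with the pure subset, which is the one most likely to collapse into a small subtree.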
Figure 6.6 shows a step-by-step example. During the stages in Figure 6.6(a)
through (c), tree building continues recursively in the normal way—except that

208 CHAPTER 6 | IMPLEMENTATIONS: REAL MACHINE LEARNING SCHEMES


Expand-subset (S):
    Choose a test T and use it to split the set of examples into subsets
    Sort subsets into increasing order of average entropy
    while (there is a subset X that has not yet been expanded
           AND all subsets expanded so far are leaves)
        expand-subset(X)
    if (all the subsets expanded are leaves
        AND estimated error for subtree ≥ estimated error for node)
        undo expansion into subsets and make node a leaf

Figure 6.5 Algorithm for expanding examples into a partial tree.
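The pseudocode of Figure 6.5 can be turned into a runnable sketch. Everything below—the toy weather-style dataset, the Node class, and the error estimate—is invented for illustration; in particular, the resubstitution (training-set) error is a crude stand-in for the pessimistic error estimate that the real algorithm inherits from C4.5:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

class Node:
    def __init__(self, instances):
        self.instances = instances   # list of (attribute-dict, label) pairs
        self.children = {}           # attribute value -> child Node
        self.attr = None             # test chosen at this node
        self.is_leaf = False

    def majority(self):
        return Counter(l for _, l in self.instances).most_common(1)[0][0]

def errors(node):
    # Training-set errors of the majority-class prediction: a crude
    # stand-in for the pessimistic estimate used in practice.
    return sum(1 for _, l in node.instances if l != node.majority())

def best_attribute(instances, attrs):
    """Choose the test with the highest information gain (Section 4.3)."""
    labels = [l for _, l in instances]
    def gain(a):
        split = {}
        for x, l in instances:
            split.setdefault(x[a], []).append(l)
        remainder = sum(len(s) / len(labels) * entropy(s)
                        for s in split.values())
        return entropy(labels) - remainder
    return max(attrs, key=gain)

def expand_subset(node, attrs):
    """Expand a set of instances into a partial tree (after Figure 6.5)."""
    labels = [l for _, l in node.instances]
    if len(set(labels)) == 1 or not attrs:     # pure subset: make a leaf
        node.is_leaf = True
        return
    node.attr = best_attribute(node.instances, attrs)
    split = {}
    for x, l in node.instances:
        split.setdefault(x[node.attr], []).append((x, l))
    # Expand subsets in increasing order of their entropy.
    order = sorted(split, key=lambda v: entropy([l for _, l in split[v]]))
    node.children = {v: Node(split[v]) for v in order}
    remaining = [a for a in attrs if a != node.attr]
    for v in order:
        expand_subset(node.children[v], remaining)
        if not node.children[v].is_leaf:
            # Subtree replacement was not performed below: leave the
            # remaining subsets unexplored (undefined subtrees).
            return
    # All children are leaves: consider standard subtree replacement.
    if errors(node) <= sum(errors(c) for c in node.children.values()):
        node.children, node.attr, node.is_leaf = {}, None, True

# Toy weather-style data (invented for illustration).
data = [
    ({"outlook": "sunny", "windy": "false"}, "no"),
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "overcast", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
]
root = Node(data)
expand_subset(root, ["outlook", "windy"])
```

On this data the root splits on `outlook`; the pure `sunny` and `overcast` subsets become leaves, and the mixed `rainy` subset is expanded last, exactly the ordering the text describes.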
