every possible location in the tree is considered for addition, and a node is added
according to a performance measure that depends on the particular boosting
algorithm employed. However, heuristics can be used instead of an exhaustive
search to speed up the learning process.
Logistic model trees
Option trees and alternating trees yield very good classification performance
based on a single structure, but they may still be difficult to interpret when there
are many option nodes because it is hard to see how a particular prediction
is derived. However, it turns out that boosting can also be used to build
very effective decision trees that do not include any options at all. For example,
the LogitBoost algorithm has been used to induce trees with linear logistic
regression models at the leaves. These are called logistic model trees and are
interpreted in the same way as the model trees for regression described in
Section 6.5.
LogitBoost performs additive logistic regression. Suppose that each iteration
of the boosting algorithm fits a simple regression function by going through all
the attributes, finding the simple regression function with the smallest error,
and adding it into the additive model. If the LogitBoost algorithm is run until
convergence, the result is a maximum likelihood multiple logistic regression
model. However, for optimum performance on future data it is usually unnec-
essary to wait for convergence—and to do so is often detrimental. An appro-
priate number of boosting iterations can be determined by estimating the
expected performance for a given number of iterations using cross-validation
and stopping the process when performance ceases to increase.
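To make the procedure concrete, the following Python sketch implements two-class LogitBoost with simple one-attribute linear regression functions as the base learners. The function names, the NumPy dependency, and the assumption that y holds 0/1 class labels are illustrative choices, not part of any particular implementation.

import numpy as np

def fit_simple_regression(x, z, w):
    # Weighted least-squares fit of the working response z on a single
    # attribute x, i.e., z ~ a*x + b with weights w.
    xm = np.average(x, weights=w)
    zm = np.average(z, weights=w)
    sxx = np.sum(w * (x - xm) ** 2)
    a = np.sum(w * (x - xm) * (z - zm)) / sxx if sxx > 0 else 0.0
    return a, zm - a * xm

def logitboost_simple(X, y, n_iter):
    # X: n-by-d numeric attribute matrix; y: 0/1 class labels.
    n, d = X.shape
    F = np.zeros(n)            # additive model evaluated on the training data
    model = []                 # list of (attribute index, slope, intercept)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))
        w = np.clip(p * (1.0 - p), 1e-10, None)   # working weights
        z = (y - p) / w                           # working responses
        # Go through all attributes and keep the simple regression
        # function with the smallest weighted squared error.
        best = None
        for j in range(d):
            a, b = fit_simple_regression(X[:, j], z, w)
            err = np.sum(w * (z - (a * X[:, j] + b)) ** 2)
            if best is None or err < best[0]:
                best = (err, j, a, b)
        _, j, a, b = best
        model.append((j, a, b))
        F += 0.5 * (a * X[:, j] + b)              # LogitBoost update
    return model

In practice the number of iterations n_iter would be chosen by the cross-validation procedure just described rather than fixed in advance.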
A simple extension of this algorithm leads to logistic model trees. The boost-
ing process terminates when there is no further structure in the data that can
be modeled using a linear logistic regression function. However, there may still
be structure that linear models can fit if attention is restricted to subsets of
the data, obtained, for example, by a standard decision tree criterion such as
information gain. Then, once no further improvement can be obtained by
adding more simple linear models, the data is split and boosting is resumed sep-
arately in each subset. This process takes the logistic model generated so far and
refines it separately for the data in each subset. Again, cross-validation is run in
each subset to determine an appropriate number of iterations to perform in that
subset.
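The sketch below outlines this recursive construction. The helpers logitboost_cv (boosting with the number of iterations chosen by cross-validation, starting from the model inherited from the parent node), best_split (a standard split criterion such as information gain), and the minimum subset size are hypothetical placeholders rather than the actual implementation.

MIN_SUBSET_SIZE = 15   # illustrative lower bound on the size of a subset

def build_lmt(X, y, parent_model=None):
    # Refine the logistic model inherited from the parent on this subset,
    # choosing the number of LogitBoost iterations by cross-validation.
    model = logitboost_cv(X, y, initial_model=parent_model)  # hypothetical helper
    node = {"model": model, "split": None, "children": None}
    if len(y) < MIN_SUBSET_SIZE:
        return node
    split = best_split(X, y)      # hypothetical helper, e.g., information gain
    if split is None:             # no useful split: this node stays a leaf
        return node
    node["split"] = split
    node["children"] = [build_lmt(X[idx], y[idx], parent_model=model)
                        for idx in split.partition(X)]  # index arrays per branch
    return node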
The process is applied recursively until the subsets become too small. The
resulting tree will surely overfit the training data, and one of the standard
methods of decision tree learning can be used to prune it. Experiments indicate
that the pruning operation is very important. Using a strategy that chooses the
right tree size by cross-validation, the algorithm produces small but very
accurate trees with linear logistic models at the leaves.
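As one illustration of pruning, the sketch below collapses a subtree into a leaf whenever doing so does not increase the error estimated on held-out data. This reduced-error style strategy is swapped in purely for simplicity; the algorithm described here chooses the tree size by cross-validation instead. The helpers leaf_error and subtree_error are hypothetical.

def prune(node, X_val, y_val):
    # Bottom-up pruning on held-out data for a tree built by build_lmt above.
    if node["children"] is None:
        return node
    for child, idx in zip(node["children"], node["split"].partition(X_val)):
        prune(child, X_val[idx], y_val[idx])
    # Collapse the subtree if the logistic model at this node alone
    # is at least as accurate as the subtree below it.
    if leaf_error(node, X_val, y_val) <= subtree_error(node, X_val, y_val):
        node["children"], node["split"] = None, None
    return node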