these errors by learning a second model—perhaps another regression tree—that
tries to predict the observed residuals. To do this, simply replace the
original class values by their residuals before learning the second model.
Adding the predictions made by the second model to those of the first one
automatically yields lower error on the training data. Usually some residuals
still remain, because the second model is not a perfect one, so we continue
with a third model that learns to predict the residuals of the residuals, and
so on. The procedure is reminiscent of the use of rules with exceptions for
classification that we met in Section 3.5.
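In code, the basic procedure might look something like the following sketch.
It is ours rather than Weka’s: it assumes scikit-learn’s DecisionTreeRegressor
as the base learner, and the names additive_regression and predict_ensemble
are merely illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def additive_regression(X, y, n_models=10, max_depth=3):
    """Forward stagewise additive regression: each new model is fit to the
    residuals left over by the models built so far."""
    models = []
    residuals = np.asarray(y, dtype=float).copy()
    for _ in range(n_models):
        model = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        models.append(model)
        residuals -= model.predict(X)        # what remains unexplained
    return models

def predict_ensemble(models, X):
    # The ensemble's prediction is the sum of its members' predictions.
    return np.sum([m.predict(X) for m in models], axis=0)

Any base learner with fit and predict methods could be substituted for the
regression tree here.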
If the individual models minimize the squared error of the predictions, as
linear regression models do, this algorithm minimizes the squared error of
the ensemble as a whole. In practice it also works well when the base learner
uses a heuristic approximation instead, such as the regression and model tree
learners described in Section 6.5. In fact, there is no point in using standard
linear regression as the base learner for additive regression, because the sum of
linear regression models is again a linear regression model and the regression
algorithm itself minimizes the squared error. However, it is a different story if
the base learner is a regression model based on a single attribute, the one that
minimizes the squared error. Statisticians call this simple linear regression, in
contrast to the standard multiattribute method, properly called multiple linear
regression. In fact, using additive regression in conjunction with simple linear
regression and iterating until the squared error of the ensemble decreases no
further yields an additive model identical to the least-squares multiple linear
regression function.
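The following rough sketch (again ours, using only NumPy; the names
fit_simple_linear and additive_simple_regression are illustrative) shows
simple linear regression as the base learner, with iteration stopping once
the ensemble’s squared error on the training data no longer decreases.

import numpy as np

def fit_simple_linear(X, y):
    """Least-squares line on the single attribute that minimizes squared error."""
    best = None
    for j in range(X.shape[1]):
        a, b = np.polyfit(X[:, j], y, 1)                 # y ~ a * x_j + b
        sse = np.sum((y - (a * X[:, j] + b)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, j, a, b)
    _, j, a, b = best
    return lambda Xnew: a * Xnew[:, j] + b

def additive_simple_regression(X, y, tol=1e-12, max_iter=1000):
    """Add simple linear regression models until the ensemble's squared error
    on the training data stops decreasing; the resulting additive model
    converges toward the least-squares multiple regression fit."""
    X = np.asarray(X, dtype=float)
    residuals = np.asarray(y, dtype=float).copy()
    models, prev_sse = [], np.inf
    for _ in range(max_iter):
        m = fit_simple_linear(X, residuals)
        new_residuals = residuals - m(X)
        sse = np.sum(new_residuals ** 2)
        if prev_sse - sse <= tol:
            break
        models.append(m)
        residuals, prev_sse = new_residuals, sse
    return models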
Forward stagewise additive regression is prone to overfitting because each
model added fits the training data more closely. To decide when to stop, use
cross-validation. For example, perform a cross-validation for every number of
iterations up to a user-specified maximum and choose the one that minimizes
the cross-validated estimate of squared error. This is a good stopping
criterion because cross-validation yields a fairly reliable estimate of the
error on future data. Incidentally, using this method in conjunction with
simple linear regression as the base learner effectively combines multiple
linear regression with built-in attribute selection, because the next most
important attribute’s contribution is only included if it decreases the
cross-validated error.
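A possible sketch of this stopping criterion, assuming scikit-learn’s KFold
and DecisionTreeRegressor (the function name cv_choose_iterations is
illustrative):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def cv_choose_iterations(X, y, max_iter=50, n_splits=10, seed=0):
    """Return the number of iterations that minimizes the cross-validated
    squared error of the additive-regression ensemble."""
    X, y = np.asarray(X), np.asarray(y, dtype=float)
    cv_sse = np.zeros(max_iter)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in folds.split(X):
        residuals = y[train].copy()
        test_pred = np.zeros(len(test))
        for i in range(max_iter):
            m = DecisionTreeRegressor(max_depth=3).fit(X[train], residuals)
            residuals -= m.predict(X[train])
            test_pred += m.predict(X[test])
            # Squared error of the (i + 1)-model ensemble on the held-out fold
            cv_sse[i] += np.sum((y[test] - test_pred) ** 2)
    return int(np.argmin(cv_sse)) + 1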
For implementation convenience, forward stagewise additive regression usually
begins with a level-0 model that simply predicts the mean of the class on the
training data so that every subsequent model fits residuals. This suggests
another possibility for preventing overfitting: instead of subtracting a
model’s entire prediction to generate target values for the next model, shrink
the predictions by multiplying them by a user-specified constant factor
between 0 and 1 before subtracting. This reduces the model’s fit to the
residuals and consequently reduces the chance of overfitting. Of course, it
may increase the number of iterations required to arrive at a good additive
model.
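A minimal sketch of this shrinkage idea, under the same assumptions as before
(scikit-learn’s DecisionTreeRegressor; the names additive_regression_shrunk
and predict_shrunk are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def additive_regression_shrunk(X, y, n_models=100, shrinkage=0.1, max_depth=3):
    """Start from a level-0 model that predicts the class mean, then shrink
    each subsequent model's predictions before subtracting them."""
    level0 = float(np.mean(y))
    residuals = np.asarray(y, dtype=float) - level0
    models = []
    for _ in range(n_models):
        m = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        residuals -= shrinkage * m.predict(X)    # shrink before subtracting
        models.append(m)
    return level0, models

def predict_shrunk(level0, models, X, shrinkage=0.1):
    # Predictions must be shrunk by the same factor used during training.
    return level0 + shrinkage * sum(m.predict(X) for m in models)

Smaller shrinkage factors reduce overfitting but, as noted above, typically
require more iterations to reach a good additive model.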
