and variance: this is the bias–variance decomposition.^4 Combining multiple
classifiers decreases the expected error by reducing the variance component. The
more classifiers that are included, the greater the reduction in variance.
Of course, a difficulty arises when putting this voting method into practice:
usually there’s only one training set, and obtaining more data is either impossible or expensive.
Bagging attempts to neutralize the instability of learning methods by simulating the process described previously using a given training set. Instead of sampling a fresh, independent training dataset each time, the original training data is altered by deleting some instances and replicating others. Instances are randomly sampled, with replacement, from the original dataset to create a new one of the same size. This sampling procedure inevitably replicates some of
the instances and deletes others. If this idea strikes a chord, it is because we
described it in Chapter 5 when explaining the bootstrap method for estimating
the generalization error of a learning method (Section 5.4): indeed, the term
bagging stands for bootstrap aggregating. Bagging applies the learning scheme—for example, a decision tree inducer—to each one of these artificially derived datasets, and the classifiers generated from them vote for the class to be predicted. The algorithm is summarized in Figure 7.7.
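
To make the procedure in Figure 7.7 concrete, the following is a minimal Python sketch of bagging for classification, not the book's own code: the scikit-learn decision tree stands in for "a decision tree inducer", and the function names, the default of ten bagged models, and the assumption that X and y are NumPy arrays are illustrative choices.

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier  # stands in for a decision tree inducer

    def bagging_fit(X, y, n_models=10, seed=None):
        """Build n_models classifiers, each from a bootstrap sample of (X, y)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        models = []
        for _ in range(n_models):
            # sample n instances with replacement: some are replicated, others left out
            idx = rng.integers(0, n, size=n)
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Let every classifier vote and return the majority class for each instance."""
        votes = np.array([m.predict(X) for m in models])  # shape (n_models, n_instances)
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

For numeric prediction, as described below, the vote would simply be replaced by an average of the individual predictions, for example votes.mean(axis=0).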
The difference between bagging and the idealized procedure described previously is the way in which the training datasets are derived. Instead of obtaining independent datasets from the domain, bagging just resamples the original
training data. The datasets generated by resampling are different from one
another but are certainly not independent because they are all based on one
dataset. However, it turns out that bagging produces a combined model that
often performs significantly better than the single model built from the original training data, and is never substantially worse.
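
One way to see how far these resampled datasets are from being independent is to measure how much of the original data each one contains. The quick check below (an illustration added here, not something from the book) shows that a bootstrap sample of the same size contains, on average, only about 63.2% of the distinct original instances, the same 1 - 1/e figure that underlies the 0.632 bootstrap of Section 5.4; the remaining weight comes from replicated instances, so any two samples share a large common core.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000                                   # size of the original training set
    # fraction of distinct original instances appearing in each of 100 bootstrap samples
    fractions = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(100)]
    print(sum(fractions) / len(fractions))     # close to 1 - 1/e, i.e. about 0.632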
Bagging can also be applied to learning methods for numeric prediction—
for example, model trees. The only difference is that, instead of voting on the
outcome, the individual predictions, being real numbers, are averaged. The
bias–variance decomposition can be applied to numeric prediction as well by
decomposing the expected value of the mean-squared error of the predictions
on fresh data. Bias is defined as the mean-squared error expected when averaging over models built from all possible training datasets of the same size, and
variance is the component of the expected error of a single model that is due
to the particular training data it was built from. It can be shown theoretically
that averaging over multiple models built from independent training sets always reduces the expected value of the mean-squared error.
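
In symbols (a sketch in standard notation, added here for clarity rather than taken from the book), write f_D for the model built from training set D and \bar{f}(x) = E_D[f_D(x)] for the average over all training sets of the same size. The decomposition just described is then

    E_{D,(x,y)}[(y - f_D(x))^2] = E_{(x,y)}[(y - \bar{f}(x))^2] + E_{D,x}[(f_D(x) - \bar{f}(x))^2]
                                = bias                          + variance

Averaging the predictions of M models built from independent training sets leaves the bias term unchanged but divides the variance term by M, which is why the combined model can never be worse in expectation than a single one.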



^4 This is a simplified version of the full story. Several different methods for performing the bias–variance decomposition can be found in the literature; there is no agreed way of doing this.
