for this purpose. In the words of David Wolpert, the inventor of stacking, it is
reasonable that “relatively global, smooth” level-1 generalizers should perform
well. Simple linear models or trees with linear models at the leaves usually work
well.
Stacking can also be applied to numeric prediction. In that case, the level-0
models and the level-1 model all predict numeric values. The basic mechanism
remains the same; the only difference lies in the nature of the level-1 data. In
the numeric case, each level-1 attribute represents the numeric prediction made
by one of the level-0 models, and instead of a class value the numeric target
value is attached to level-1 training instances.
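As a concrete illustration, the following sketch builds level-1 data for numeric prediction in just this way: each level-1 attribute is the out-of-fold numeric prediction of one level-0 model, and the original target value is attached. The scikit-learn API and the particular level-0 learners are illustrative assumptions, not part of the method itself; a simple linear model is used at level 1, in line with the advice above.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

def stacked_regression(X_train, y_train, X_test):
    """Numeric stacking: level-1 attributes are the level-0 predictions."""
    # Level-0 models chosen purely for illustration.
    level0 = [DecisionTreeRegressor(), KNeighborsRegressor(), LinearRegression()]

    # Level-1 training data: out-of-fold predictions of each level-0 model,
    # with the original numeric target value attached as the class.
    meta_train = np.column_stack([
        cross_val_predict(m, X_train, y_train, cv=5) for m in level0
    ])

    # Refit each level-0 model on all the training data for use at test time.
    for m in level0:
        m.fit(X_train, y_train)
    meta_test = np.column_stack([m.predict(X_test) for m in level0])

    # A simple linear model serves as the "relatively global, smooth"
    # level-1 generalizer.
    level1 = LinearRegression().fit(meta_train, y_train)
    return level1.predict(meta_test)
```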
Error-correcting output codes
Error-correcting output codes are a technique for improving the performance
of classification algorithms in multiclass learning problems. Recall from Chapter
6 that some learning algorithms—for example, standard support vector
machines—only work with two-class problems. To apply such algorithms to
multiclass datasets, the dataset is decomposed into several independent two-
class problems, the algorithm is run on each one, and the outputs of the resulting
classifiers are combined. Error-correcting output codes are a method for
making the most of this transformation. In fact, the method works so well that
it is often advantageous to apply it even when the learning algorithm can handle
multiclass datasets directly.
In Section 4.6 (page 123) we learned how to transform a multiclass dataset
into several two-class ones. For each class, a dataset is generated containing a
copy of each instance in the original data, but with a modified class value. If the
instance has the class associated with the corresponding dataset, it is tagged yes;
otherwise no. Then classifiers are built for each of these binary datasets,
classifiers that output a confidence figure with their predictions—for example, the
estimated probability that the class is yes. During classification, a test instance
is fed into each binary classifier, and the final class is the one associated with the
classifier that predicts yes most confidently. Of course, this method is sensitive
to the accuracy of the confidence figures produced by the classifiers: if some
classifiers have an exaggerated opinion of their own predictions, the overall
result will suffer.
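A minimal sketch of this one-per-class scheme, assuming scikit-learn-style estimators (the choice of logistic regression as the base learner is purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest(X_train, y_train, X_test, classes):
    """Train one yes/no classifier per class and pick the class whose
    classifier predicts 'yes' most confidently."""
    confidences = []
    for c in classes:
        # Binary relabelling: yes (1) if the instance has class c, else no (0).
        binary_y = (y_train == c).astype(int)
        clf = LogisticRegression(max_iter=1000).fit(X_train, binary_y)
        # Confidence that the class is 'yes', i.e. the estimated P(class == c).
        confidences.append(clf.predict_proba(X_test)[:, 1])

    # Final prediction: the class associated with the most confident 'yes'.
    conf_matrix = np.column_stack(confidences)
    return np.array(classes)[np.argmax(conf_matrix, axis=1)]
```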
Consider a multiclass problem with the four classes a, b, c, and d. The
transformation can be visualized as shown in Table 7.1(a), where yes and no are
mapped to 1 and 0, respectively. Each of the original class values is converted
into a 4-bit code word, 1 bit per class, and the four classifiers predict the
bits independently. Interpreting the classification process in terms of these
code words, errors occur when the wrong binary bit receives the highest
confidence.
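Expressed in code, the one-per-class code words of Table 7.1(a) and the confidence-based decoding look as follows; this is a small illustrative sketch, and the confidence values in the usage example are made up.

```python
import numpy as np

# Table 7.1(a): one-per-class code words for classes a, b, c, d
# (yes -> 1, no -> 0; one bit per class).
classes = ['a', 'b', 'c', 'd']
code_words = np.array([
    [1, 0, 0, 0],   # a
    [0, 1, 0, 0],   # b
    [0, 0, 1, 0],   # c
    [0, 0, 0, 1],   # d
])

def decode(bit_confidences):
    """Decode the four independent bit predictions back into a class.

    bit_confidences[i] is classifier i's confidence that its bit is 1.
    Each code word has exactly one 1-bit, so choosing the code word whose
    1-bit received the highest confidence amounts to choosing the most
    confident classifier; an error occurs whenever the wrong bit happens
    to receive the highest confidence.
    """
    return classes[int(np.argmax(bit_confidences))]

# Example: the classifier for class b is most confident, so 'b' is predicted.
print(decode([0.2, 0.9, 0.4, 0.1]))
```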