Domingos (1997) describes how to derive a single interpretable model from
an ensemble using artificial training examples. Bayesian option trees were intro-
duced by Buntine (1992), and majority voting was incorporated into option
trees by Kohavi and Kunz (1997). Freund and Mason (1999) introduced alter-
nating decision trees; experiments with multiclass alternating decision trees
were reported by Holmes et al. (2002). Landwehr et al. (2003) developed logis-
tic model trees using the LogitBoost algorithm.
Stacked generalization originated with Wolpert (1992), who presented the
idea in the neural network literature, and was applied to numeric prediction by
Breiman (1996a). Ting and Witten (1997a) compared different level-1 models
empirically and found that a simple linear model performs best; they also
demonstrated the advantage of using probabilities as level-1 data. A combina-
tion of stacking and bagging has also been investigated (Ting and Witten
1997b).
The idea of using error-correcting output codes for classification gained wide
acceptance after a paper by Dietterich and Bakiri (1995); Ricci and Aha (1998)
showed how to apply such codes to nearest-neighbor classifiers.
Blum and Mitchell (1998) pioneered the use of co-training and developed a
theoretical model for the use of labeled and unlabeled data from different inde-
pendent perspectives. Nigam and Ghani (2000) analyzed the effectiveness and
applicability of co-training, relating it to the traditional use of standard EM to
fill in missing values. They also introduced the co-EM algorithm. Nigam et al.
(2000) thoroughly explored how the EM clustering algorithm can use unlabeled
data to improve an initial classifier built by Naïve Bayes, as reported in the
Clustering for classification section. Up to this point, co-training and co-EM
were applied mainly to small two-class problems; Ghani (2002) used error-
correcting output codes to address multiclass situations with many classes.
Brefeld and Scheffer (2004) extended co-EM to use a support vector machine
rather than Naïve Bayes. Seeger (2001) cast some doubt on whether these new
algorithms really offer anything over traditional ones when the latter are properly used.