Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

Dougherty et al. (1995) give a brief account of supervised and unsupervised
discretization, along with experimental results comparing the entropy-based
method with equal-width binning and the 1R method. Frank and Witten (1999)
describe the effect of using the ordering information in discretized attributes.
Proportional k-interval discretization for Naïve Bayes was proposed by Yang and
Webb (2001). The entropy-based method for discretization, including the use
of the MDL stopping criterion, was developed by Fayyad and Irani (1993). The
bottom-up statistical method using the χ² test is due to Kerber (1992), and its
extension to an automatically determined significance level is described by Liu
and Setiono (1997). Fulton et al. (1995) investigate the use of dynamic
programming for discretization and derive the quadratic time bound for a general
impurity function (e.g., entropy) and the linear one for error-based
discretization. The example used for showing the weakness of error-based
discretization is adapted from Kohavi and Sahami (1996), who were the first to
clearly identify this phenomenon.
Principal components analysis is a standard technique that can be found in
most statistics textbooks. Fradkin and Madigan (2003) analyze the performance
of random projections. The TF × IDF metric is described by Witten et al.
(1999b).
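For orientation, the metric itself is simply term frequency multiplied by inverse document frequency. The following is a minimal sketch in Python; the function and parameter names are chosen here for illustration, and the exact log base and smoothing conventions vary between descriptions, so this is not the specific formulation of Witten et al. (1999b).

```python
# Minimal TF x IDF sketch; names and the log/smoothing conventions are
# illustrative assumptions, not taken from the cited source.
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    tf = term_count / doc_length                 # how often the term occurs in this document
    idf = math.log(num_docs / docs_with_term)    # how rare the term is across the corpus
    return tf * idf
```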
The experiments on using C4.5 to filter its own training data were reported
by John (1995). The more conservative approach of a consensus filter involving
several learning algorithms has been investigated by Brodley and Friedl
(1996). Rousseeuw and Leroy (1987) describe the detection of outliers in sta-
tistical regression, including the least median of squares method; they also
present the telephone data of Figure 7.6. It was Quinlan (1986) who noticed
that removing noise from the training instances’ attributes can decrease a
classifier’s performance on similarly noisy test instances, particularly at higher
noise levels.
Combining multiple models is a popular research topic in machine learning,
with many related publications. The term bagging (for “bootstrap
aggregating”) was coined by Breiman (1996b), who investigated the properties
of bagging theoretically and empirically for both classification and numeric
prediction; the basic procedure is sketched after this paragraph.
Domingos (1999) introduced the MetaCost algorithm. Randomization
was evaluated by Dietterich (2000) and compared with bagging and boosting.
Bay (1999) suggests using randomization for ensemble learning with nearest-
neighbor classifiers. Random forests were introduced by Breiman (2001).
Freund and Schapire (1996) developed the AdaBoost.M1 boosting algorithm
and derived theoretical bounds for its performance. Later, they improved these
bounds using the concept of margins (Freund and Schapire 1999). Drucker
(1997) adapted AdaBoost.M1 for numeric prediction. The LogitBoost algorithm
was developed by Friedman et al. (2000). Friedman (2001) describes how to
make boosting more resilient in the presence of noisy data.
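As referenced above, here is a minimal, hedged sketch of the bootstrap aggregating idea: each base model is trained on a bootstrap sample (drawn with replacement) from the training set, and predictions are combined by majority vote. The function names and the callable-model convention are illustrative assumptions, not code from the sources cited.

```python
# Bagging ("bootstrap aggregating") sketch; build_classifier is assumed to
# return a callable model mapping an instance to a class label.
import random
from collections import Counter

def bag(train, build_classifier, n_models=10):
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(train) instances with replacement.
        sample = [random.choice(train) for _ in train]
        models.append(build_classifier(sample))
    return models

def predict(models, instance):
    # Combine the base models' predictions by majority vote.
    votes = Counter(model(instance) for model in models)
    return votes.most_common(1)[0][0]
```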
