is not necessarily the one that performs best on the training data. We have
repeatedly cautioned about the problem of overfitting, where a learned model
is too closely tied to the particular training data from which it was built. It is
incorrect to assume that performance on the training data faithfully represents
the level of performance that can be expected on the fresh data to which the
learned model will be applied in practice.
Fortunately, we have already encountered the solution to this problem in
Chapter 5. There are two good methods for estimating the expected true per-
formance of a learning scheme: the use of a large dataset that is quite separate
from the training data, in the case of plentiful data, and cross-validation
(Section 5.3), if data is scarce. In the latter case, a single 10-fold cross-
validation is typically used in practice, although to obtain a more reliable
estimate the entire procedure should be repeated 10 times. Once suitable param-
eters have been chosen for the learning scheme, use the whole training set—all
the available training instances—to produce the final learned model that is to
be applied to fresh data.
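
To make this procedure concrete, here is a minimal sketch in Python using the scikit-learn library (an assumption for illustration only; it is not the toolkit used in this book, and the dataset and learning scheme are stand-ins). It estimates accuracy with a 10-fold cross-validation repeated 10 times and then builds the final model from all the available training instances:

# Minimal sketch: repeated 10-fold cross-validation for estimation,
# then a final model built on all the data. Assumes scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation, repeated 10 times for a more reliable estimate.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=cv)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# With the estimate in hand, the final model uses ALL available instances.
final_model = DecisionTreeClassifier(random_state=1).fit(X, y)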
Note that the performance obtained with the chosen parameter value during
the tuning process is not a reliable estimate of the final model’s performance,
because the final model potentially overfits the data that was used for tuning.
To ascertain how well it will perform, you need yet another large dataset that is
quite separate from any data used during learning and tuning. The same is true
for cross-validation: you need an “inner” cross-validation for parameter tuning
and an “outer” cross-validation for error estimation. With 10-fold cross-
validation, this involves running the learning scheme 100 times: the 10 inner
folds are executed within each of the 10 outer folds. To summarize:
when assessing the performance of a learning scheme, any parameter tuning
that goes on should be treated as though it were an integral part of the train-
ing process.
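
The nested arrangement can be sketched in the same vein, again assuming scikit-learn and a purely hypothetical parameter grid. The inner cross-validation chooses the parameter value; the outer one estimates the error of the complete tuning-plus-training procedure:

# Minimal sketch of nested cross-validation. The grid of values for
# min_samples_leaf is hypothetical, chosen only for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Inner 10-fold cross-validation: tunes the parameter.
inner = GridSearchCV(DecisionTreeClassifier(random_state=1),
                     param_grid={"min_samples_leaf": [1, 2, 5, 10]},
                     cv=10)

# Outer 10-fold cross-validation: estimates the error of the whole
# tuning-plus-training procedure, so tuning cannot bias the estimate.
outer_scores = cross_val_score(inner, X, y, cv=10)
print("accuracy estimate: %.3f" % outer_scores.mean())

Note that the parameter value selected may differ from one outer fold to the next; what the outer cross-validation estimates is the performance of the tuning procedure as a whole, not of any single parameter value.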
There are other important processes that can materially improve success
when applying machine learning techniques to practical data mining problems,
and these are the subject of this chapter. They constitute a kind of data engi-
neering: engineering the input data into a form suitable for the learning scheme
chosen and engineering the output model to make it more effective. You can
look on them as a bag of tricks that you can apply to practical data mining prob-
lems to enhance the chance of success. Sometimes they work; other times they
don’t—and at the present state of the art, it’s hard to say in advance whether
they will or not. In an area such as this where trial and error is the most reli-
able guide, it is particularly important to be resourceful and understand what
the tricks are.
We begin by examining four different ways in which the input can be mas-
saged to make it more amenable for learning methods: attribute selection,
attribute discretization, data transformation, and data cleansing. Consider the
first, attribute selection. In many practical situations there are far too many
