would have turned out to be useful in the learning process by using gradations
that are too coarse or by unfortunate choices of boundary that needlessly lump
together many instances of different classes.
Equal-interval binning often distributes instances very unevenly: some bins
contain many instances, and others contain none. This can seriously impair the
ability of the attribute to help to build good decision structures. It is often better
to allow the intervals to be of different sizes, choosing them so that the same
number of training examples fall into each one. This method, equal-frequency
binning, divides the attribute's range into a predetermined number of bins based
on the distribution of examples along that axis. It is sometimes called histogram
equalization, because if you take a histogram of the contents of the resulting
bins it will be completely flat. If you view the number of bins as a resource, this
method makes best use of it.
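
To make the contrast concrete, here is a minimal Python sketch (ours, not the book's or Weka's) of the two unsupervised binning schemes; the function names and the rounding rule are illustrative assumptions.

def equal_width_boundaries(values, num_bins):
    # Equal-interval binning: cut points at fixed spacing, regardless of
    # how the training values are distributed along the axis.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(1, num_bins)]

def equal_frequency_boundaries(values, num_bins):
    # Equal-frequency binning: cut points chosen so that roughly the same
    # number of training values falls into each bin.
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[round(i * n / num_bins) - 1] for i in range(1, num_bins)]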
However, equal-frequency binning is still oblivious to the instances’ classes,
and this can cause bad boundaries. For example, if all instances in a bin have
one class, and all instances in the next higher bin have another except for the
first, which has the original class, surely it makes sense to respect the class
divisions and include that first instance in the previous bin, sacrificing the equal-
frequency property for the sake of homogeneity. Supervised discretization—
taking classes into account during the process—certainly has advantages.
Nevertheless, it has been found that equal-frequency binning can yield excellent
results, at least in conjunction with the Naïve Bayes learning scheme, when the
number of bins is chosen in a data-dependent fashion by setting it to the square
root of the number of instances. This method is called proportional k-interval
discretization.
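
Proportional k-interval discretization then simply ties the number of bins to the amount of training data. A short sketch on top of the equal-frequency helper above (the clamp to at least one bin is our own assumption):

import math

def pki_boundaries(values):
    # Proportional k-interval discretization: equal-frequency binning with the
    # number of bins set to the square root of the number of instances.
    k = max(1, round(math.sqrt(len(values))))
    return equal_frequency_boundaries(values, k)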

Entropy-based discretization


Because the criterion used for splitting a numeric attribute during the formation
of a decision tree works well in practice, it seems a good idea to extend it
to more general discretization by recursively splitting intervals until it is time
to stop. In Chapter 6 we saw how to sort the instances by the attribute’s value
and consider, for each possible splitting point, the information gain of the
resulting split. To discretize the attribute, once the first split is determined the
splitting process can be repeated in the upper and lower parts of the range, and
so on, recursively.
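
A bare-bones Python sketch of this recursive procedure follows; it evaluates every boundary between adjacent distinct values by information gain and recurses on both halves. The stopping rule here (stop when an interval is already small or no cut improves the entropy) is a placeholder of our own, not the principled criterion.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(pairs):
    # pairs: (value, class) tuples sorted by value.  Returns the cut with the
    # highest information gain as (gain, cut_value, split_index), or None.
    labels = [c for _, c in pairs]
    base, n, best = entropy(labels), len(pairs), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                    # no boundary between identical values
        left, right = labels[:i], labels[i:]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = base - remainder
        if best is None or gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2, i)
    return best

def entropy_cuts(pairs, min_size=2):
    # Recursively split the sorted (value, class) pairs, collecting cut points;
    # stop recursing on intervals that are already small or cannot be improved.
    if len(pairs) < 2 * min_size:
        return []
    found = best_cut(pairs)
    if found is None or found[0] <= 0:
        return []
    gain, cut, i = found
    return entropy_cuts(pairs[:i], min_size) + [cut] + entropy_cuts(pairs[i:], min_size)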
To see this working in practice, we revisit the example on page 189 for discretizing
the temperature attribute of the weather data, whose values are
64   65   68   69   70   71   72   72   75   75   80   81   83   85
yes  no   yes  yes  yes  no   no   yes  yes  yes  no   yes  yes  no
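
For concreteness, the sketch above can be run directly on these values and classes (the repeated values 72 and 75 each carry their own class label); the cut points it prints depend on the naive stopping rule, so they should be read only as an illustration of the procedure.

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
         "yes", "yes", "no", "yes", "yes", "no"]
print(entropy_cuts(sorted(zip(temps, play))))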
