(Repeated values have been collapsed together.) The information gain for each
of the 11 possible positions for the breakpoint is calculated in the usual way.
For example, the information value of the test temperature < 71.5, which splits
the range into four yes's and two no's versus five yes's and three no's, is

info([4,2],[5,3]) = (6/14) × info([4,2]) + (8/14) × info([5,3]) = 0.939 bits

This represents the amount of information required to specify the individual
values of yes and no given the split. We seek a discretization that makes the
subintervals as pure as possible; hence, we choose to split at the point where
the information value is smallest. (This is the same as splitting where the
information gain, defined as the difference between the information value
without the split and that with the split, is largest.) As before, we place
numeric thresholds halfway between the values that delimit the boundaries of
a concept.
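
To make the arithmetic concrete, here is a minimal sketch in Java (an
illustration only, not Weka's implementation): info() is the entropy of a
class distribution, and infoOfSplit() is the size-weighted average entropy of
the subintervals a split produces. The counts [4,2] and [5,3] and the weights
6/14 and 8/14 are the ones from the temperature < 71.5 example above.

public class InfoValue {

    // Entropy, in bits, of a set of class counts.
    static double info(int... counts) {
        int total = 0;
        for (int c : counts) total += c;
        double bits = 0.0;
        for (int c : counts) {
            if (c == 0) continue;                  // 0 log 0 is taken as 0
            double p = (double) c / total;
            bits -= p * Math.log(p) / Math.log(2); // log base 2
        }
        return bits;
    }

    // Information value of a split: subinterval entropies weighted by size.
    static double infoOfSplit(int[]... partitions) {
        int total = 0;
        for (int[] part : partitions)
            for (int c : part) total += c;
        double bits = 0.0;
        for (int[] part : partitions) {
            int size = 0;
            for (int c : part) size += c;
            bits += (double) size / total * info(part);
        }
        return bits;
    }

    public static void main(String[] args) {
        // temperature < 71.5 splits the 14 instances into [4 yes, 2 no] and
        // [5 yes, 3 no]: 6/14 * 0.918 + 8/14 * 0.954 = 0.939 bits
        System.out.printf("%.3f bits%n",
                infoOfSplit(new int[]{4, 2}, new int[]{5, 3}));
    }
}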
The graph labeled A in Figure 7.2 shows the information values at each possible
cut point at this first stage. The cleanest division, the one with the smallest
information value, is at a temperature of 84 (0.827 bits), which separates off
just the very final value, a no instance, from the preceding list. The instance
classes are written below the horizontal axis to make interpretation easier.
Invoking the algorithm again on the lower range of temperatures, from 64 to 83,
yields the graph labeled B. This has a minimum at 80.5 (0.800 bits), which
splits off the next two values, both yes instances; a sketch of this recursive
search follows.
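
The following sketch repeats this search over any subrange (again an
illustration, not Weka's code). It assumes the 14 sorted temperature values
and yes/no class labels of the weather data used throughout the book, scans
every boundary between adjacent distinct values, and reports the cut with the
smallest information value; run on the full range and then on the lower range,
it reproduces the two stages just described.

public class EntropyDiscretize {

    // Sorted temperature values and class labels of the 14 weather instances.
    static final double[] TEMP =
            {64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85};
    static final boolean[] YES =
            {true, false, true, true, true, false, false, true,
             true, true, false, true, true, false};

    // Entropy, in bits, of a two-class distribution.
    static double info(int yes, int no) {
        double bits = 0.0;
        for (int c : new int[]{yes, no}) {
            if (c == 0) continue;                  // 0 log 0 is taken as 0
            double p = (double) c / (yes + no);
            bits -= p * Math.log(p) / Math.log(2);
        }
        return bits;
    }

    // Scan every boundary between distinct values in instances lo..hi and
    // report the cut with the smallest information value. Skipping the
    // boundaries between equal values leaves the 11 candidate positions
    // the text mentions for the full range.
    static void bestCut(int lo, int hi) {
        int n = hi - lo + 1;
        double bestBits = Double.MAX_VALUE, bestThreshold = 0;
        for (int i = lo; i < hi; i++) {
            if (TEMP[i] == TEMP[i + 1]) continue;
            int ly = 0, ln = 0, ry = 0, rn = 0;    // left/right class counts
            for (int j = lo; j <= hi; j++) {
                if (j <= i) { if (YES[j]) ly++; else ln++; }
                else        { if (YES[j]) ry++; else rn++; }
            }
            double bits = (double) (ly + ln) / n * info(ly, ln)
                        + (double) (ry + rn) / n * info(ry, rn);
            if (bits < bestBits) {
                bestBits = bits;
                // threshold halfway between the delimiting values
                bestThreshold = (TEMP[i] + TEMP[i + 1]) / 2;
            }
        }
        System.out.printf("best cut at %.1f (%.3f bits)%n",
                bestThreshold, bestBits);
    }

    public static void main(String[] args) {
        bestCut(0, 13);  // stage A, full range: best cut at 84.0 (0.827 bits)
        bestCut(0, 12);  // stage B, 64 to 83:   best cut at 80.5 (0.800 bits)
    }
}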



[Figure: information value in bits (vertical axis, 0 to 1) at each candidate
cut point, plotted against temperature (horizontal axis, 65 to 85); curves A
through F mark the successive stages of the recursive split, and the sequence
of instance classes (yes/no) appears below the horizontal axis.]

Figure 7.2 Discretizing the temperature attribute using the entropy method.
