Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

The overall effect is that the information gain measure tends to prefer attributes
with large numbers of possible values. To compensate for this, a modification
of the measure called the gain ratio is widely used. The gain ratio is
derived by taking into account the number and size of the daughter nodes into
which an attribute splits the dataset, disregarding any information about the
class. In the situation shown in Figure 4.5, all counts have a value of 1, so the
information value of the split is

info([1, 1, ..., 1]) = −1/14 × log 1/14 × 14,

because the same fraction, 1/14, appears 14 times. This amounts to log 14, or
3.807 bits, which is a very high value. This is because the information value of
a split is the number of bits needed to determine to which branch each instance
is assigned, and the more branches there are, the greater this value is. The gain
ratio is calculated by dividing the original information gain, 0.940 in this case,
by the information value of the attribute, 3.807, yielding a gain ratio value of
0.247 for the ID code attribute.
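As a quick check of this arithmetic, the split information and gain ratio for the hypothetical ID code attribute can be sketched in Python (the 0.940-bit information gain is taken from the text; the helper name split_info is ours):

```python
from math import log2

def split_info(sizes):
    """Information value of a split: the number of bits needed to tell
    which branch an instance goes to, ignoring the class labels."""
    total = sum(sizes)
    return -sum(n / total * log2(n / total) for n in sizes)

# ID code splits the 14 weather instances into 14 branches of size 1
si = split_info([1] * 14)     # log 14 = 3.807 bits
ratio = 0.940 / si            # original information gain / split info
print(round(si, 3), round(ratio, 3))   # 3.807 0.247
```

The more branches a split has, the larger split_info becomes, which is exactly the penalty the gain ratio applies.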
Returning to the tree stumps for the weather data in Figure 4.2, outlook splits
the dataset into three subsets of size 5, 4, and 5 and thus has an intrinsic information
value of

info([5, 4, 5]) = 1.577

without paying any attention to the classes involved in the subsets. As we have
seen, this intrinsic information value is higher for a more highly branching
attribute such as the hypothesized ID code. Again we can correct the information
gain by dividing by the intrinsic information value to get the gain ratio.
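The same correction for outlook, again as an illustrative sketch using the figures quoted in the text:

```python
from math import log2

def split_info(sizes):
    """Intrinsic information value of a split, in bits."""
    total = sum(sizes)
    return -sum(n / total * log2(n / total) for n in sizes)

# outlook splits the 14 instances into subsets of size 5, 4, and 5
si = split_info([5, 4, 5])
print(round(si, 3))            # 1.577
print(round(0.247 / si, 3))    # gain ratio for outlook: 0.157
```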
The results of these calculations for the tree stumps of Figure 4.2 are summarized
in Table 4.7. Outlook still comes out on top, but humidity is now a much
closer contender because it splits the data into two subsets instead of three. In
this particular example, the hypothetical ID code attribute, with a gain ratio of
0.247, would still be preferred to any of these four. However, its advantage is


CHAPTER 4 | ALGORITHMS: THE BASIC METHODS


Table 4.7 Gain ratio calculations for the tree stumps of Figure 4.2.

Attribute      info    gain (0.940 − info)   split info              gain ratio
Outlook        0.693   0.247                 info([5,4,5]) = 1.577   0.247/1.577 = 0.157
Temperature    0.911   0.029                 info([4,6,4]) = 1.557   0.029/1.557 = 0.019
Humidity       0.788   0.152                 info([7,7])   = 1.000   0.152/1.000 = 0.152
Windy          0.892   0.048                 info([8,6])   = 0.985   0.048/0.985 = 0.049
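The whole of Table 4.7 can be reproduced from the yes/no class counts at the leaves of the Figure 4.2 stumps. The sketch below assumes the standard weather-data counts; the helper names entropy and stump_stats are ours:

```python
from math import log2

def entropy(counts):
    """info([...]) in bits for a list of class (or subset) counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def stump_stats(branches):
    """branches: per-branch [yes, no] class counts for one attribute.
    Returns (info, gain, split info, gain ratio)."""
    n = sum(sum(b) for b in branches)
    info = sum(sum(b) / n * entropy(b) for b in branches)
    # total information is info([9, 5]) for the weather data, ~0.940 bits
    gain = entropy([sum(b[0] for b in branches),
                    sum(b[1] for b in branches)]) - info
    split = entropy([sum(b) for b in branches])
    return info, gain, split, gain / split

# yes/no counts per branch, read off the tree stumps of Figure 4.2
stumps = {
    "outlook":     [[2, 3], [4, 0], [3, 2]],
    "temperature": [[2, 2], [4, 2], [3, 1]],
    "humidity":    [[3, 4], [6, 1]],
    "windy":       [[6, 2], [3, 3]],
}
for name, branches in stumps.items():
    info, gain, split, ratio = stump_stats(branches)
    print(f"{name:12s} info={info:.3f} gain={gain:.3f} "
          f"split info={split:.3f} gain ratio={ratio:.3f}")
```

The printed figures match Table 4.7 up to rounding; the book rounds intermediate values, so a last digit can occasionally differ by one.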
