Data Mining: Practical Machine Learning Tools and Techniques, Second Edition


Calculating information


Now it is time to explain how to calculate the information measure that is used
as a basis for evaluating different splits. We describe the basic idea in this section,
then in the next we examine a correction that is usually made to counter a bias
toward selecting splits on attributes with large numbers of possible values.
Before examining the detailed formula for calculating the amount of information
required to specify the class of an example given that it reaches a tree
node with a certain number of yes's and no's, consider first the kind of
properties we would expect this quantity to have:

100 CHAPTER 4| ALGORITHMS: THE BASIC METHODS


[Figure: three tree stumps, each expanding the sunny branch of outlook with a
second attribute: (a) temperature (hot, mild, cool); (b) humidity (high,
normal); (c) windy (false, true). Each leaf lists the yes and no class labels
of the instances that reach it.]

Figure 4.3 Expanded tree stumps for the weather data.
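
As a rough illustration of how such a measure can be evaluated for the stumps in Figure 4.3, the sketch below assumes the information measure is the standard entropy in bits, which this section has not yet formally introduced. The function names `info` and `info_after_split` are hypothetical. The [yes, no] counts per leaf are an assumption taken from the five sunny instances of the standard weather data, because the figure layout itself does not pin each label to a branch.

```python
import math

def info(counts):
    """Entropy in bits of a class distribution given as a list of counts.

    Assumes the standard entropy definition; zero counts contribute nothing.
    """
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)

def info_after_split(branches):
    """Average entropy of the leaves, weighted by the instances in each leaf."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * info(b) for b in branches)

# Assumed [yes, no] counts for the instances with outlook = sunny.
sunny = [2, 3]
temperature = [[0, 2], [1, 1], [1, 0]]  # hot, mild, cool (panel a)
humidity = [[0, 3], [2, 0]]             # high, normal (panel b)

for name, branches in [("temperature", temperature), ("humidity", humidity)]:
    gain = info(sunny) - info_after_split(branches)
    print(f"{name}: gain = {gain:.3f} bits")
```

Under these assumptions the sunny node needs about 0.971 bits. Splitting on humidity leaves only pure leaves, so it removes all of that uncertainty, whereas splitting on temperature still leaves 0.4 bits on average.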