Of course, there is nothing special about these particular numbers, and a similar
relationship must hold regardless of the actual values. Thus we can add a further
criterion to the preceding list:
- The information must obey the multistage property illustrated previously.
Remarkably, it turns out that there is only one function that satisfies all these
properties, and it is known as the information value or entropy:

entropy(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \ldots - p_n \log p_n
The reason for the minus signs is that logarithms of the fractions p_1, p_2, \ldots, p_n
are negative, so the entropy is actually positive. Usually the logarithms are
expressed in base 2, and then the entropy is in units called bits (just the usual
kind of bits used with computers).
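To see the formula in action, here is a minimal Python sketch (the function name entropy_bits is my own, not from the book) that evaluates the entropy of a probability distribution using base-2 logarithms:

```python
import math

def entropy_bits(probabilities):
    """Entropy of a probability distribution, in bits (base-2 logs).

    The probabilities are assumed to sum to one; terms with p = 0
    contribute nothing, since p log p -> 0 as p -> 0.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# entropy(2/9, 3/9, 4/9), used as a running example below
print(entropy_bits([2/9, 3/9, 4/9]))  # about 1.531 bits
```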
The arguments p_1, p_2, \ldots of the entropy formula are expressed as fractions
that add up to one, so that, for example,

info([2, 3, 4]) = entropy(2/9, 3/9, 4/9)
Thus the multistage decision property can be written in general as

entropy(p, q, r) = entropy(p, q + r) + (q + r) \times entropy(q/(q + r), r/(q + r))

where p + q + r = 1.
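As a quick numerical check (a sketch under the same assumptions as the block above, not from the book), both sides of the identity can be evaluated for p = 2/9, q = 3/9, r = 4/9:

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p, q, r = 2/9, 3/9, 4/9
lhs = entropy_bits([p, q, r])
rhs = entropy_bits([p, q + r]) + (q + r) * entropy_bits([q/(q + r), r/(q + r)])
print(lhs, rhs)  # both about 1.5305, confirming the multistage property
```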
Because of the way the log function works, you can calculate the information
measure without having to work out the individual fractions:

info([2, 3, 4]) = -2/9 \times \log(2/9) - 3/9 \times \log(3/9) - 4/9 \times \log(4/9)
               = [-2 \log 2 - 3 \log 3 - 4 \log 4 + 9 \log 9] / 9
This is the way that the information measure is usually calculated in
practice. So the information value for the first leaf node of the first tree in
Figure 4.2 is

info([2, 3]) = -2/5 \times \log(2/5) - 3/5 \times \log(3/5) = 0.971 bits

as stated on page 98.
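This shortcut translates directly into code; the sketch below (function name my own, not from the book) computes the information value straight from the class counts, so the individual fractions never have to be formed:

```python
import math

def info_bits(counts):
    """Information value of a list of class counts, in bits.

    Uses the rearrangement
        info([c1, ..., ck]) = (-c1 log c1 - ... - ck log ck + n log n) / n
    with n = c1 + ... + ck, so no explicit fractions are needed.
    """
    n = sum(counts)
    return (-sum(c * math.log2(c) for c in counts if c > 0)
            + n * math.log2(n)) / n

print(info_bits([2, 3]))     # 0.971 bits, as computed above
print(info_bits([2, 3, 4]))  # about 1.531 bits
```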
Highly branching attributes
When some attributes have a large number of possible values, giving rise to a
multiway branch with many child nodes, a problem arises with the information
gain calculation. The problem can best be appreciated in the extreme case when
an attribute has a different value for each instance in the dataset—as, for
example, an identification code attribute might. A short sketch of this extreme case follows below.
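The sketch (hypothetical numbers, reusing the info_bits helper from above) shows why this is a problem: splitting on an identification code puts each instance in its own child node, so every child is pure, its information value is zero, and the apparent information gain is the maximum achievable:

```python
import math

def info_bits(counts):
    n = sum(counts)
    return (-sum(c * math.log2(c) for c in counts if c > 0)
            + n * math.log2(n)) / n

# Nine instances with class counts [2, 3, 4], reusing the numbers above.
info_before = info_bits([2, 3, 4])  # about 1.531 bits

# Splitting on an ID code yields nine children of one instance each;
# every child is pure, so each contributes info_bits([1]) = 0 bits.
info_after = sum((1/9) * info_bits([1]) for _ in range(9))  # 0 bits

print(info_before - info_after)  # gain of 1.531 bits, the maximum possible
```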