
Figure 1.29  Plots of the quantity $L_q = |y - t|^q$ for various values of $q$ (the four panels show $q = 0.3$, $q = 1$, $q = 2$, and $q = 10$), with $y - t$ on the horizontal axis.
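As a quick numerical companion to the figure, the following sketch (not from the book; the evaluation grid and printout are illustrative choices) computes $L_q = |y - t|^q$ over the same range of $y - t$ and the same four values of $q$ shown in the panels.

```python
import numpy as np

diff = np.linspace(-2.0, 2.0, 401)       # values of y - t, matching the horizontal axis
for q in (0.3, 1.0, 2.0, 10.0):          # the four panels of Figure 1.29
    L = np.abs(diff) ** q                # the quantity |y - t|^q
    print(f"q = {q:>4}: max of |y - t|^q on [-2, 2] is {L.max():.3f}")
```

The printout confirms what the panels show: for small $q$ the loss grows slowly away from $y = t$, while for large $q$ it is dominated by the largest deviations.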

$$ h(x) = -\log_2 p(x) \qquad (1.92) $$

where the negative sign ensures that information is positive or zero. Note that low-probability events $x$ correspond to high information content. The choice of basis for the logarithm is arbitrary, and for the moment we shall adopt the convention prevalent in information theory of using logarithms to the base of 2. In this case, as we shall see shortly, the units of $h(x)$ are bits (‘binary digits’).
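To make (1.92) concrete, here is a minimal sketch (the example probabilities are illustrative assumptions, not taken from the text) showing the information content in bits of a few events.

```python
import math

def information_content(p):
    """h(x) = -log2 p(x): information content in bits, as in (1.92)."""
    return -math.log2(p)

print(information_content(0.5))    # 1.0 bit  (a fair coin toss)
print(information_content(0.125))  # 3.0 bits (a 1-in-8 event: rarer, hence more informative)
```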
Now suppose that a sender wishes to transmit the value of a random variable to
a receiver. The average amount of information that they transmit in the process is
obtained by taking the expectation of (1.92) with respect to the distribution $p(x)$ and
is given by
$$ \mathrm{H}[x] = -\sum_x p(x) \log_2 p(x) \qquad (1.93) $$

This important quantity is called the entropy of the random variable $x$. Note that $\lim_{p \to 0} p \ln p = 0$, and so we shall take $p(x)\ln p(x) = 0$ whenever we encounter a value for $x$ such that $p(x) = 0$.
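The following sketch implements (1.93) with that convention (the example distributions are illustrative assumptions, not from the text); terms with $p(x) = 0$ are simply skipped, since they contribute nothing to the sum.

```python
import math

def entropy_bits(probs):
    """H[x] = -sum_x p(x) log2 p(x), as in (1.93); terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

print(entropy_bits([0.5, 0.5]))        # 1.0 bit: a fair coin
print(entropy_bits([0.25] * 4))        # 2.0 bits: a uniform variable with 4 states
print(entropy_bits([1.0, 0.0, 0.0]))   # 0.0 bits: a certain outcome carries no information
```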
So far we have given a rather heuristic motivation for the definition of informa-