
Figure 1.29  Plots of the quantity $L_q = |y - t|^q$ for various values of $q$ (the four panels show $q = 0.3$, $q = 1$, $q = 2$, and $q = 10$), with $y - t$ on the horizontal axis.
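As a quick numerical companion to the figure, the following sketch (not from the book; the evaluation grid and printout are illustrative choices) computes $L_q = |y - t|^q$ over the same range of $y - t$ and the same four values of $q$ shown in the panels.

```python
import numpy as np

diff = np.linspace(-2.0, 2.0, 401)       # values of y - t, matching the horizontal axis
for q in (0.3, 1.0, 2.0, 10.0):          # the four panels of Figure 1.29
    L = np.abs(diff) ** q                # the quantity |y - t|^q
    print(f"q = {q:>4}: max of |y - t|^q on [-2, 2] is {L.max():.3f}")
```

The printout confirms what the panels show: for small $q$ the loss grows slowly away from $y = t$, while for large $q$ it is dominated by the largest deviations.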

$$ h(x) = -\log_2 p(x) \qquad (1.92) $$

where the negative sign ensures that information is positive or zero. Note that low-probability events $x$ correspond to high information content. The choice of basis for the logarithm is arbitrary, and for the moment we shall adopt the convention prevalent in information theory of using logarithms to the base of 2. In this case, as we shall see shortly, the units of $h(x)$ are bits (‘binary digits’).
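To make (1.92) concrete, here is a minimal sketch (the example probabilities are illustrative assumptions, not taken from the text) showing the information content in bits of a few events.

```python
import math

def information_content(p):
    """h(x) = -log2 p(x): information content in bits, as in (1.92)."""
    return -math.log2(p)

print(information_content(0.5))    # 1.0 bit  (a fair coin toss)
print(information_content(0.125))  # 3.0 bits (a 1-in-8 event: rarer, hence more informative)
```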
Now suppose that a sender wishes to transmit the value of a random variable to
a receiver. The average amount of information that they transmit in the process is
obtained by taking the expectation of (1.92) with respect to the distribution $p(x)$ and
is given by
$$ \mathrm{H}[x] = -\sum_x p(x) \log_2 p(x) \qquad (1.93) $$

This important quantity is called the entropy of the random variable $x$. Note that $\lim_{p \to 0} p \ln p = 0$, and so we shall take $p(x)\ln p(x) = 0$ whenever we encounter a value for $x$ such that $p(x) = 0$.
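The following sketch implements (1.93) with that convention (the example distributions are illustrative assumptions, not from the text); terms with $p(x) = 0$ are simply skipped, since they contribute nothing to the sum.

```python
import math

def entropy_bits(probs):
    """H[x] = -sum_x p(x) log2 p(x), as in (1.93); terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

print(entropy_bits([0.5, 0.5]))        # 1.0 bit: a fair coin
print(entropy_bits([0.25] * 4))        # 2.0 bits: a uniform variable with 4 states
print(entropy_bits([1.0, 0.0, 0.0]))   # 0.0 bits: a certain outcome carries no information
```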
So far we have given a rather heuristic motivation for the definition of informa-