
[Figure 1.30: two histograms over 30 bins; vertical axis shows probabilities (0 to 0.5); the two panels have entropies H = 1.77 and H = 3.09.]

Figure 1.30 Histograms of two probability distributions over 30 bins illustrating the higher value of the entropy H for the broader distribution. The largest entropy would arise from a uniform distribution that would give H = −ln(1/30) = 3.40.


from which we find that all of the p(x_i) are equal and are given by p(x_i) = 1/M, where M is the total number of states x_i. The corresponding value of the entropy is then H = ln M. This result can also be derived from Jensen's inequality, to be discussed shortly (Exercise 1.29). To verify that the stationary point is indeed a maximum, we can evaluate the second derivative of the entropy, which gives


$$\frac{\partial^2 \widetilde{H}}{\partial p(x_i)\,\partial p(x_j)} = -I_{ij}\,\frac{1}{p_i} \qquad (1.100)$$

where I_ij are the elements of the identity matrix. Because this matrix of second derivatives is diagonal with strictly negative entries −1/p_i, it is negative definite, confirming that the stationary point is indeed a maximum.
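As a quick numerical check of this result, the following sketch (an illustration only; the entropy helper and the example distributions are not from the text) evaluates H = −∑_i p(x_i) ln p(x_i) for distributions over M = 30 states, confirming that the uniform distribution attains the maximum value ln 30 ≈ 3.40 quoted in the caption of Figure 1.30, while a peaked distribution falls well below it.

```python
import numpy as np

def entropy(p):
    """Entropy H = -sum_i p_i ln p_i of a discrete distribution, in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # states with p_i = 0 contribute nothing
    return -np.sum(p * np.log(p))

M = 30
uniform = np.full(M, 1.0 / M)           # uniform distribution over M = 30 states
print(entropy(uniform), np.log(M))      # both ~3.401 = ln 30, the maximum possible

# Any non-uniform distribution over the same states has lower entropy,
# e.g. one sharply peaked on a single state (values chosen for illustration):
peaked = np.array([0.71] + [0.01] * 29)
print(entropy(peaked))                  # ~1.58, well below ln 30
```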
We can extend the definition of entropy to include distributions p(x) over continuous variables x as follows. First divide x into bins of width ∆. Then, assuming p(x) is continuous, the mean value theorem (Weisstein, 1999) tells us that, for each such bin, there must exist a value x_i such that
$$\int_{i\Delta}^{(i+1)\Delta} p(x)\,\mathrm{d}x = p(x_i)\Delta. \qquad (1.101)$$

We can now quantize the continuous variable x by assigning any value x to the value x_i whenever x falls in the ith bin. The probability of observing the value x_i is then p(x_i)∆. This gives a discrete distribution for which the entropy takes the form

$$H_{\Delta} = -\sum_i p(x_i)\Delta \,\ln\bigl(p(x_i)\Delta\bigr) = -\sum_i p(x_i)\Delta \,\ln p(x_i) - \ln \Delta \qquad (1.102)$$

where we have used ∑_i p(x_i)∆ = 1, which follows from (1.101). We now omit the second term −ln ∆ on the right-hand side of (1.102) and then consider the limit ∆ → 0.
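To illustrate this construction numerically, the sketch below (an illustration only; the choice of a unit Gaussian and the use of bin midpoints as stand-ins for the x_i of (1.101) are assumptions, not part of the text) computes H_∆ for successively finer bins. H_∆ itself grows like −ln ∆ as the bins shrink, while the first term of (1.102), equal to H_∆ + ln ∆, remains stable.

```python
import numpy as np

def gaussian_pdf(x):
    """Density of a unit Gaussian, chosen here purely for illustration."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-10.0, 10.0 + delta, delta)   # bins covering essentially all the mass
    centres = 0.5 * (edges[:-1] + edges[1:])        # midpoints approximate the x_i of (1.101)
    p_bin = gaussian_pdf(centres) * delta           # bin probabilities p(x_i) * Delta
    p_bin = p_bin[p_bin > 0]
    H_delta = -np.sum(p_bin * np.log(p_bin))        # discrete entropy H_Delta of (1.102)
    print(f"Delta={delta:5.2f}  H_Delta={H_delta:.3f}  "
          f"H_Delta + ln Delta={H_delta + np.log(delta):.3f}")
```

The stabilizing value (about 1.42 nats for this particular density) is the quantity that survives once the term −ln ∆ is omitted and the limit is taken.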