
[Figure 1.30: two histograms over 30 bins; vertical axis shows probabilities (0 to 0.5); the two panels have entropies H = 1.77 and H = 3.09.]

Figure 1.30 Histograms of two probability distributions over 30 bins illustrating the higher value of the entropy H for the broader distribution. The largest entropy would arise from a uniform distribution that would give H = −ln(1/30) = 3.40.


from which we find that all of the p(x_i) are equal and are given by p(x_i) = 1/M, where M is the total number of states x_i. The corresponding value of the entropy is then H = ln M. This result can also be derived from Jensen's inequality, to be discussed shortly (Exercise 1.29). To verify that the stationary point is indeed a maximum, we can evaluate the second derivative of the entropy, which gives


$$\frac{\partial^2 \widetilde{H}}{\partial p(x_i)\,\partial p(x_j)} = -I_{ij}\,\frac{1}{p_i} \qquad (1.100)$$

where I_ij are the elements of the identity matrix. Because this matrix of second derivatives is diagonal with strictly negative entries −1/p_i, it is negative definite, confirming that the stationary point is indeed a maximum.
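As a quick numerical check of this result, the following sketch (an illustration only; the entropy helper and the example distributions are not from the text) evaluates H = −∑_i p(x_i) ln p(x_i) for distributions over M = 30 states, confirming that the uniform distribution attains the maximum value ln 30 ≈ 3.40 quoted in the caption of Figure 1.30, while a peaked distribution falls well below it.

```python
import numpy as np

def entropy(p):
    """Entropy H = -sum_i p_i ln p_i of a discrete distribution, in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # states with p_i = 0 contribute nothing
    return -np.sum(p * np.log(p))

M = 30
uniform = np.full(M, 1.0 / M)           # uniform distribution over M = 30 states
print(entropy(uniform), np.log(M))      # both ~3.401 = ln 30, the maximum possible

# Any non-uniform distribution over the same states has lower entropy,
# e.g. one sharply peaked on a single state (values chosen for illustration):
peaked = np.array([0.71] + [0.01] * 29)
print(entropy(peaked))                  # ~1.58, well below ln 30
```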
We can extend the definition of entropy to include distributions p(x) over continuous variables x as follows. First divide x into bins of width ∆. Then, assuming p(x) is continuous, the mean value theorem (Weisstein, 1999) tells us that, for each such bin, there must exist a value x_i such that
$$\int_{i\Delta}^{(i+1)\Delta} p(x)\,\mathrm{d}x = p(x_i)\Delta. \qquad (1.101)$$

We can now quantize the continuous variable x by assigning any value x to the value x_i whenever x falls in the ith bin. The probability of observing the value x_i is then p(x_i)∆. This gives a discrete distribution for which the entropy takes the form

$$H_{\Delta} = -\sum_i p(x_i)\Delta \,\ln\bigl(p(x_i)\Delta\bigr) = -\sum_i p(x_i)\Delta \,\ln p(x_i) - \ln \Delta \qquad (1.102)$$

where we have used ∑_i p(x_i)∆ = 1, which follows from (1.101). We now omit the second term −ln ∆ on the right-hand side of (1.102) and then consider the limit ∆ → 0.
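To illustrate this construction numerically, the sketch below (an illustration only; the choice of a unit Gaussian and the use of bin midpoints as stand-ins for the x_i of (1.101) are assumptions, not part of the text) computes H_∆ for successively finer bins. H_∆ itself grows like −ln ∆ as the bins shrink, while the first term of (1.102), equal to H_∆ + ln ∆, remains stable.

```python
import numpy as np

def gaussian_pdf(x):
    """Density of a unit Gaussian, chosen here purely for illustration."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-10.0, 10.0 + delta, delta)   # bins covering essentially all the mass
    centres = 0.5 * (edges[:-1] + edges[1:])        # midpoints approximate the x_i of (1.101)
    p_bin = gaussian_pdf(centres) * delta           # bin probabilities p(x_i) * Delta
    p_bin = p_bin[p_bin > 0]
    H_delta = -np.sum(p_bin * np.log(p_bin))        # discrete entropy H_Delta of (1.102)
    print(f"Delta={delta:5.2f}  H_Delta={H_delta:.3f}  "
          f"H_Delta + ln Delta={H_delta + np.log(delta):.3f}")
```

The stabilizing value (about 1.42 nats for this particular density) is the quantity that survives once the term −ln ∆ is omitted and the limit is taken.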