Pattern Recognition and Machine Learning

1.6. Information Theory

the number of different ways of allocating the objects to the bins. There are $N$
ways to choose the first object, $(N-1)$ ways to choose the second object, and
so on, leading to a total of $N!$ ways to allocate all $N$ objects to the bins, where $N!$
(pronounced 'factorial $N$') denotes the product $N \times (N-1) \times \cdots \times 2 \times 1$. However,
we don't wish to distinguish between rearrangements of objects within each bin. In
the $i$th bin there are $n_i!$ ways of reordering the objects, and so the total number of
ways of allocating the $N$ objects to the bins is given by

\[
W = \frac{N!}{\prod_i n_i!}
\tag{1.94}
\]

which is called the multiplicity. The entropy is then defined as the logarithm of the
multiplicity scaled by an appropriate constant

\[
H = \frac{1}{N} \ln W = \frac{1}{N} \ln N! - \frac{1}{N} \sum_i \ln n_i! \, .
\tag{1.95}
\]
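As a concrete check of (1.94) and (1.95), the following minimal sketch (an illustration added here; the allocation of $N = 10$ objects into three bins is an arbitrary assumed example, not from the text) computes the multiplicity and the scaled log-multiplicity directly:

```python
import math

# A minimal sketch (example values assumed, not from the text): compute the
# multiplicity W of (1.94) and the entropy H = (1/N) ln W of (1.95) for a
# small allocation of N = 10 objects into three bins.

def multiplicity(counts):
    """W = N! / prod_i n_i!, the number of distinct ways to fill the bins."""
    N = sum(counts)
    W = math.factorial(N)
    for n in counts:
        W //= math.factorial(n)  # divide out reorderings within each bin
    return W

def entropy(counts):
    """H = (1/N) ln W, equation (1.95)."""
    return math.log(multiplicity(counts)) / sum(counts)

counts = [5, 3, 2]               # occupation numbers n_i with N = 10
print(multiplicity(counts))      # 2520 distinct microstates
print(entropy(counts))           # ~0.783 nats
```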

We now consider the limit $N \to \infty$, in which the fractions $n_i/N$ are held fixed, and
apply Stirling's approximation

\[
\ln N! \simeq N \ln N - N
\tag{1.96}
\]

which gives
\[
H = -\lim_{N \to \infty} \sum_i \left( \frac{n_i}{N} \right) \ln \left( \frac{n_i}{N} \right)
  = -\sum_i p_i \ln p_i
\tag{1.97}
\]

where we have used $\sum_i n_i = N$. Here $p_i = \lim_{N \to \infty} (n_i/N)$ is the probability
of an object being assigned to the $i$th bin. In physics terminology, a specific
arrangement of the objects in the bins is called a microstate, and the overall distribution
of occupation numbers, expressed through the ratios $n_i/N$, is called a macrostate.
The multiplicity $W$ is also known as the weight of the macrostate.
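The limit in (1.97) can be checked numerically. The sketch below (an added illustration; the fixed fractions $p = (0.5, 0.3, 0.2)$ are assumed, not from the text) evaluates the exact entropy $(1/N) \ln W$ for increasing $N$ via the log-gamma function and compares it with $-\sum_i p_i \ln p_i$:

```python
import math

# A minimal sketch (fractions p chosen arbitrarily): with n_i/N held fixed,
# the exact entropy (1/N) ln W of (1.95) approaches -sum_i p_i ln p_i of
# (1.97) as N grows.  math.lgamma(N + 1) computes ln N! without overflow.
p = [0.5, 0.3, 0.2]

def exact_entropy(N):
    counts = [round(pi * N) for pi in p]          # occupation numbers n_i
    ln_W = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return ln_W / N

limit = -sum(pi * math.log(pi) for pi in p)       # ~1.0297 nats
for N in (10, 100, 10_000, 1_000_000):
    print(N, exact_entropy(N))                    # approaches the limit
print("limit:", limit)
```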
We can interpret the bins as the states $x_i$ of a discrete random variable $X$, where
$p(X = x_i) = p_i$. The entropy of the random variable $X$ is then

\[
H[p] = -\sum_i p(x_i) \ln p(x_i).
\tag{1.98}
\]
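To make (1.98) concrete, the sketch below (an added example over an arbitrary set of 30 states) compares the entropy of a sharply peaked distribution with that of a uniform one, anticipating the comparison discussed next:

```python
import math

# A minimal sketch illustrating (1.98) with an arbitrary 30-state example:
# a sharply peaked distribution has low entropy, a uniform one has high
# entropy.  The convention 0 ln 0 = 0 handles zero-probability states.

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [1.0 / 30] * 30                 # evenly spread over 30 states
peaked = [0.9] + [0.1 / 29] * 29          # most mass on a single state

print(entropy(peaked))    # ~0.66 nats: sharply peaked, low entropy
print(entropy(uniform))   # ln 30 ~ 3.40 nats: evenly spread, high entropy
```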

Distributions $p(x_i)$ that are sharply peaked around a few values will have a relatively
low entropy, whereas those that are spread more evenly across many values will
have higher entropy, as illustrated in Figure 1.30. Because $0 \leq p_i \leq 1$, the entropy
is nonnegative, and it will equal its minimum value of 0 when one of the $p_i = 1$
and all other $p_{j \neq i} = 0$. The maximum entropy configuration can be found by
maximizing $H$ using a Lagrange multiplier to enforce the normalization constraint
on the probabilities (Appendix E). Thus we maximize


\[
\widetilde{H} = -\sum_i p(x_i) \ln p(x_i) + \lambda \left( \sum_i p(x_i) - 1 \right)
\tag{1.99}
\]
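Setting the derivative of $\widetilde{H}$ with respect to $p(x_i)$ to zero gives $-\ln p(x_i) - 1 + \lambda = 0$, so every $p(x_i)$ takes the same value, and the stationary point is the uniform distribution with entropy $\ln M$ for $M$ states. As an added numerical sanity check (not part of the text), the sketch below samples random normalized distributions and confirms that none exceeds $\ln M$:

```python
import math, random

# A minimal sketch (an added numerical check): random normalized
# distributions over M states never exceed the entropy ln M of the uniform
# distribution, the stationary point of (1.99).
M = 5
ln_M = math.log(M)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

random.seed(0)
best = 0.0
for _ in range(100_000):
    w = [random.random() for _ in range(M)]
    s = sum(w)
    best = max(best, entropy([wi / s for wi in w]))

print(best, "<=", ln_M)   # the bound ln M is never exceeded
```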