
We therefore maximize the differential entropy subject to the three constraints
\[
\int_{-\infty}^{\infty} p(x)\,dx = 1 \tag{1.105}
\]

\[
\int_{-\infty}^{\infty} x\,p(x)\,dx = \mu \tag{1.106}
\]

\[
\int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx = \sigma^2. \tag{1.107}
\]

The constrained maximization can be performed using Lagrange multipliers (Appendix E), so that we maximize the following functional with respect to p(x)



\[
-\int_{-\infty}^{\infty} p(x)\ln p(x)\,dx
+ \lambda_1 \left( \int_{-\infty}^{\infty} p(x)\,dx - 1 \right)
+ \lambda_2 \left( \int_{-\infty}^{\infty} x\,p(x)\,dx - \mu \right)
+ \lambda_3 \left( \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx - \sigma^2 \right).
\]

Using the calculus of variations (Appendix D), we set the derivative of this functional to zero, giving


\[
p(x) = \exp\left\{ -1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2 \right\}. \tag{1.108}
\]
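Written out, the condition behind (1.108) comes from setting the functional derivative of the Lagrangian with respect to p(x) to zero (Appendix D),

\[
-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2 = 0,
\]

and exponentiating this stationarity condition gives (1.108).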


The Lagrange multipliers can be found by back substitution of this result into the three constraint equations (Exercise 1.34), leading finally to the result


\[
p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} \tag{1.109}
\]

and so the distribution that maximizes the differential entropy is the Gaussian. Note
that we did not constrain the distribution to be nonnegative when we maximized the
entropy. However, because the resulting distribution is indeed nonnegative, we see
with hindsight that such a constraint is not necessary.
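As a quick numerical illustration (a sketch, not part of the text), the NumPy snippet below approximates the differential entropy of a Gaussian, a Laplace, and a uniform density, all chosen to have the same mean and variance, by quadrature on a fine grid. The Gaussian comes out largest, consistent with the maximum-entropy result above. The helper name differential_entropy and the grid settings are arbitrary choices.

```python
import numpy as np

def differential_entropy(pdf, grid):
    """Approximate H[x] = -integral of p(x) ln p(x) dx on a uniform grid."""
    p = pdf(grid)
    dx = grid[1] - grid[0]
    mask = p > 0                     # the integrand tends to 0 where p(x) = 0
    return -np.sum(p[mask] * np.log(p[mask])) * dx

mu, sigma = 0.0, 1.0
x = np.linspace(-20, 20, 400001)

# Three densities with the same mean mu and variance sigma^2.
gaussian = lambda t: np.exp(-(t - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
b = sigma / np.sqrt(2)               # Laplace scale giving variance sigma^2
laplace  = lambda t: np.exp(-np.abs(t - mu) / b) / (2 * b)
w = np.sqrt(12) * sigma               # uniform width giving variance sigma^2
uniform  = lambda t: np.where(np.abs(t - mu) <= w / 2, 1.0 / w, 0.0)

print("Gaussian:", differential_entropy(gaussian, x))  # ~1.4189 = 0.5*(1 + ln(2*pi*sigma^2))
print("Laplace :", differential_entropy(laplace, x))   # ~1.3466
print("Uniform :", differential_entropy(uniform, x))   # ~1.2425
```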
If we evaluate the differential entropy of the Gaussian (Exercise 1.35), we obtain


\[
H[x] = \frac{1}{2}\left\{ 1 + \ln(2\pi\sigma^2) \right\}. \tag{1.110}
\]


Thus we see again that the entropy increases as the distribution becomes broader,
i.e., as σ^2 increases. This result also shows that the differential entropy, unlike the
discrete entropy, can be negative, because H[x] < 0 in (1.110) for σ^2 < 1/(2πe).
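To make the sign change concrete, here is a small sketch (again not from the text) that evaluates the closed form (1.110) for a few variances, including the crossover value σ^2 = 1/(2πe); the function name and the example values are arbitrary choices.

```python
import numpy as np

def gaussian_entropy(sigma2):
    """Closed form (1.110): H[x] = 0.5 * (1 + ln(2*pi*sigma^2))."""
    return 0.5 * (1.0 + np.log(2.0 * np.pi * sigma2))

threshold = 1.0 / (2.0 * np.pi * np.e)   # below this variance, H[x] is negative
for sigma2 in [1.0, 0.1, threshold, 0.01]:
    print(f"sigma^2 = {sigma2:.4f}  ->  H[x] = {gaussian_entropy(sigma2):+.4f}")
# Prints roughly +1.4189, +0.2676, 0.0000 (at the threshold), and -0.8836.
```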
Suppose we have a joint distribution p(x,y) from which we draw pairs of values
of x and y. If a value of x is already known, then the additional information needed
to specify the corresponding value of y is given by −ln p(y|x). Thus the average
additional information needed to specify y can be written as

\[
H[y|x] = -\iint p(y,x)\ln p(y|x)\,dy\,dx \tag{1.111}
\]
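Because the double integral in (1.111) is an expectation of −ln p(y|x) under p(x,y), it can be estimated by sampling. The following NumPy sketch (an illustration, not from the text) does this for a zero-mean bivariate Gaussian with correlation ρ, for which p(y|x) is Gaussian with mean ρx and variance 1 − ρ²; the Monte Carlo estimate agrees with the closed-form value ½{1 + ln(2π(1 − ρ²))}. The correlation value, sample size, and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A correlated bivariate Gaussian: x and y have zero mean, unit variance, correlation rho.
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=1_000_000)
x, y = samples[:, 0], samples[:, 1]

# Monte Carlo estimate of H[y|x] = -E_{p(x,y)}[ln p(y|x)].
# For this model, p(y|x) is Gaussian with mean rho*x and variance 1 - rho^2.
var_cond = 1.0 - rho**2
log_p_y_given_x = -0.5 * np.log(2 * np.pi * var_cond) - (y - rho * x)**2 / (2 * var_cond)
print("Monte Carlo H[y|x]:", -log_p_y_given_x.mean())
print("Closed form       :", 0.5 * (1 + np.log(2 * np.pi * var_cond)))
```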