Pattern Recognition and Machine Learning

72 2. PROBABILITY DISTRIBUTIONS

Figure 2.2 Plots of the beta distribution Beta(μ|a, b) given by (2.13) as a function of μ for various values of the
hyperparameters a and b. Panels: (a = 0.1, b = 0.1), (a = 1, b = 1), (a = 2, b = 3), (a = 8, b = 4); in each panel the horizontal axis is μ from 0 to 1 and the vertical axis runs from 0 to 3.
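
The curves in Figure 2.2 can be reproduced numerically. The following is a minimal sketch, assuming (2.13) is the standard beta density with normalization coefficient Γ(a+b)/(Γ(a)Γ(b)); the function name beta_pdf and the sample points for μ are illustrative choices, not taken from the text.

```python
import math


def beta_pdf(mu, a, b):
    """Density Beta(mu | a, b) for 0 < mu < 1 (assumed standard form of (2.13))."""
    # Log of the normalization coefficient Gamma(a + b) / (Gamma(a) * Gamma(b)).
    # Working in log space avoids overflow for large a and b.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_density = log_norm + (a - 1) * math.log(mu) + (b - 1) * math.log(1 - mu)
    return math.exp(log_density)


# Hyperparameter pairs from the four panels of Figure 2.2.
panels = [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]

# Illustrative sample points; the endpoints 0 and 1 are excluded because
# the density diverges there when a < 1 or b < 1.
sample_points = (0.1, 0.5, 0.9)

for a, b in panels:
    values = [round(beta_pdf(mu, a, b), 3) for mu in sample_points]
    print(f"a={a}, b={b}: {values}")
```

For a = b = 1 the density is uniform on (0, 1), and for a = b = 0.1 it is symmetric about μ = 0.5 and rises steeply towards both endpoints, which matches the shapes of the first two panels.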


where l = N − m, and therefore corresponds to the number of ‘tails’ in the coin
example. We see that (2.17) has the same functional dependence on μ as the prior
distribution, reflecting the conjugacy properties of the prior with respect to the
likelihood function. Indeed, it is simply another beta distribution, and its normalization
coefficient can therefore be obtained by comparison with (2.13) to give

\[
p(\mu \mid m, l, a, b) = \frac{\Gamma(m + a + l + b)}{\Gamma(m + a)\,\Gamma(l + b)}\, \mu^{m+a-1} (1 - \mu)^{l+b-1}. \tag{2.18}
\]
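
As a rough numerical check of (2.18), the sketch below forms the posterior hyperparameters by adding the counts m and l to a and b, then verifies that the resulting density integrates to approximately one. The prior values, the counts, and the helper names (beta_posterior, beta_pdf) are hypothetical choices made for illustration only.

```python
import math


def beta_posterior(m, l, a, b):
    """Posterior hyperparameters after m observations of x=1 and l of x=0."""
    # Per (2.18), the posterior is again a beta distribution,
    # with a replaced by m + a and b replaced by l + b.
    return a + m, b + l


def beta_pdf(mu, a, b):
    """Beta density with coefficient Gamma(a+b) / (Gamma(a) * Gamma(b))."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(mu) + (b - 1) * math.log(1 - mu))


# Hypothetical prior and data: a = b = 2, with N = 10 tosses, m = 7 heads.
a_prior, b_prior = 2.0, 2.0
m, N = 7, 10
l = N - m  # number of tails

a_post, b_post = beta_posterior(m, l, a_prior, b_prior)

# Midpoint-rule integral of the posterior density over (0, 1);
# it should be close to 1 if the coefficient in (2.18) is correct.
steps = 10_000
integral = sum(beta_pdf((i + 0.5) / steps, a_post, b_post) for i in range(steps)) / steps

print(f"posterior: a={a_post}, b={b_post}, integral={integral:.6f}")
```

Nothing in the update requires integer values, so the same function works unchanged for fractional prior hyperparameters such as a = b = 0.1.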

We see that the effect of observing a data set of m observations of x = 1 and
l observations of x = 0 has been to increase the value of a by m, and the value of
b by l, in going from the prior distribution to the posterior distribution. This allows
us to provide a simple interpretation of the hyperparameters a and b in the prior as
an effective number of observations of x = 1 and x = 0, respectively. Note that
a and b need not be integers. Furthermore, the posterior distribution can act as the
prior if we subsequently observe additional data. To see this, we can imagine taking
observations one at a time and after each observation updating the current posterior