Pattern Recognition and Machine Learning

76 2. PROBABILITY DISTRIBUTIONS

We can solve for the Lagrange multiplier $\lambda$ by substituting (2.32) into the constraint $\sum_k \mu_k = 1$ to give $\lambda = -N$. Thus we obtain the maximum likelihood solution in the form

$$\mu_k^{\mathrm{ML}} = \frac{m_k}{N} \tag{2.33}$$

which is the fraction of the $N$ observations for which $x_k = 1$.
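As a minimal sketch of (2.33), the snippet below counts how often each component is active in a small set of 1-of-K vectors and divides by the number of observations. The function name `ml_estimate` and the toy data are illustrative assumptions, not part of the text.

```python
def ml_estimate(observations):
    """Return mu_k = m_k / N for a list of 1-of-K (one-hot) vectors."""
    n = len(observations)        # N, the total number of observations
    k = len(observations[0])     # K, the number of states
    counts = [0] * k             # m_k, the number of observations with x_k = 1
    for x in observations:
        for i, x_i in enumerate(x):
            counts[i] += x_i
    return [m / n for m in counts]


# Hypothetical toy data: N = 4 observations of a K = 3 state variable.
data = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
mu_ml = ml_estimate(data)
print(mu_ml)  # [0.5, 0.25, 0.25]
```

The estimates sum to one automatically, because the counts $m_k$ sum to $N$.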
We can consider the joint distribution of the quantities $m_1, \ldots, m_K$, conditioned on the parameters $\boldsymbol{\mu}$ and on the total number $N$ of observations. From (2.29) this takes the form

$$\mathrm{Mult}(m_1, m_2, \ldots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1\, m_2 \ldots m_K} \prod_{k=1}^{K} \mu_k^{m_k} \tag{2.34}$$

which is known as the multinomial distribution. The normalization coefficient is the number of ways of partitioning $N$ objects into $K$ groups of size $m_1, \ldots, m_K$ and is given by

$$\binom{N}{m_1\, m_2 \ldots m_K} = \frac{N!}{m_1!\, m_2! \cdots m_K!}. \tag{2.35}$$

Note that the variables $m_k$ are subject to the constraint

$$\sum_{k=1}^{K} m_k = N. \tag{2.36}$$
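The multinomial distribution (2.34) and its normalization coefficient (2.35) can be evaluated directly for small counts. The following is a sketch using only the Python standard library; the function names and the example values are assumptions chosen for illustration. The total $N$ is taken to be the sum of the counts, so the constraint (2.36) holds by construction.

```python
from math import factorial, prod


def multinomial_coefficient(counts):
    """N! / (m_1! m_2! ... m_K!) with N = sum of the counts, as in (2.35)."""
    coeff = factorial(sum(counts))
    for m in counts:
        coeff //= factorial(m)   # every intermediate quotient is an exact integer
    return coeff


def multinomial_pmf(counts, mu):
    """Mult(m_1, ..., m_K | mu, N) from (2.34), with N = sum(counts)."""
    if len(counts) != len(mu):
        raise ValueError("counts and mu must have the same length")
    return multinomial_coefficient(counts) * prod(p ** m for p, m in zip(mu, counts))


# Hypothetical example: K = 3 states, N = 4 observations.
mu = (0.5, 0.25, 0.25)
counts = (2, 1, 1)
print(multinomial_coefficient(counts))  # 12
print(multinomial_pmf(counts, mu))      # 0.1875
```

For large $N$ the factorials grow quickly, so a practical implementation would more likely work with logarithms; the direct form is kept here because it mirrors the equations.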

2.2.1 The Dirichlet distribution


We now introduce a family of prior distributions for the parameters $\{\mu_k\}$ of the multinomial distribution (2.34). By inspection of the form of the multinomial distribution, we see that the conjugate prior is given by

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} \tag{2.37}$$

where $0 \leqslant \mu_k \leqslant 1$ and $\sum_k \mu_k = 1$. Here $\alpha_1, \ldots, \alpha_K$ are the parameters of the distribution, and $\boldsymbol{\alpha}$ denotes $(\alpha_1, \ldots, \alpha_K)^{\mathrm{T}}$. Note that, because of the summation constraint, the distribution over the space of the $\{\mu_k\}$ is confined to a simplex of dimensionality $K - 1$, as illustrated for $K = 3$ in Figure 2.4.
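One way to see why the form (2.37) is conjugate is to multiply it by the $\boldsymbol{\mu}$-dependent part of the multinomial likelihood (2.34) and collect exponents. The short worked step below is offered as an illustration; the symbol $\mathbf{m}$ for the vector of counts $(m_1, \ldots, m_K)^{\mathrm{T}}$ is notation assumed here.

```latex
p(\boldsymbol{\mu} \mid \mathbf{m}, \boldsymbol{\alpha})
  \;\propto\; \prod_{k=1}^{K} \mu_k^{m_k} \, \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}
  \;=\; \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}
```

The product has the same functional form as the prior, with each $\alpha_k$ replaced by $\alpha_k + m_k$.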
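The normalized form for this distribution is given by (Exercise 2.9)

$$\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} \tag{2.38}$$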

which is called the Dirichlet distribution. Here $\Gamma(x)$ is the gamma function defined by (1.141) while

$$\alpha_0 = \sum_{k=1}^{K} \alpha_k. \tag{2.39}$$
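The density (2.38) can be evaluated numerically in log space, which avoids overflow in the gamma functions. This is a sketch using `math.lgamma` from the Python standard library; the function name `dirichlet_log_pdf`, the tolerance, and the restriction to interior points of the simplex are assumptions made for this example.

```python
from math import exp, lgamma, log


def dirichlet_log_pdf(mu, alpha):
    """Log of Dir(mu | alpha) from (2.38), for mu strictly inside the simplex."""
    if len(mu) != len(alpha):
        raise ValueError("mu and alpha must have the same length")
    # Assumption: alpha_k > 0, which the gamma-function normalizer requires.
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha_k must be positive")
    # Assumption: boundary points (mu_k = 0) are excluded to keep log() finite.
    if any(p <= 0 for p in mu) or abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("mu must lie in the interior of the simplex")
    alpha0 = sum(alpha)  # alpha_0 as defined in (2.39)
    log_norm = lgamma(alpha0) - sum(lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * log(p) for a, p in zip(alpha, mu))


# With alpha = (1, 1, 1) the density is constant over the simplex and equals
# Gamma(3) = 2, so this prints a value of approximately 2.0.
print(exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])))
```

Setting every $\alpha_k = 1$ gives a convenient sanity check, since the product in (2.38) then equals one and only the normalization coefficient remains.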