Pattern Recognition and Machine Learning

76 2. PROBABILITY DISTRIBUTIONS

We can solve for the Lagrange multiplier $\lambda$ by substituting (2.32) into the constraint $\sum_k \mu_k = 1$ to give $\lambda = -N$. Thus we obtain the maximum likelihood solution in the form

$$\mu_k^{\mathrm{ML}} = \frac{m_k}{N} \tag{2.33}$$

which is the fraction of the $N$ observations for which $x_k = 1$. We can consider the joint distribution of the quantities $m_1, \ldots, m_K$, conditioned on the parameters $\boldsymbol{\mu}$ and on the total number $N$ of observations. From (2.29) this takes the form

$$\mathrm{Mult}(m_1, m_2, \ldots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1\, m_2 \ldots m_K} \prod_{k=1}^{K} \mu_k^{m_k} \tag{2.34}$$

which is known as the multinomial distribution. The normalization coefficient is the number of ways of partitioning $N$ objects into $K$ groups of size $m_1, \ldots, m_K$ and is given by

$$\binom{N}{m_1\, m_2 \ldots m_K} = \frac{N!}{m_1!\, m_2! \ldots m_K!}. \tag{2.35}$$
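As a rough illustration of (2.34) and (2.35), the sketch below evaluates the normalization coefficient and the multinomial probability using only the Python standard library. The function names `multinomial_coefficient` and `multinomial_pmf` and the example values are assumptions made for this illustration; they do not come from the text.

```python
from math import factorial, prod


def multinomial_coefficient(counts):
    """Number of ways to partition N = sum(counts) objects into groups
    of sizes counts[0], ..., counts[K-1], following (2.35)."""
    n = sum(counts)
    denominator = prod(factorial(m) for m in counts)
    # The quotient is always an integer, so integer division is exact.
    return factorial(n) // denominator


def multinomial_pmf(counts, mu):
    """Mult(m_1, ..., m_K | mu, N) from (2.34), with N = sum(counts)."""
    if len(counts) != len(mu):
        raise ValueError("counts and mu must have the same length K")
    return multinomial_coefficient(counts) * prod(
        p ** m for p, m in zip(mu, counts)
    )


# Hypothetical example values: K = 3 states, N = 4 observations.
counts = [2, 1, 1]
mu = [0.5, 0.3, 0.2]
print(multinomial_coefficient(counts))  # 4! / (2! 1! 1!) = 12
print(multinomial_pmf(counts, mu))      # approximately 0.18
```

For $K = 2$ the coefficient reduces to the familiar binomial coefficient; for example, `multinomial_coefficient([3, 2])` gives 10.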

Note that the variables $m_k$ are subject to the constraint

$$\sum_{k=1}^{K} m_k = N. \tag{2.36}$$
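The constraint (2.36) is also what guarantees that the maximum likelihood estimates (2.33) sum to one. A minimal sketch, assuming each observation is stored as a 1-of-$K$ list of zeros and ones; the function name `ml_estimate` and the toy data are hypothetical:

```python
def ml_estimate(observations):
    """Return (counts, mu_ml) for a list of 1-of-K coded observations.

    counts[k] is m_k, the number of observations with x_k = 1, and
    mu_ml[k] = m_k / N is the maximum likelihood estimate (2.33).
    """
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    k = len(observations[0])
    for x in observations:
        # Each observation must have exactly one element equal to 1.
        if len(x) != k or sum(x) != 1 or any(v not in (0, 1) for v in x):
            raise ValueError("observations must be 1-of-K coded")
    counts = [sum(x[j] for x in observations) for j in range(k)]
    mu_ml = [m / n for m in counts]
    return counts, mu_ml


# Toy data set (an assumption for illustration): N = 5, K = 3.
data = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
counts, mu_ml = ml_estimate(data)
print(counts)       # [3, 1, 1], which sums to N = 5 as required by (2.36)
print(mu_ml)        # [0.6, 0.2, 0.2]
```

Because the counts sum to $N$, dividing each by $N$ yields estimates that lie on the simplex $\sum_k \mu_k = 1$ without any further normalization.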

2.2.1 The Dirichlet distribution

We now introduce a family of prior distributions for the parameters $\{\mu_k\}$ of the multinomial distribution (2.34). By inspection of the form of the multinomial distribution, we see that the conjugate prior is given by

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} \tag{2.37}$$

where $0 \leqslant \mu_k \leqslant 1$ and $\sum_k \mu_k = 1$. Here $\alpha_1, \ldots, \alpha_K$ are the parameters of the distribution, and $\boldsymbol{\alpha}$ denotes $(\alpha_1, \ldots, \alpha_K)^{\mathrm{T}}$. Note that, because of the summation constraint, the distribution over the space of the $\{\mu_k\}$ is confined to a simplex of dimensionality $K - 1$, as illustrated for $K = 3$ in Figure 2.4.
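The normalized form for this distribution (see Exercise 2.9) is given by

$$\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} \tag{2.38}$$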

which is called the Dirichlet distribution. Here $\Gamma(x)$ is the gamma function defined by (1.141) while

$$\alpha_0 = \sum_{k=1}^{K} \alpha_k. \tag{2.39}$$
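To make (2.38) and (2.39) concrete, the following sketch evaluates the Dirichlet log density with the standard-library function `math.lgamma`, working in log space so that large $\alpha_k$ do not overflow the gamma function. The function name `dirichlet_log_pdf`, the numerical tolerance, and the restriction to strictly positive $\mu_k$ (which avoids taking the logarithm of zero) are assumptions of this illustration.

```python
from math import exp, lgamma, log


def dirichlet_log_pdf(mu, alpha, tol=1e-9):
    """Log of Dir(mu | alpha) as given by (2.38).

    Assumes every alpha_k > 0 and every mu_k > 0, with the mu_k
    summing to one up to the tolerance tol.
    """
    if len(mu) != len(alpha):
        raise ValueError("mu and alpha must have the same length K")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha_k must be positive")
    if any(m <= 0 for m in mu) or abs(sum(mu) - 1.0) > tol:
        raise ValueError("mu must lie in the interior of the simplex")
    alpha0 = sum(alpha)  # (2.39)
    # Log of the normalization coefficient Gamma(alpha0) / prod Gamma(alpha_k).
    log_norm = lgamma(alpha0) - sum(lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * log(m) for a, m in zip(alpha, mu))


# With alpha = (1, 1, 1) the density is constant over the simplex,
# equal to Gamma(3) = 2 at every interior point.
print(exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])))
```

With $\boldsymbol{\alpha} = (2, 1, 1)^{\mathrm{T}}$ the coefficient is $\Gamma(4) / \Gamma(2) = 6$, so the density at $\boldsymbol{\mu} = (0.5, 0.25, 0.25)^{\mathrm{T}}$ is $6 \times 0.5 = 3$.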
