2.2. Multinomial Variables
So, for instance, if we have a variable that can take $K = 6$ states and a particular observation of the variable happens to correspond to the state where $x_3 = 1$, then $\mathbf{x}$ will be represented by

$$\mathbf{x} = (0, 0, 1, 0, 0, 0)^{\mathrm{T}}. \tag{2.25}$$

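As a small illustration (not from the text), the 1-of-K vector in (2.25) can be built in a few lines of Python; the function name one_of_k is my own.

```python
import numpy as np

def one_of_k(state, K):
    """Return the 1-of-K (one-hot) vector for a variable observed in `state` (1-based)."""
    x = np.zeros(K, dtype=int)
    x[state - 1] = 1
    return x

# A K = 6 variable observed in the state x_3 = 1, as in (2.25):
print(one_of_k(3, K=6))  # [0 0 1 0 0 0]
```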

Note that such vectors satisfy $\sum_{k=1}^{K} x_k = 1$. If we denote the probability of $x_k = 1$ by the parameter $\mu_k$, then the distribution of $\mathbf{x}$ is given by

$$p(\mathbf{x}|\boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k} \tag{2.26}$$

where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}}$, and the parameters $\mu_k$ are constrained to satisfy $\mu_k \geqslant 0$ and $\sum_k \mu_k = 1$, because they represent probabilities. The distribution (2.26) can be regarded as a generalization of the Bernoulli distribution to more than two outcomes. It is easily seen that the distribution is normalized,

$$\sum_{\mathbf{x}} p(\mathbf{x}|\boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1 \tag{2.27}$$

and that

$$\mathbb{E}[\mathbf{x}|\boldsymbol{\mu}] = \sum_{\mathbf{x}} p(\mathbf{x}|\boldsymbol{\mu})\,\mathbf{x} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}} = \boldsymbol{\mu}. \tag{2.28}$$

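A minimal numerical check of (2.26)-(2.28), assuming NumPy is available; the parameter values and the helper p below are illustrative only and not part of the text.

```python
import numpy as np

mu = np.array([0.1, 0.2, 0.05, 0.3, 0.15, 0.2])  # example parameters: mu_k >= 0, summing to 1
K = len(mu)

def p(x, mu):
    """p(x|mu) = prod_k mu_k^{x_k} for a 1-of-K vector x, as in (2.26)."""
    return np.prod(mu ** x)

one_hot_states = np.eye(K, dtype=int)  # the K possible values of x

# Normalization (2.27): summing p(x|mu) over the K possible states gives 1.
print(sum(p(x, mu) for x in one_hot_states))      # 1.0

# Expectation (2.28): E[x|mu] = sum_x p(x|mu) x = mu.
print(sum(p(x, mu) * x for x in one_hot_states))  # equals mu
```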
Now consider a data set $\mathcal{D}$ of $N$ independent observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$. The corresponding likelihood function takes the form

$$p(\mathcal{D}|\boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\left(\sum_n x_{nk}\right)} = \prod_{k=1}^{K} \mu_k^{m_k}. \tag{2.29}$$

We see that the likelihood function depends on the $N$ data points only through the $K$ quantities

$$m_k = \sum_n x_{nk} \tag{2.30}$$

which represent the number of observations of $x_k = 1$. These are called the sufficient statistics for this distribution (Section 2.4).
In order to find the maximum likelihood solution for $\boldsymbol{\mu}$, we need to maximize $\ln p(\mathcal{D}|\boldsymbol{\mu})$ with respect to $\mu_k$, taking account of the constraint that the $\mu_k$ must sum to one. This can be achieved using a Lagrange multiplier $\lambda$ and maximizing (Appendix E)

$$\sum_{k=1}^{K} m_k \ln \mu_k + \lambda \left( \sum_{k=1}^{K} \mu_k - 1 \right). \tag{2.31}$$


Setting the derivative of (2.31) with respect to $\mu_k$ to zero, we obtain

$$\mu_k = -m_k / \lambda. \tag{2.32}$$
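The section stops at (2.32); substituting it into the constraint $\sum_k \mu_k = 1$ fixes $\lambda = -N$ and hence $\mu_k^{\mathrm{ML}} = m_k / N$, which is the step taken next in the text. The sketch below, with made-up counts, computes this estimate and checks that it scores at least as well as one other point of the simplex.

```python
import numpy as np

m = np.array([96, 207, 51, 298, 148, 200])  # illustrative counts m_k
N = m.sum()

# Substituting (2.32) into the constraint sum_k mu_k = 1 gives lambda = -N,
# so the maximum likelihood estimate is mu_k = m_k / N.
mu_ml = m / N
print(mu_ml, mu_ml.sum())  # a valid probability vector summing to 1

def log_lik(mu):
    """Log-likelihood sum_k m_k ln mu_k, from (2.29)."""
    return np.sum(m * np.log(mu))

# Sanity check: the estimate is no worse than, e.g., the uniform distribution.
print(log_lik(mu_ml) >= log_lik(np.full(len(m), 1.0 / len(m))))  # True
```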