2.2. Multinomial Variables
So, for instance, if we have a variable that can take $K = 6$ states and a particular observation of the variable happens to correspond to the state where $x_3 = 1$, then $\mathbf{x}$ will be represented by

$$\mathbf{x} = (0, 0, 1, 0, 0, 0)^{\mathrm{T}}. \tag{2.25}$$

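As a small illustration (not from the text), the 1-of-K vector in (2.25) can be built in a few lines of Python; the function name one_of_k is my own.

```python
import numpy as np

def one_of_k(state, K):
    """Return the 1-of-K (one-hot) vector for a variable observed in `state` (1-based)."""
    x = np.zeros(K, dtype=int)
    x[state - 1] = 1
    return x

# A K = 6 variable observed in the state x_3 = 1, as in (2.25):
print(one_of_k(3, K=6))  # [0 0 1 0 0 0]
```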

Note that such vectors satisfy $\sum_{k=1}^{K} x_k = 1$. If we denote the probability of $x_k = 1$ by the parameter $\mu_k$, then the distribution of $\mathbf{x}$ is given by

$$p(\mathbf{x}|\boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k} \tag{2.26}$$

where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}}$, and the parameters $\mu_k$ are constrained to satisfy $\mu_k \geqslant 0$ and $\sum_k \mu_k = 1$, because they represent probabilities. The distribution (2.26) can be regarded as a generalization of the Bernoulli distribution to more than two outcomes. It is easily seen that the distribution is normalized,

$$\sum_{\mathbf{x}} p(\mathbf{x}|\boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1 \tag{2.27}$$

and that

$$\mathbb{E}[\mathbf{x}|\boldsymbol{\mu}] = \sum_{\mathbf{x}} p(\mathbf{x}|\boldsymbol{\mu})\,\mathbf{x} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}} = \boldsymbol{\mu}. \tag{2.28}$$

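A minimal numerical check of (2.26)-(2.28), assuming NumPy is available; the parameter values and the helper p below are illustrative only and not part of the text.

```python
import numpy as np

mu = np.array([0.1, 0.2, 0.05, 0.3, 0.15, 0.2])  # example parameters: mu_k >= 0, summing to 1
K = len(mu)

def p(x, mu):
    """p(x|mu) = prod_k mu_k^{x_k} for a 1-of-K vector x, as in (2.26)."""
    return np.prod(mu ** x)

one_hot_states = np.eye(K, dtype=int)  # the K possible values of x

# Normalization (2.27): summing p(x|mu) over the K possible states gives 1.
print(sum(p(x, mu) for x in one_hot_states))      # 1.0

# Expectation (2.28): E[x|mu] = sum_x p(x|mu) x = mu.
print(sum(p(x, mu) * x for x in one_hot_states))  # equals mu
```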
Now consider a data set $\mathcal{D}$ of $N$ independent observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$. The corresponding likelihood function takes the form

$$p(\mathcal{D}|\boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\left(\sum_n x_{nk}\right)} = \prod_{k=1}^{K} \mu_k^{m_k}. \tag{2.29}$$

We see that the likelihood function depends on the $N$ data points only through the $K$ quantities

$$m_k = \sum_n x_{nk} \tag{2.30}$$

which represent the number of observations of $x_k = 1$. These are called the sufficient statistics for this distribution (Section 2.4).
In order to find the maximum likelihood solution for $\boldsymbol{\mu}$, we need to maximize $\ln p(\mathcal{D}|\boldsymbol{\mu})$ with respect to $\mu_k$, taking account of the constraint that the $\mu_k$ must sum to one. This can be achieved using a Lagrange multiplier $\lambda$ and maximizing (Appendix E)

$$\sum_{k=1}^{K} m_k \ln \mu_k + \lambda \left( \sum_{k=1}^{K} \mu_k - 1 \right). \tag{2.31}$$


Setting the derivative of (2.31) with respect to $\mu_k$ to zero, we obtain

$$\mu_k = -m_k / \lambda. \tag{2.32}$$
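The section stops at (2.32); substituting it into the constraint $\sum_k \mu_k = 1$ fixes $\lambda = -N$ and hence $\mu_k^{\mathrm{ML}} = m_k / N$, which is the step taken next in the text. The sketch below, with made-up counts, computes this estimate and checks that it scores at least as well as one other point of the simplex.

```python
import numpy as np

m = np.array([96, 207, 51, 298, 148, 200])  # illustrative counts m_k
N = m.sum()

# Substituting (2.32) into the constraint sum_k mu_k = 1 gives lambda = -N,
# so the maximum likelihood estimate is mu_k = m_k / N.
mu_ml = m / N
print(mu_ml, mu_ml.sum())  # a valid probability vector summing to 1

def log_lik(mu):
    """Log-likelihood sum_k m_k ln mu_k, from (2.29)."""
    return np.sum(m * np.log(mu))

# Sanity check: the estimate is no worse than, e.g., the uniform distribution.
print(log_lik(mu_ml) >= log_lik(np.full(len(m), 1.0 / len(m))))  # True
```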