So, for instance, if we have a variable that can take $K = 6$ states and a particular observation of the variable happens to correspond to the state where $x_3 = 1$, then $\mathbf{x}$ will be represented by

$$\mathbf{x} = (0, 0, 1, 0, 0, 0)^{\mathrm{T}}. \tag{2.25}$$
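As a concrete illustration, the following minimal Python sketch builds such a 1-of-$K$ coded vector (NumPy is assumed, and the helper name `one_hot` is ours, chosen for illustration):

```python
import numpy as np

def one_hot(k, K):
    """Return the 1-of-K coded vector with a 1 in position k (0-based)."""
    x = np.zeros(K)
    x[k] = 1.0
    return x

# The state x_3 = 1 of a K = 6 variable corresponds to 0-based index 2.
x = one_hot(2, 6)
print(x)  # [0. 0. 1. 0. 0. 0.]
```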
Note that such vectors satisfy $\sum_{k=1}^{K} x_k = 1$. If we denote the probability of $x_k = 1$ by the parameter $\mu_k$, then the distribution of $\mathbf{x}$ is given by

$$p(\mathbf{x}|\boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k} \tag{2.26}$$

where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}}$, and the parameters $\mu_k$ are constrained to satisfy $\mu_k \geqslant 0$ and $\sum_k \mu_k = 1$, because they represent probabilities. The distribution (2.26) can be regarded as a generalization of the Bernoulli distribution to more than two outcomes.
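Because $\mathbf{x}$ is a 1-of-$K$ vector, the product in (2.26) reduces to the single factor $\mu_k$ for which $x_k = 1$. Here is a short sketch of evaluating (2.26), using the illustrative name `categorical_pmf` (not a library routine):

```python
import numpy as np

def categorical_pmf(x, mu):
    """Evaluate (2.26): p(x|mu) = prod_k mu_k^{x_k} for a 1-of-K vector x."""
    return float(np.prod(mu ** x))

mu = np.array([0.1, 0.2, 0.4, 0.1, 0.1, 0.1])  # mu_k >= 0 and sum to 1
x = np.array([0., 0., 1., 0., 0., 0.])         # the vector of (2.25)
print(categorical_pmf(x, mu))                  # 0.4, i.e. mu_3
```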
It is easily seen that the distribution is normalized, because the sum over $\mathbf{x}$ runs over the $K$ possible 1-of-$K$ vectors, so that

$$\sum_{\mathbf{x}} p(\mathbf{x}|\boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1 \tag{2.27}$$

and that

$$\mathbb{E}[\mathbf{x}|\boldsymbol{\mu}] = \sum_{\mathbf{x}} p(\mathbf{x}|\boldsymbol{\mu})\,\mathbf{x} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}} = \boldsymbol{\mu}. \tag{2.28}$$
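Both properties can be checked numerically, since the sum over $\mathbf{x}$ involves only the $K$ one-hot vectors; a small sketch under the same assumptions as above:

```python
import numpy as np

mu = np.array([0.1, 0.2, 0.4, 0.1, 0.1, 0.1])
K = len(mu)
states = np.eye(K)  # the K possible 1-of-K vectors, one per row

# Normalization (2.27): the K states contribute mu_1 + ... + mu_K = 1.
probs = np.array([np.prod(mu ** x) for x in states])
print(probs.sum())     # 1.0 (up to floating-point rounding)

# Mean (2.28): E[x|mu] = sum_x p(x|mu) x recovers mu itself.
print(probs @ states)  # [0.1 0.2 0.4 0.1 0.1 0.1]
```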
Now consider a data set $\mathcal{D}$ of $N$ independent observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$. The corresponding likelihood function takes the form

$$p(\mathcal{D}|\boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\left(\sum_n x_{nk}\right)} = \prod_{k=1}^{K} \mu_k^{m_k}. \tag{2.29}$$

We see that the likelihood function depends on the $N$ data points only through the $K$ quantities

$$m_k = \sum_{n} x_{nk} \tag{2.30}$$

which represent the number of observations of $x_k = 1$. These are called the sufficient statistics for this distribution (Section 2.4).
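As a sketch of these ideas, the following Python fragment draws synthetic 1-of-$K$ observations and computes the counts $m_k$; the sampling scheme, seed, and names are ours, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 6, 1000
mu_true = np.array([0.1, 0.2, 0.4, 0.1, 0.1, 0.1])

# Draw N observations in 1-of-K coding: X has shape (N, K) with one-hot rows.
X = np.eye(K)[rng.choice(K, size=N, p=mu_true)]

# Sufficient statistics (2.30): the per-state counts m_k = sum_n x_nk.
m = X.sum(axis=0)

# Log-likelihood from (2.29): ln p(D|mu) = sum_k m_k ln mu_k, which depends
# on the data only through the counts m.
def log_likelihood(mu, m):
    return float(m @ np.log(mu))

print(m, log_likelihood(mu_true, m))
```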
In order to find the maximum likelihood solution for $\boldsymbol{\mu}$, we need to maximize $\ln p(\mathcal{D}|\boldsymbol{\mu})$ with respect to $\mu_k$, taking account of the constraint that the $\mu_k$ must sum to one. This can be achieved using a Lagrange multiplier $\lambda$ (Appendix E) and maximizing

$$\sum_{k=1}^{K} m_k \ln \mu_k + \lambda \left( \sum_{k=1}^{K} \mu_k - 1 \right). \tag{2.31}$$

Setting the derivative of (2.31) with respect to $\mu_k$ to zero, we obtain

$$\mu_k = -m_k / \lambda. \tag{2.32}$$
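Substituting (2.32) into the constraint $\sum_k \mu_k = 1$ gives $\lambda = -N$, so the maximum likelihood solution is $\mu_k^{\mathrm{ML}} = m_k / N$, the fraction of the $N$ observations for which $x_k = 1$. A quick numerical check of this stationarity condition, using hypothetical counts:

```python
import numpy as np

# Hypothetical counts m_k for K = 6 states and N = 1000 observations.
m = np.array([96., 204., 402., 99., 101., 98.])
N = m.sum()

# Substituting (2.32) into the constraint sum_k mu_k = 1 gives lambda = -N,
# so the maximum likelihood solution is mu_k = m_k / N.
mu_ml = m / N
print(mu_ml)

# Stationarity check for (2.31): the derivative m_k / mu_k + lambda
# vanishes at mu_ml when lambda = -N.
lam = -N
print(m / mu_ml + lam)  # [0. 0. 0. 0. 0. 0.]
```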