Consider a set of $D$ binary variables $x_i$, where $i = 1, \dots, D$, each of which is governed by a Bernoulli distribution with parameter $\mu_i$, so that

$$p(\mathbf{x}\,|\,\boldsymbol{\mu}) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{(1 - x_i)} \tag{9.44}$$

where $\mathbf{x} = (x_1, \dots, x_D)^{\mathrm{T}}$ and $\boldsymbol{\mu} = (\mu_1, \dots, \mu_D)^{\mathrm{T}}$. We see that the individual variables $x_i$ are independent, given $\boldsymbol{\mu}$. The mean and covariance of this distribution are easily seen to be

$$\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu} \tag{9.45}$$

$$\operatorname{cov}[\mathbf{x}] = \operatorname{diag}\{\mu_i(1 - \mu_i)\}. \tag{9.46}$$
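For concreteness, here is a minimal NumPy sketch of the density (9.44) and the moments (9.45)-(9.46); the names (`bernoulli_pdf`, `mu`) are introduced here for illustration and are not part of the text.

```python
import numpy as np

def bernoulli_pdf(x, mu):
    """Factorized Bernoulli density p(x|mu) of equation (9.44).

    x and mu are length-D arrays; each x_i is 0 or 1.
    """
    return np.prod(mu**x * (1.0 - mu)**(1 - x))

# Moments (9.45)-(9.46): the mean is mu itself, and the covariance
# is diagonal because the x_i are independent given mu.
mu = np.array([0.2, 0.7, 0.5])
mean = mu
cov = np.diag(mu * (1.0 - mu))
```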
Now let us consider a finite mixture of these distributions given by

$$p(\mathbf{x}\,|\,\boldsymbol{\mu}, \boldsymbol{\pi}) = \sum_{k=1}^{K} \pi_k \, p(\mathbf{x}\,|\,\boldsymbol{\mu}_k) \tag{9.47}$$

where $\boldsymbol{\mu} = \{\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K\}$, $\boldsymbol{\pi} = \{\pi_1, \dots, \pi_K\}$, and

$$p(\mathbf{x}\,|\,\boldsymbol{\mu}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{(1 - x_i)}. \tag{9.48}$$

The mean and covariance of this mixture distribution are given by (Exercise 9.12)

$$\mathbb{E}[\mathbf{x}] = \sum_{k=1}^{K} \pi_k \boldsymbol{\mu}_k \tag{9.49}$$

$$\operatorname{cov}[\mathbf{x}] = \sum_{k=1}^{K} \pi_k \left\{ \boldsymbol{\Sigma}_k + \boldsymbol{\mu}_k \boldsymbol{\mu}_k^{\mathrm{T}} \right\} - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{x}]^{\mathrm{T}} \tag{9.50}$$

where $\boldsymbol{\Sigma}_k = \operatorname{diag}\{\mu_{ki}(1 - \mu_{ki})\}$.
Because the covariance matrix $\operatorname{cov}[\mathbf{x}]$ is no longer diagonal, the mixture distribution can capture correlations between the variables, unlike a single Bernoulli distribution.
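To see the effect of (9.49) and (9.50) numerically, the following sketch (NumPy; the helper `mixture_moments` is a hypothetical name, not from the text) computes the mixture moments and shows that the off-diagonal entries of $\operatorname{cov}[\mathbf{x}]$ are nonzero:

```python
import numpy as np

def mixture_moments(pi, mus):
    """Mean and covariance of a Bernoulli mixture, equations (9.49)-(9.50).

    pi  : length-K array of mixing coefficients (summing to one)
    mus : K x D array whose k-th row is mu_k
    """
    mean = pi @ mus                                   # E[x] = sum_k pi_k mu_k
    cov = -np.outer(mean, mean)                       # - E[x] E[x]^T
    for pi_k, mu_k in zip(pi, mus):
        sigma_k = np.diag(mu_k * (1.0 - mu_k))        # Sigma_k = diag{mu_ki (1 - mu_ki)}
        cov += pi_k * (sigma_k + np.outer(mu_k, mu_k))
    return mean, cov

pi = np.array([0.5, 0.5])
mus = np.array([[0.9, 0.9],    # one component concentrated near (1, 1)
                [0.1, 0.1]])   # one component concentrated near (0, 0)
mean, cov = mixture_moments(pi, mus)
print(cov)   # off-diagonal entries equal 0.16: the variables are correlated
```

Each component taken alone has a diagonal covariance; the correlation arises entirely from mixing components with different means.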
If we are given a data set $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ then the log likelihood function for this model is given by

$$\ln p(\mathbf{X}\,|\,\boldsymbol{\mu}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, p(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k) \right\}. \tag{9.51}$$
Again we see the appearance of the summation inside the logarithm, so that the
maximum likelihood solution no longer has closed form.
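As an aside on evaluating (9.51) in practice, the product over $D$ Bernoulli factors underflows easily, so the sum inside the logarithm is best computed in log space. A sketch, assuming NumPy and SciPy are available (the function name `log_likelihood` is illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(X, pi, mus, eps=1e-12):
    """Log likelihood (9.51) of a Bernoulli mixture, computed in log space.

    X   : N x D binary data matrix
    pi  : length-K mixing coefficients
    mus : K x D component means
    """
    mus = np.clip(mus, eps, 1.0 - eps)    # guard against log(0)
    # log p(x_n | mu_k) for all n and k, as an N x K matrix
    log_px = X @ np.log(mus).T + (1 - X) @ np.log(1 - mus).T
    # ln sum_k pi_k p(x_n | mu_k), evaluated stably with log-sum-exp
    return logsumexp(log_px + np.log(pi), axis=1).sum()
```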
We now derive the EM algorithm for maximizing the likelihood function for
the mixture of Bernoulli distributions. To do this, we first introduce an explicit latent