
Consider a set of D binary variables x_i, where i = 1, ..., D, each of which is
governed by a Bernoulli distribution with parameter μ_i, so that

    p(x|μ) = ∏_{i=1}^{D} μ_i^{x_i} (1 − μ_i)^{(1 − x_i)}                    (9.44)

where x = (x_1, ..., x_D)^T and μ = (μ_1, ..., μ_D)^T. We see that the individual
variables x_i are independent, given μ. The mean and covariance of this distribution
are easily seen to be

    E[x] = μ                                                                (9.45)
    cov[x] = diag{μ_i(1 − μ_i)}.                                            (9.46)
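As a concrete illustration of (9.44)–(9.46), here is a minimal NumPy sketch that evaluates the density and its moments for a small example; the function name bernoulli_prob and the numerical values are illustrative choices, not taken from the text.

```python
import numpy as np

def bernoulli_prob(x, mu):
    """Evaluate p(x|mu) of (9.44) for a binary vector x and parameter vector mu."""
    return np.prod(mu**x * (1.0 - mu)**(1 - x))

# Illustrative example with D = 3 binary variables (values chosen arbitrarily).
mu = np.array([0.2, 0.7, 0.5])
x = np.array([1, 0, 1])

print(bernoulli_prob(x, mu))        # 0.2 * 0.3 * 0.5 = 0.03
print(mu)                           # mean E[x] = mu, equation (9.45)
print(np.diag(mu * (1.0 - mu)))     # cov[x] = diag{mu_i(1 - mu_i)}, equation (9.46)
```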

Now let us consider a finite mixture of these distributions given by

    p(x|μ, π) = ∑_{k=1}^{K} π_k p(x|μ_k)                                    (9.47)

where μ = {μ_1, ..., μ_K}, π = {π_1, ..., π_K}, and

    p(x|μ_k) = ∏_{i=1}^{D} μ_{ki}^{x_i} (1 − μ_{ki})^{(1 − x_i)}.           (9.48)
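A minimal sketch of how (9.47) and (9.48) could be evaluated numerically, assuming NumPy; the array layout (one row of mus per component) and the name bernoulli_mixture_prob are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def bernoulli_mixture_prob(x, mus, pis):
    """Evaluate the mixture density (9.47), with components given by (9.48).

    x   : (D,) binary vector
    mus : (K, D) array, row k holding mu_k
    pis : (K,) mixing coefficients, summing to one
    """
    comp = np.prod(mus**x * (1.0 - mus)**(1 - x), axis=1)  # p(x|mu_k) for each k
    return np.dot(pis, comp)

# Illustrative example with K = 2 components over D = 3 variables.
mus = np.array([[0.9, 0.1, 0.8],
                [0.2, 0.7, 0.3]])
pis = np.array([0.6, 0.4])
x = np.array([1, 0, 1])
print(bernoulli_mixture_prob(x, mus, pis))
```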

Exercise 9.12  The mean and covariance of this mixture distribution are given by

    E[x] = ∑_{k=1}^{K} π_k μ_k                                              (9.49)

    cov[x] = ∑_{k=1}^{K} π_k {Σ_k + μ_k μ_k^T} − E[x] E[x]^T                (9.50)

where Σ_k = diag{μ_{ki}(1 − μ_{ki})}. Because the covariance matrix cov[x] is no
longer diagonal, the mixture distribution can capture correlations between the vari-
ables, unlike a single Bernoulli distribution.
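The following sketch computes (9.49) and (9.50) directly for a small two-component example, again assuming NumPy; the helper name mixture_mean_cov and the parameter values are illustrative. The printed covariance has non-zero off-diagonal entries, consistent with the remark above.

```python
import numpy as np

def mixture_mean_cov(mus, pis):
    """Mean (9.49) and covariance (9.50) of a Bernoulli mixture.

    mus : (K, D) component means, pis : (K,) mixing coefficients.
    """
    mean = pis @ mus                                  # sum_k pi_k mu_k
    cov = -np.outer(mean, mean)                       # the -E[x]E[x]^T term
    for pi_k, mu_k in zip(pis, mus):
        Sigma_k = np.diag(mu_k * (1.0 - mu_k))
        cov += pi_k * (Sigma_k + np.outer(mu_k, mu_k))
    return mean, cov

mus = np.array([[0.9, 0.1],
                [0.2, 0.8]])
pis = np.array([0.5, 0.5])
mean, cov = mixture_mean_cov(mus, pis)
print(cov)   # off-diagonal entries are non-zero: the mixture captures correlations
```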
If we are given a data set X = {x_1, ..., x_N} then the log likelihood function
for this model is given by

    ln p(X|μ, π) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k p(x_n|μ_k) }.           (9.51)


Again we see the appearance of the summation inside the logarithm, so that the
maximum likelihood solution no longer has closed form.
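Although its maximum has no closed form, the log likelihood (9.51) itself is straightforward to evaluate. A minimal sketch using NumPy and SciPy's logsumexp for numerical stability; the function name and the small eps guard are illustrative choices, not part of the text.

```python
import numpy as np
from scipy.special import logsumexp

def bernoulli_mixture_log_likelihood(X, mus, pis):
    """Log likelihood (9.51) for binary data X of shape (N, D).

    Works in log space and uses log-sum-exp over components to avoid
    underflow when D is large.
    """
    eps = 1e-12                          # guard against log(0) if some mu_ki is 0 or 1
    log_mu = np.log(mus + eps)           # (K, D)
    log_1m = np.log(1.0 - mus + eps)     # (K, D)
    # log p(x_n|mu_k) for all n and k: (N, K)
    log_comp = X @ log_mu.T + (1 - X) @ log_1m.T
    return np.sum(logsumexp(np.log(pis) + log_comp, axis=1))
```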
We now derive the EM algorithm for maximizing the likelihood function for
the mixture of Bernoulli distributions. To do this, we first introduce an explicit latent
variable z associated with each instance of x.