
Consider a set of D binary variables x_i, where i = 1, ..., D, each of which is
governed by a Bernoulli distribution with parameter μ_i, so that

    p(x|μ) = ∏_{i=1}^{D} μ_i^{x_i} (1 − μ_i)^{(1 − x_i)}                    (9.44)

where x = (x_1, ..., x_D)^T and μ = (μ_1, ..., μ_D)^T. We see that the individual
variables x_i are independent, given μ. The mean and covariance of this distribution
are easily seen to be

    E[x] = μ                                                                (9.45)
    cov[x] = diag{μ_i(1 − μ_i)}.                                            (9.46)
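As a concrete illustration of (9.44)–(9.46), here is a minimal NumPy sketch that evaluates the density and its moments for a small example; the function name bernoulli_prob and the numerical values are illustrative choices, not taken from the text.

```python
import numpy as np

def bernoulli_prob(x, mu):
    """Evaluate p(x|mu) of (9.44) for a binary vector x and parameter vector mu."""
    return np.prod(mu**x * (1.0 - mu)**(1 - x))

# Illustrative example with D = 3 binary variables (values chosen arbitrarily).
mu = np.array([0.2, 0.7, 0.5])
x = np.array([1, 0, 1])

print(bernoulli_prob(x, mu))        # 0.2 * 0.3 * 0.5 = 0.03
print(mu)                           # mean E[x] = mu, equation (9.45)
print(np.diag(mu * (1.0 - mu)))     # cov[x] = diag{mu_i(1 - mu_i)}, equation (9.46)
```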

Now let us consider a finite mixture of these distributions given by

    p(x|μ, π) = ∑_{k=1}^{K} π_k p(x|μ_k)                                    (9.47)

where μ = {μ_1, ..., μ_K}, π = {π_1, ..., π_K}, and

    p(x|μ_k) = ∏_{i=1}^{D} μ_{ki}^{x_i} (1 − μ_{ki})^{(1 − x_i)}.           (9.48)
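A minimal sketch of how (9.47) and (9.48) could be evaluated numerically, assuming NumPy; the array layout (one row of mus per component) and the name bernoulli_mixture_prob are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def bernoulli_mixture_prob(x, mus, pis):
    """Evaluate the mixture density (9.47), with components given by (9.48).

    x   : (D,) binary vector
    mus : (K, D) array, row k holding mu_k
    pis : (K,) mixing coefficients, summing to one
    """
    comp = np.prod(mus**x * (1.0 - mus)**(1 - x), axis=1)  # p(x|mu_k) for each k
    return np.dot(pis, comp)

# Illustrative example with K = 2 components over D = 3 variables.
mus = np.array([[0.9, 0.1, 0.8],
                [0.2, 0.7, 0.3]])
pis = np.array([0.6, 0.4])
x = np.array([1, 0, 1])
print(bernoulli_mixture_prob(x, mus, pis))
```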

Exercise 9.12  The mean and covariance of this mixture distribution are given by

    E[x] = ∑_{k=1}^{K} π_k μ_k                                              (9.49)

    cov[x] = ∑_{k=1}^{K} π_k {Σ_k + μ_k μ_k^T} − E[x] E[x]^T                (9.50)

where Σ_k = diag{μ_{ki}(1 − μ_{ki})}. Because the covariance matrix cov[x] is no
longer diagonal, the mixture distribution can capture correlations between the vari-
ables, unlike a single Bernoulli distribution.
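The following sketch computes (9.49) and (9.50) directly for a small two-component example, again assuming NumPy; the helper name mixture_mean_cov and the parameter values are illustrative. The printed covariance has non-zero off-diagonal entries, consistent with the remark above.

```python
import numpy as np

def mixture_mean_cov(mus, pis):
    """Mean (9.49) and covariance (9.50) of a Bernoulli mixture.

    mus : (K, D) component means, pis : (K,) mixing coefficients.
    """
    mean = pis @ mus                                  # sum_k pi_k mu_k
    cov = -np.outer(mean, mean)                       # the -E[x]E[x]^T term
    for pi_k, mu_k in zip(pis, mus):
        Sigma_k = np.diag(mu_k * (1.0 - mu_k))
        cov += pi_k * (Sigma_k + np.outer(mu_k, mu_k))
    return mean, cov

mus = np.array([[0.9, 0.1],
                [0.2, 0.8]])
pis = np.array([0.5, 0.5])
mean, cov = mixture_mean_cov(mus, pis)
print(cov)   # off-diagonal entries are non-zero: the mixture captures correlations
```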
If we are given a data set X = {x_1, ..., x_N} then the log likelihood function
for this model is given by

    ln p(X|μ, π) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k p(x_n|μ_k) }.           (9.51)


Again we see the appearance of the summation inside the logarithm, so that the
maximum likelihood solution no longer has closed form.
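Although its maximum has no closed form, the log likelihood (9.51) itself is straightforward to evaluate. A minimal sketch using NumPy and SciPy's logsumexp for numerical stability; the function name and the small eps guard are illustrative choices, not part of the text.

```python
import numpy as np
from scipy.special import logsumexp

def bernoulli_mixture_log_likelihood(X, mus, pis):
    """Log likelihood (9.51) for binary data X of shape (N, D).

    Works in log space and uses log-sum-exp over components to avoid
    underflow when D is large.
    """
    eps = 1e-12                          # guard against log(0) if some mu_ki is 0 or 1
    log_mu = np.log(mus + eps)           # (K, D)
    log_1m = np.log(1.0 - mus + eps)     # (K, D)
    # log p(x_n|mu_k) for all n and k: (N, K)
    log_comp = X @ log_mu.T + (1 - X) @ log_1m.T
    return np.sum(logsumexp(np.log(pis) + log_comp, axis=1))
```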
We now derive the EM algorithm for maximizing the likelihood function for
the mixture of Bernoulli distributions. To do this, we first introduce an explicit latent
variable z associated with each instance of x.