Understanding Machine Learning: From Theory to Algorithms

24.4 Latent Variables and the EM Algorithm

The second term is the sum of the entropies of the rows of $Q$. Let
$$\mathcal{Q} = \left\{ Q \in [0,1]^{m,k} \;:\; \forall i, \ \sum_{y=1}^{k} Q_{i,y} = 1 \right\}$$
be the set of matrices whose rows define probabilities over $[k]$. The following
lemma shows that EM performs alternate maximization iterations for maximizing $G$.
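As a small aside (an illustration added here, not part of the original text), a member of $\mathcal{Q}$ can be produced from any nonnegative $m \times k$ matrix by normalizing each row, for example in NumPy:

```python
import numpy as np

# Hypothetical illustration: build a member of the set Q (each row is a
# probability vector over [k]) by row-normalizing a nonnegative matrix.
m, k = 4, 3
A = np.random.default_rng(1).random((m, k))   # arbitrary nonnegative m-by-k matrix
Q = A / A.sum(axis=1, keepdims=True)          # each row now sums to 1
assert np.allclose(Q.sum(axis=1), 1.0) and ((0 <= Q) & (Q <= 1)).all()
```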


Lemma 24.2 The EM procedure can be rewritten as:
$$Q^{(t+1)} = \operatorname*{argmax}_{Q \in \mathcal{Q}} \; G(Q, \theta^{(t)})$$
$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} \; G(Q^{(t+1)}, \theta).$$
Furthermore, $G(Q^{(t+1)}, \theta^{(t)}) = L(\theta^{(t)})$.
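The two argmax steps of the lemma are exactly the E-step and M-step of EM. As an illustration (a sketch added here, not part of the original text), the code below assumes, purely for concreteness, that $P_\theta$ is a one-dimensional Gaussian mixture with $k$ components; the function names and toy data are hypothetical.

```python
import numpy as np

# Sketch of Lemma 24.2 for a (hypothetical) 1-D Gaussian mixture with k components.
# theta = (mixture weights c, means mu, variances var); Q is an m-by-k matrix in the set Q.

def joint(x, c, mu, var):
    # P_theta[X = x_i, Y = y] = c_y * N(x_i; mu_y, var_y), as an (m, k) array.
    return c * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def e_step(x, c, mu, var):
    # Q^(t+1) = argmax_{Q in Q} G(Q, theta^(t)): set Q[i, y] = P_theta[Y = y | X = x_i].
    p = joint(x, c, mu, var)
    return p / p.sum(axis=1, keepdims=True)

def m_step(x, Q):
    # theta^(t+1) = argmax_theta G(Q^(t+1), theta); the entropy term does not depend
    # on theta, so this equals argmax_theta F(Q^(t+1), theta), closed-form for Gaussians.
    n_y = Q.sum(axis=0)
    c = n_y / len(x)
    mu = (Q * x[:, None]).sum(axis=0) / n_y
    var = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / n_y
    return c, mu, var

def log_likelihood(x, c, mu, var):
    # L(theta) = sum_i log P_theta[X = x_i].
    return np.log(joint(x, c, mu, var).sum(axis=1)).sum()

# Toy data drawn from two Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])
c, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for t in range(50):
    Q = e_step(x, c, mu, var)        # Q-update of the lemma (E-step)
    c, mu, var = m_step(x, Q)        # theta-update of the lemma (M-step)
    if t % 10 == 0:
        print(t, round(log_likelihood(x, c, mu, var), 3))
```

Tracking `log_likelihood` across iterations illustrates the monotone improvement that the lemma, combined with the inequality derived in the proof below, guarantees.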


Proof Given $Q^{(t+1)}$, we clearly have that
$$\operatorname*{argmax}_{\theta} \; G(Q^{(t+1)}, \theta) \;=\; \operatorname*{argmax}_{\theta} \; F(Q^{(t+1)}, \theta),$$
since $G$ and $F$ differ only in the entropy term, which does not depend on $\theta$. Therefore, we only need to show that for any $\theta$, the solution of $\operatorname*{argmax}_{Q \in \mathcal{Q}} G(Q, \theta)$
is to set $Q_{i,y} = P_\theta[Y = y \mid X = x_i]$. Indeed, by Jensen's inequality, for any $Q \in \mathcal{Q}$
we have that


$$\begin{aligned}
G(Q,\theta) &= \sum_{i=1}^{m} \left( \sum_{y=1}^{k} Q_{i,y} \log\!\left( \frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}} \right) \right) \\
&\le \sum_{i=1}^{m} \log\!\left( \sum_{y=1}^{k} Q_{i,y} \, \frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}} \right) \\
&= \sum_{i=1}^{m} \log\!\left( \sum_{y=1}^{k} P_\theta[X = x_i, Y = y] \right) \\
&= \sum_{i=1}^{m} \log\left( P_\theta[X = x_i] \right) \;=\; L(\theta),
\end{aligned}$$