Understanding Machine Learning: From Theory to Algorithms

24.4 Latent Variables and the EM Algorithm

The second term is the sum of the entropies of the rows of $Q$. Let
$$\mathcal{Q} = \left\{ Q \in [0,1]^{m,k} \;:\; \forall i, \ \sum_{y=1}^{k} Q_{i,y} = 1 \right\}$$
be the set of matrices whose rows define probabilities over $[k]$. The following
lemma shows that EM performs alternate maximization iterations for maximizing $G$.
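As a small aside (an illustration added here, not part of the original text), a member of $\mathcal{Q}$ can be produced from any nonnegative $m \times k$ matrix by normalizing each row, for example in NumPy:

```python
import numpy as np

# Hypothetical illustration: build a member of the set Q (each row is a
# probability vector over [k]) by row-normalizing a nonnegative matrix.
m, k = 4, 3
A = np.random.default_rng(1).random((m, k))   # arbitrary nonnegative m-by-k matrix
Q = A / A.sum(axis=1, keepdims=True)          # each row now sums to 1
assert np.allclose(Q.sum(axis=1), 1.0) and ((0 <= Q) & (Q <= 1)).all()
```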


Lemma 24.2 The EM procedure can be rewritten as:
$$Q^{(t+1)} = \operatorname*{argmax}_{Q \in \mathcal{Q}} \; G(Q, \theta^{(t)})$$
$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} \; G(Q^{(t+1)}, \theta).$$
Furthermore, $G(Q^{(t+1)}, \theta^{(t)}) = L(\theta^{(t)})$.
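The two argmax steps of the lemma are exactly the E-step and M-step of EM. As an illustration (a sketch added here, not part of the original text), the code below assumes, purely for concreteness, that $P_\theta$ is a one-dimensional Gaussian mixture with $k$ components; the function names and toy data are hypothetical.

```python
import numpy as np

# Sketch of Lemma 24.2 for a (hypothetical) 1-D Gaussian mixture with k components.
# theta = (mixture weights c, means mu, variances var); Q is an m-by-k matrix in the set Q.

def joint(x, c, mu, var):
    # P_theta[X = x_i, Y = y] = c_y * N(x_i; mu_y, var_y), as an (m, k) array.
    return c * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def e_step(x, c, mu, var):
    # Q^(t+1) = argmax_{Q in Q} G(Q, theta^(t)): set Q[i, y] = P_theta[Y = y | X = x_i].
    p = joint(x, c, mu, var)
    return p / p.sum(axis=1, keepdims=True)

def m_step(x, Q):
    # theta^(t+1) = argmax_theta G(Q^(t+1), theta); the entropy term does not depend
    # on theta, so this equals argmax_theta F(Q^(t+1), theta), closed-form for Gaussians.
    n_y = Q.sum(axis=0)
    c = n_y / len(x)
    mu = (Q * x[:, None]).sum(axis=0) / n_y
    var = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / n_y
    return c, mu, var

def log_likelihood(x, c, mu, var):
    # L(theta) = sum_i log P_theta[X = x_i].
    return np.log(joint(x, c, mu, var).sum(axis=1)).sum()

# Toy data drawn from two Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])
c, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for t in range(50):
    Q = e_step(x, c, mu, var)        # Q-update of the lemma (E-step)
    c, mu, var = m_step(x, Q)        # theta-update of the lemma (M-step)
    if t % 10 == 0:
        print(t, round(log_likelihood(x, c, mu, var), 3))
```

Tracking `log_likelihood` across iterations illustrates the monotone improvement that the lemma, combined with the inequality derived in the proof below, guarantees.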


Proof Given $Q^{(t+1)}$, we clearly have that
$$\operatorname*{argmax}_{\theta} \; G(Q^{(t+1)}, \theta) \;=\; \operatorname*{argmax}_{\theta} \; F(Q^{(t+1)}, \theta),$$
since $G$ and $F$ differ only in the entropy term, which does not depend on $\theta$. Therefore, we only need to show that for any $\theta$, the solution of $\operatorname*{argmax}_{Q \in \mathcal{Q}} G(Q, \theta)$
is to set $Q_{i,y} = P_\theta[Y = y \mid X = x_i]$. Indeed, by Jensen's inequality, for any $Q \in \mathcal{Q}$
we have that


$$\begin{aligned}
G(Q,\theta) &= \sum_{i=1}^{m} \left( \sum_{y=1}^{k} Q_{i,y} \log\!\left( \frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}} \right) \right) \\
&\le \sum_{i=1}^{m} \log\!\left( \sum_{y=1}^{k} Q_{i,y} \, \frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}} \right) \\
&= \sum_{i=1}^{m} \log\!\left( \sum_{y=1}^{k} P_\theta[X = x_i, Y = y] \right) \\
&= \sum_{i=1}^{m} \log\left( P_\theta[X = x_i] \right) \;=\; L(\theta),
\end{aligned}$$