24.4 Latent Variables and the EM Algorithm

The second term is the sum of the entropies of the rows of $Q$. Let
$$\mathcal{Q} = \left\{ Q \in [0,1]^{m,k} : \forall i,\ \sum_{y=1}^{k} Q_{i,y} = 1 \right\}$$
be the set of matrices whose rows define probabilities over $[k]$. The following lemma shows that EM performs alternate maximization iterations for maximizing $G$.
Lemma 24.2 The EM procedure can be rewritten as:
$$Q^{(t+1)} = \operatorname*{argmax}_{Q \in \mathcal{Q}} G(Q, \theta^{(t)}), \qquad \theta^{(t+1)} = \operatorname*{argmax}_{\theta} G(Q^{(t+1)}, \theta).$$
Furthermore, $G(Q^{(t+1)}, \theta^{(t)}) = L(\theta^{(t)})$.
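The equality $G(Q^{(t+1)}, \theta^{(t)}) = L(\theta^{(t)})$ can be checked numerically. The sketch below uses an arbitrary illustrative joint table $P_\theta[X = x_i, Y = y]$ (not taken from the text, with $m = 3$, $k = 2$): setting each row of $Q$ to the posterior $P_\theta[Y = y \mid X = x_i]$ makes $G(Q,\theta)$ equal to the log-likelihood $L(\theta)$, while any other row-stochastic $Q$ gives a smaller value, as the Jensen argument in the proof predicts.

```python
# Sanity check for Lemma 24.2 (illustrative sketch; the joint table is
# an arbitrary choice, not from the text).
import math
import random

# Joint probabilities P_theta[X = x_i, Y = y]: rows indexed by i, columns by y.
P = [[0.10, 0.30],
     [0.25, 0.05],
     [0.20, 0.10]]

def log_likelihood(P):
    # L(theta) = sum_i log P_theta[X = x_i] = sum_i log sum_y P[i][y]
    return sum(math.log(sum(row)) for row in P)

def G(Q, P):
    # G(Q, theta) = sum_i sum_y Q_{i,y} log(P[i][y] / Q_{i,y}),
    # with the convention that a term with Q_{i,y} = 0 contributes 0.
    return sum(q * math.log(p / q)
               for Qrow, Prow in zip(Q, P)
               for q, p in zip(Qrow, Prow) if q > 0)

# E-step choice: rows of Q are the posteriors P_theta[Y = y | X = x_i].
posterior = [[p / sum(row) for p in row] for row in P]
assert abs(G(posterior, P) - log_likelihood(P)) < 1e-12

# Any other row-stochastic Q can only decrease G (Jensen's inequality).
random.seed(0)
for _ in range(1000):
    Q = [[a, 1 - a] for a in (random.random() for _ in P)]
    assert G(Q, P) <= log_likelihood(P) + 1e-12
```

Note that the equality case falls out directly: with $Q_{i,y} = P_\theta[X = x_i, Y = y] / P_\theta[X = x_i]$, each ratio inside the logarithm equals $P_\theta[X = x_i]$, so every row of the sum collapses to $\log P_\theta[X = x_i]$.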
Proof Given $Q^{(t+1)}$ we clearly have that
$$\operatorname*{argmax}_{\theta} G(Q^{(t+1)}, \theta) = \operatorname*{argmax}_{\theta} F(Q^{(t+1)}, \theta).$$
Therefore, we only need to show that for any $\theta$, the solution of $\operatorname*{argmax}_{Q \in \mathcal{Q}} G(Q, \theta)$ is to set $Q_{i,y} = P_\theta[Y = y \,|\, X = x_i]$. Indeed, by Jensen's inequality, for any $Q \in \mathcal{Q}$ we have that
$$\begin{aligned}
G(Q, \theta) &= \sum_{i=1}^{m} \left( \sum_{y=1}^{k} Q_{i,y} \log\!\left( \frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}} \right) \right) \\
&\le \sum_{i=1}^{m} \log\!\left( \sum_{y=1}^{k} Q_{i,y} \, \frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}} \right) \\
&= \sum_{i=1}^{m} \log\!\left( \sum_{y=1}^{k} P_\theta[X = x_i, Y = y] \right) \\
&= \sum_{i=1}^{m} \log\left( P_\theta[X = x_i] \right) = L(\theta),
\end{aligned}$$