24.4 Latent Variables and the EM Algorithm 351
The second term is the sum of the entropies of the rows of $Q$. Let
\[
\mathcal{Q} \;=\; \left\{ Q \in [0,1]^{m,k} \;:\; \forall i,\ \sum_{y=1}^{k} Q_{i,y} = 1 \right\}
\]
be the set of matrices whose rows define probability distributions over $[k]$. The following lemma shows that EM performs alternating maximization iterations for maximizing $G$.
Lemma 24.2 \quad The EM procedure can be rewritten as:
\begin{align*}
Q^{(t+1)} &= \operatorname*{argmax}_{Q \in \mathcal{Q}} \; G(Q, \theta^{(t)}) \\
\theta^{(t+1)} &= \operatorname*{argmax}_{\theta} \; G(Q^{(t+1)}, \theta).
\end{align*}
Furthermore, $G(Q^{(t+1)}, \theta^{(t)}) = L(\theta^{(t)})$.
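Before turning to the proof, the lemma's claim can be checked numerically. The following sketch (a toy two-component Gaussian mixture with unit variance; the data, weights, and means are illustrative choices, not taken from the text) verifies that setting each row of $Q$ to the posterior $P_\theta[Y=y \mid X=x_i]$ achieves $G(Q,\theta) = L(\theta)$, while any other row-stochastic $Q$ gives a smaller value of $G$:

```python
import math
import random

# Toy setup (hypothetical values): m = 5 observations, k = 2 components.
xs = [-2.0, -1.5, 0.3, 1.8, 2.2]   # observations
c  = [0.4, 0.6]                    # mixing weights (part of theta)
mu = [-1.0, 2.0]                   # component means (part of theta)

def joint(x, y):
    """P_theta[X = x, Y = y] for a unit-variance Gaussian mixture."""
    return c[y] * math.exp(-0.5 * (x - mu[y]) ** 2) / math.sqrt(2 * math.pi)

def log_likelihood():
    """L(theta) = sum_i log P_theta[X = x_i]."""
    return sum(math.log(sum(joint(x, y) for y in range(2))) for x in xs)

def G(Q):
    """G(Q, theta) = sum_i sum_y Q_{i,y} log(P_theta[X=x_i, Y=y] / Q_{i,y})."""
    return sum(Q[i][y] * math.log(joint(xs[i], y) / Q[i][y])
               for i in range(len(xs)) for y in range(2))

def posterior():
    """The maximizing Q: row i is P_theta[Y = . | X = x_i]."""
    Q = []
    for x in xs:
        z = sum(joint(x, y) for y in range(2))
        Q.append([joint(x, y) / z for y in range(2)])
    return Q

Qstar = posterior()
print(abs(G(Qstar) - log_likelihood()) < 1e-9)   # prints True: G(Q*, theta) = L(theta)

# Any other row-stochastic Q is dominated, by Jensen's inequality.
random.seed(0)
p = random.random()
Qother = [[p, 1 - p] for _ in xs]
print(G(Qother) <= log_likelihood())             # prints True
```

The check mirrors the proof below: with $Q_{i,y}$ equal to the posterior, the ratio inside the logarithm is constant in $y$, so Jensen's inequality holds with equality.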
Proof \quad Given $Q^{(t+1)}$ we clearly have that
\[
\operatorname*{argmax}_{\theta} \; G(Q^{(t+1)}, \theta) \;=\; \operatorname*{argmax}_{\theta} \; F(Q^{(t+1)}, \theta).
\]
Therefore, we only need to show that for any $\theta$, the solution of $\operatorname*{argmax}_{Q \in \mathcal{Q}} G(Q,\theta)$ is to set $Q_{i,y} = P_\theta[Y = y \mid X = x_i]$. Indeed, by Jensen's inequality, for any $Q \in \mathcal{Q}$ we have that
\begin{align*}
G(Q,\theta) &= \sum_{i=1}^{m} \left( \sum_{y=1}^{k} Q_{i,y} \log\!\left( \frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}} \right) \right) \\
&\le \sum_{i=1}^{m} \log\!\left( \sum_{y=1}^{k} Q_{i,y} \, \frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}} \right) \\
&= \sum_{i=1}^{m} \log\!\left( \sum_{y=1}^{k} P_\theta[X = x_i, Y = y] \right) \\
&= \sum_{i=1}^{m} \log\left( P_\theta[X = x_i] \right) \;=\; L(\theta),
\end{align*}