If each row of $Q$ defines a probability over the $i$th latent variable given $X = x_i$,
then we can interpret $F(Q,\theta)$ as the expected log-likelihood of a training set
$(x_1,y_1),\ldots,(x_m,y_m)$, where the expectation is with respect to the choice of
each $y_i$ on the basis of the $i$th row of $Q$. In the definition of $F$, the summation is
outside the log, and we assume that this makes the optimization problem with
respect to $\theta$ tractable:
Assumption 24.1 For any matrix $Q \in [0,1]^{m,k}$, such that each row of $Q$ sums
to 1, the optimization problem
\[
\operatorname*{argmax}_{\theta} \; F(Q,\theta)
\]
is tractable.
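As a concrete illustration of this assumption, in a Gaussian mixture model the maximization of $F(Q,\theta)$ for a fixed $Q$ has a closed-form solution given by $Q$-weighted maximum likelihood. The following is a minimal sketch assuming a spherical-covariance mixture; the function name and array layout are illustrative and not part of the text.

```python
import numpy as np

def m_step_gmm(X, Q):
    """Closed-form maximizer of F(Q, theta) for a spherical-covariance
    Gaussian mixture (illustrative sketch, not the book's notation).

    X : (m, d) array of the points x_1, ..., x_m
    Q : (m, k) array whose i-th row is a distribution over the k latent labels
    Returns mixture weights c, component means mu, and per-component variances.
    """
    m, d = X.shape
    Nk = Q.sum(axis=0)                     # "effective" number of points per component
    c = Nk / m                             # mixture weights
    mu = (Q.T @ X) / Nk[:, None]           # Q-weighted means
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (m, k) squared distances
    var = (Q * sq).sum(axis=0) / (d * Nk)  # Q-weighted spherical variances
    return c, mu, var
```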
The intuitive idea of EM is that we have a "chicken and egg" problem. On one
hand, had we known $Q$, then by our assumption, the optimization problem of
finding the best $\theta$ is tractable. On the other hand, had we known the parameters
$\theta$ we could have set $Q_{i,y}$ to be the probability of $Y = y$ given that $X = x_i$.
The EM algorithm therefore alternates between finding $\theta$ given $Q$ and finding $Q$
given $\theta$. Formally, EM finds a sequence of solutions $(Q^{(1)},\theta^{(1)}), (Q^{(2)},\theta^{(2)}),\ldots$
where at iteration $t$, we construct $(Q^{(t+1)},\theta^{(t+1)})$ by performing two steps.
- Expectation Step: Set
\[
Q^{(t+1)}_{i,y} = P_{\theta^{(t)}}\!\left[Y = y \mid X = x_i\right]. \tag{24.10}
\]
This step is called the Expectation step, because it yields a new probability
over the latent variables, which defines a new expected log-likelihood
function over $\theta$ (a concrete instance is sketched after this list).
- Maximization Step: Set $\theta^{(t+1)}$ to be the maximizer of the expected log-likelihood, where the expectation is according to $Q^{(t+1)}$:
\[
\theta^{(t+1)} = \operatorname*{argmax}_{\theta} \; F(Q^{(t+1)},\theta). \tag{24.11}
\]
By our assumption, it is possible to solve this optimization problem efficiently.
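Continuing the hypothetical Gaussian-mixture example from above, the Expectation step (24.10) amounts to applying Bayes' rule to compute the posterior over the latent label of each point. A minimal, numerically unguarded sketch:

```python
import numpy as np

def e_step_gmm(X, c, mu, var):
    """Q[i, y] = P_theta[Y = y | X = x_i] for the spherical Gaussian mixture
    sketched above (Bayes' rule; the shared (2*pi)^(-d/2) factor cancels
    when each row is normalized)."""
    m, d = X.shape
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)            # (m, k)
    joint = c[None, :] * var[None, :] ** (-d / 2) * np.exp(-sq / (2 * var[None, :]))
    return joint / joint.sum(axis=1, keepdims=True)                      # normalize each row
```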
The initial values of $\theta^{(1)}$ and $Q^{(1)}$ are usually chosen at random, and the
procedure terminates after the improvement in the likelihood value stops being
significant.
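Putting the two steps together, the alternation just described could be sketched as follows. The callback names (`e_step`, `m_step`, `log_likelihood`) and the tolerance-based stopping rule are assumptions made for the sketch, not prescriptions from the text.

```python
import numpy as np

def em(X, e_step, m_step, log_likelihood, theta0, max_iter=100, tol=1e-6):
    """Generic EM loop (sketch).

    e_step(X, theta)         -> Q with Q[i, y] = P_theta[Y = y | X = x_i]   (24.10)
    m_step(X, Q)             -> argmax over theta of F(Q, theta)            (24.11)
    log_likelihood(X, theta) -> scalar used only to decide when to stop
    """
    theta = theta0
    prev = -np.inf
    for _ in range(max_iter):
        Q = e_step(X, theta)            # Expectation step
        theta = m_step(X, Q)            # Maximization step
        ll = log_likelihood(X, theta)
        if ll - prev < tol:             # improvement no longer significant
            break
        prev = ll
    return Q, theta
```

With the Gaussian-mixture sketches above, one could take $\theta = (c,\mu,\sigma^2)$, wrap `e_step_gmm` and `m_step_gmm` to match these signatures, and supply any routine for evaluating the log-likelihood used in the stopping check.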
24.4.1 EM as an Alternate Maximization Algorithm
To analyze the EM algorithm, we first view it as an alternate maximization
algorithm. Define the following objective function:
\[
G(Q,\theta) = F(Q,\theta) - \sum_{i=1}^{m} \sum_{y=1}^{k} Q_{i,y} \log(Q_{i,y}).
\]
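Note that the subtracted sum is the negative Shannon entropy of the rows of $Q$; writing $H(Q_i) = -\sum_{y=1}^{k} Q_{i,y}\log(Q_{i,y})$ (notation introduced here only for this remark), the objective can equivalently be written as
\[
G(Q,\theta) = F(Q,\theta) + \sum_{i=1}^{m} H(Q_i).
\]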