Understanding Machine Learning: From Theory to Algorithms

24.4 Latent Variables and the EM Algorithm

Therefore, the density of $X$ can be written as:

\[
P[X=x] \;=\; \sum_{y=1}^{k} P[Y=y]\, P[X=x \mid Y=y]
\;=\; \sum_{y=1}^{k} c_y \,\frac{1}{(2\pi)^{d/2}\,|\Sigma_y|^{1/2}}
\exp\!\left( -\frac{1}{2} (x-\mu_y)^{T} \Sigma_y^{-1} (x-\mu_y) \right).
\]
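As a numerical illustration (not part of the original text), here is a minimal sketch that evaluates this mixture density; the names `weights`, `means`, and `covs` are assumptions of this example, standing in for $c_y$, $\mu_y$, and $\Sigma_y$.

```python
import numpy as np

def gaussian_mixture_density(x, weights, means, covs):
    """Evaluate P[X = x] = sum_y c_y * N(x; mu_y, Sigma_y) for a k-component mixture.

    weights: shape (k,)      -- mixture coefficients c_y (nonnegative, summing to 1)
    means:   shape (k, d)    -- component means mu_y
    covs:    shape (k, d, d) -- component covariance matrices Sigma_y
    """
    d = x.shape[0]
    density = 0.0
    for c, mu, sigma in zip(weights, means, covs):
        diff = x - mu
        norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
        quad = diff @ np.linalg.solve(sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
        density += c * np.exp(-0.5 * quad) / norm_const
    return density

# Example: an equal-weight mixture of two 2-dimensional Gaussians.
x = np.array([0.5, -0.2])
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
covs = np.stack([np.eye(2), np.eye(2)])
print(gaussian_mixture_density(x, weights, means, covs))
```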

Note that $Y$ is a hidden variable that we do not observe in our data. Nevertheless, we introduce $Y$ since it helps us describe a simple parametric form of the probability of $X$.

More generally, let $\theta$ be the parameters of the joint distribution of $X$ and $Y$ (e.g., in the preceding example, $\theta$ consists of $c_y$, $\mu_y$, and $\Sigma_y$, for all $y = 1,\ldots,k$). Then, the log-likelihood of an observation $x$ can be written as


\[
\log\left( P_\theta[X=x] \right) \;=\; \log\left( \sum_{y=1}^{k} P_\theta[X=x,\, Y=y] \right).
\]

Given an i.i.d. sample, $S = (x_1, \ldots, x_m)$, we would like to find $\theta$ that maximizes the log-likelihood of $S$,


\[
L(\theta) \;=\; \log\!\left( \prod_{i=1}^{m} P_\theta[X=x_i] \right)
\;=\; \sum_{i=1}^{m} \log P_\theta[X=x_i]
\;=\; \sum_{i=1}^{m} \log\left( \sum_{y=1}^{k} P_\theta[X=x_i,\, Y=y] \right).
\]
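For concreteness, here is a sketch (not from the text) of computing $L(\theta)$ in the Gaussian mixture example; the summation inside the logarithm is handled with a log-sum-exp to avoid numerical underflow. The function and parameter names are assumptions of this illustration.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, weights, means, covs):
    """L(theta) = sum_i log( sum_y c_y * N(x_i; mu_y, Sigma_y) ), computed stably.

    X: shape (m, d) -- the sample S, one observation x_i per row.
    """
    # log_joint[i, y] = log P_theta[X = x_i, Y = y] = log c_y + log N(x_i; mu_y, Sigma_y)
    log_joint = np.column_stack([
        np.log(c) + multivariate_normal(mean=mu, cov=sigma).logpdf(X)
        for c, mu, sigma in zip(weights, means, covs)
    ])
    # log-sum-exp over y inside the log, then sum over the m observations.
    return logsumexp(log_joint, axis=1).sum()
```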

The maximum-likelihood estimator is therefore the solution of the maximization
problem


\[
\operatorname*{argmax}_{\theta}\; L(\theta) \;=\; \operatorname*{argmax}_{\theta}\; \sum_{i=1}^{m} \log\left( \sum_{y=1}^{k} P_\theta[X=x_i,\, Y=y] \right).
\]

In many situations, the summation inside the log makes the preceding optimization problem computationally hard. The Expectation-Maximization (EM) algorithm, due to Dempster, Laird, and Rubin, is an iterative procedure for searching for a (local) maximum of $L(\theta)$. While EM is not guaranteed to find the global maximum, it often works reasonably well in practice.

EM is designed for those cases in which, had we known the values of the latent variables $Y$, the maximum likelihood optimization problem would have been tractable. More precisely, define the following function over $m \times k$ matrices and the set of parameters $\theta$:


\[
F(Q, \theta) \;=\; \sum_{i=1}^{m} \sum_{y=1}^{k} Q_{i,y} \, \log\left( P_\theta[X=x_i,\, Y=y] \right).
\]
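A minimal sketch of evaluating $F(Q,\theta)$ in the Gaussian mixture example follows; again the names (`F`, `weights`, `means`, `covs`) are assumptions of this illustration, with $Q$ an $m \times k$ matrix as in the definition above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def F(Q, X, weights, means, covs):
    """F(Q, theta) = sum_{i, y} Q[i, y] * log P_theta[X = x_i, Y = y].

    Q: shape (m, k) -- a matrix of nonnegative entries (in EM, each row sums to 1).
    X: shape (m, d) -- the sample, one observation x_i per row.
    """
    # log_joint[i, y] = log c_y + log N(x_i; mu_y, Sigma_y)
    log_joint = np.column_stack([
        np.log(c) + multivariate_normal(mean=mu, cov=sigma).logpdf(X)
        for c, mu, sigma in zip(weights, means, covs)
    ])
    return np.sum(Q * log_joint)
```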