Pattern Recognition and Machine Learning

14. COMBINING MODELS

Figure 14.7 Probabilistic directed graph representing a mixture of linear regression models, defined by (14.35). [Graphical model: nodes $z_n$, $t_n$, and $\boldsymbol\phi_n$ inside a plate over the $N$ data points, with parameters $\mathbf{W}$, $\beta$, and $\pi$.]

The EM algorithm begins by first choosing an initial value $\boldsymbol\theta^{\text{old}}$ for the model parameters. In the E step, these parameter values are then used to evaluate the posterior probabilities, or responsibilities, of each component $k$ for every data point $n$, given by
$$
\gamma_{nk} = \mathbb{E}[z_{nk}] = p(k \mid \boldsymbol\phi_n, \boldsymbol\theta^{\text{old}}) = \frac{\pi_k \,\mathcal{N}\!\left(t_n \mid \mathbf{w}_k^{\mathrm{T}}\boldsymbol\phi_n, \beta^{-1}\right)}{\sum_j \pi_j \,\mathcal{N}\!\left(t_n \mid \mathbf{w}_j^{\mathrm{T}}\boldsymbol\phi_n, \beta^{-1}\right)}. \tag{14.37}
$$
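A minimal NumPy sketch of this E step may help make the computation concrete; the array names `Phi`, `t`, `W`, `pi`, `beta` and the helper `e_step` are illustrative conventions of this sketch, not notation defined in the text.

```python
import numpy as np
from scipy.stats import norm

def e_step(Phi, t, W, pi, beta):
    """Evaluate the responsibilities gamma[n, k] of (14.37).

    Phi  : (N, M) design matrix whose rows are the basis vectors phi_n
    t    : (N,)   target values
    W    : (K, M) weight vectors w_k, one component per row
    pi   : (K,)   mixing coefficients
    beta : float  shared noise precision
    """
    means = Phi @ W.T                         # (N, K) predictive means w_k^T phi_n
    # Numerators pi_k * N(t_n | w_k^T phi_n, beta^{-1}), broadcast over components
    numer = pi * norm.pdf(t[:, None], loc=means, scale=np.sqrt(1.0 / beta))
    return numer / numer.sum(axis=1, keepdims=True)   # normalize over k for each n
```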
The responsibilities are then used to determine the expectation, with respect to the posterior distribution $p(\mathbf{Z} \mid \mathbf{t}, \boldsymbol\theta^{\text{old}})$, of the complete-data log likelihood, which takes the form
$$
Q(\boldsymbol\theta, \boldsymbol\theta^{\text{old}}) = \mathbb{E}_{\mathbf{Z}}[\ln p(\mathbf{t}, \mathbf{Z} \mid \boldsymbol\theta)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}\!\left(t_n \mid \mathbf{w}_k^{\mathrm{T}}\boldsymbol\phi_n, \beta^{-1}\right) \right\}.
$$
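Continuing the same hypothetical array conventions, this expected complete-data log likelihood is just a responsibility-weighted sum and can be evaluated directly, which is convenient for monitoring the EM iterations:

```python
import numpy as np
from scipy.stats import norm

def expected_complete_log_likelihood(Phi, t, W, pi, beta, gamma):
    """Q(theta, theta_old) = sum_{n,k} gamma[n,k] * (ln pi_k + ln N(t_n | w_k^T phi_n, 1/beta))."""
    means = Phi @ W.T                                              # (N, K)
    log_dens = norm.logpdf(t[:, None], loc=means, scale=np.sqrt(1.0 / beta))
    return np.sum(gamma * (np.log(pi) + log_dens))
```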
In the M step, we maximize the function $Q(\boldsymbol\theta, \boldsymbol\theta^{\text{old}})$ with respect to $\boldsymbol\theta$, keeping the $\gamma_{nk}$ fixed. For the optimization with respect to the mixing coefficients $\pi_k$ we need to take account of the constraint $\sum_k \pi_k = 1$, which can be done with the aid of a Lagrange multiplier (Exercise 14.14), leading to an M-step re-estimation equation for $\pi_k$ in the form
$$
\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}. \tag{14.38}
$$

Note that this has exactly the same form as the corresponding result for a simple
mixture of unconditional Gaussians given by (9.22).
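In code, the update (14.38) is just a column average of the responsibility matrix; assuming `gamma` is the (N, K) array returned by the E-step sketch above:

```python
def m_step_pi(gamma):
    """Re-estimate the mixing coefficients via (14.38): pi_k = (1/N) * sum_n gamma[n, k]."""
    return gamma.mean(axis=0)
```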
Next consider the maximization with respect to the parameter vector $\mathbf{w}_k$ of the $k$th linear regression model. Substituting for the Gaussian distribution, we see that the function $Q(\boldsymbol\theta, \boldsymbol\theta^{\text{old}})$, as a function of the parameter vector $\mathbf{w}_k$, takes the form
$$
Q(\boldsymbol\theta, \boldsymbol\theta^{\text{old}}) = \sum_{n=1}^{N} \gamma_{nk} \left\{ -\frac{\beta}{2} \left( t_n - \mathbf{w}_k^{\mathrm{T}}\boldsymbol\phi_n \right)^2 \right\} + \text{const} \tag{14.39}
$$

where the constant term includes the contributions from the other weight vectors $\mathbf{w}_j$ for $j \neq k$. Note that the quantity we are maximizing is similar to the (negative of the) standard sum-of-squares error (3.12) for a single linear regression model, but with the inclusion of the responsibilities $\gamma_{nk}$. This represents a weighted least squares problem, in which each data point $n$ is weighted by its responsibility $\gamma_{nk}$.
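Setting the derivative of (14.39) with respect to $\mathbf{w}_k$ to zero leads to a set of weighted normal equations. The closed-form solution is not quoted in this excerpt, so the following sketch, again under the hypothetical array conventions used above, simply solves the weighted least squares problem implied by (14.39):

```python
import numpy as np

def m_step_w(Phi, t, gamma):
    """Weighted least squares update of each w_k obtained by maximizing (14.39).

    Each data point n enters the normal equations for component k with
    weight gamma[n, k]; the common precision beta cancels out of the solution.
    """
    K = gamma.shape[1]
    W = np.empty((K, Phi.shape[1]))
    for k in range(K):
        r = gamma[:, k]                          # responsibilities act as weights
        A = Phi.T @ (r[:, None] * Phi)           # Phi^T R_k Phi with R_k = diag(r)
        b = Phi.T @ (r * t)                      # Phi^T R_k t
        W[k] = np.linalg.solve(A, b)
    return W
```

Iterating `e_step`, `m_step_pi`, and `m_step_w` (together with a corresponding re-estimation of $\beta$, which is not covered in this excerpt) until the expected complete-data log likelihood stabilizes gives the full EM procedure for this model.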