The M step involves maximization of this function with respect to θ, keeping θ^old, and hence γ_nk, fixed. Maximization with respect to π_k can be done in the usual way, with a Lagrange multiplier to enforce the summation constraint Σ_k π_k = 1, giving the familiar result

\[
\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}. \tag{14.50}
\]
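As a concrete illustration, here is a minimal sketch of this update, assuming the responsibilities are stored in an N × K NumPy array named gamma (the array name is an assumption made for this example, not notation from the text):

```python
import numpy as np

def update_mixing_coefficients(gamma):
    """M-step update (14.50): pi_k = (1/N) * sum_n gamma_nk.

    gamma : responsibilities, array of shape (N, K).
    Returns an array of shape (K,) whose entries sum to one,
    since each row of gamma sums to one.
    """
    return gamma.mean(axis=0)
```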

To determine the {w_k}, we note that the Q(θ, θ^old) function comprises a sum over terms indexed by k, each of which depends only on one of the vectors w_k, so that the different vectors are decoupled in the M step of the EM algorithm. In other words, the different components interact only via the responsibilities, which are fixed during the M step. Note that the M step does not have a closed-form solution and must be solved iteratively using, for instance, the iterative reweighted least squares (IRLS) algorithm of Section 4.3.3. The gradient and the Hessian for the vector w_k are given by

\[
\nabla_k Q = \sum_{n=1}^{N} \gamma_{nk} (t_n - y_{nk}) \phi_n \tag{14.51}
\]
\[
H_k = -\nabla_k \nabla_k Q = \sum_{n=1}^{N} \gamma_{nk} y_{nk} (1 - y_{nk}) \phi_n \phi_n^{\mathrm{T}} \tag{14.52}
\]

where ∇_k denotes the gradient with respect to w_k. For fixed γ_nk, these are independent of {w_j} for j ≠ k, and so we can solve for each w_k separately using the IRLS algorithm of Section 4.3.3. Thus the M-step equations for component k correspond simply to fitting a single logistic regression model to a weighted data set in which data point n carries a weight γ_nk. Figure 14.10 shows an example of the mixture of logistic regression models applied to a simple classification problem. The extension of this model to a mixture of softmax models for more than two classes is straightforward (Exercise 14.16).
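As a rough numerical sketch of this weighted IRLS fit, assume Phi is the N × M design matrix of basis-function values, t holds the binary targets, and gamma_k is the column of responsibilities for component k (these names are introduced here for illustration, not taken from the text):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_update_wk(Phi, t, gamma_k, w_k, n_iters=10):
    """Newton/IRLS steps for one component's weight vector w_k.

    Uses the gradient (14.51) and Hessian (14.52),
        grad = sum_n gamma_nk (t_n - y_nk) phi_n
        H_k  = sum_n gamma_nk y_nk (1 - y_nk) phi_n phi_n^T,
    i.e. a logistic regression fit on a data set in which point n
    carries weight gamma_nk.
    """
    for _ in range(n_iters):
        y = sigmoid(Phi @ w_k)                    # y_nk for all n
        grad = Phi.T @ (gamma_k * (t - y))        # (14.51)
        R = gamma_k * y * (1.0 - y)               # effective IRLS weights
        H = Phi.T @ (R[:, None] * Phi)            # (14.52), positive definite
        H += 1e-10 * np.eye(Phi.shape[1])         # small ridge for numerical safety
        w_k = w_k + np.linalg.solve(H, grad)      # Newton step that increases Q
    return w_k
```

Running this update once for each component, together with the π_k update above, completes one M step.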


14.5.3 Mixtures of experts


In Section 14.5.1, we considered a mixture of linear regression models, and in
Section 14.5.2 we discussed the analogous mixture of linear classifiers. Although
these simple mixtures extend the flexibility of linear models to include more com-
plex (e.g., multimodal) predictive distributions, they are still very limited. We can
further increase the capability of such models by allowing the mixing coefficients
themselves to be functions of the input variable, so that

\[
p(t \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, p_k(t \mid \mathbf{x}). \tag{14.53}
\]

This is known as a mixture of experts model (Jacobs et al., 1991) in which the mixing coefficients π_k(x) are known as gating functions and the individual component densities p_k(t|x) are called experts. The notion behind the terminology is that different components can model the distribution in different regions of input space (they
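As a minimal sketch of how (14.53) can be evaluated, assume softmax gating functions that are linear in x and Gaussian experts with linear means (these specific functional forms, and all variable names, are assumptions made for this example rather than forms defined in the text):

```python
import numpy as np

def softmax(a):
    a = a - a.max()                   # numerical stability
    e = np.exp(a)
    return e / e.sum()

def moe_predictive_density(x, t, V, W, sigma2):
    """Evaluate p(t | x) = sum_k pi_k(x) p_k(t | x), as in (14.53).

    Assumed forms (illustrative only):
      gating : pi_k(x) = softmax_k(v_k^T x), with V of shape (K, D)
      experts: p_k(t | x) = N(t | w_k^T x, sigma2), with W of shape (K, D)
    """
    gate = softmax(V @ x)                                   # pi_k(x), shape (K,)
    means = W @ x                                           # expert means, shape (K,)
    expert = np.exp(-0.5 * (t - means) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)
    return float(gate @ expert)                             # mixture density at (x, t)
```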