Pattern Recognition and Machine Learning

654 14. COMBINING MODELS

model combination is to select one of the models to make the prediction, in which
the choice of model is a function of the input variables. Thus different models be-
come responsible for making predictions in different regions of input space. One
widely used framework of this kind is known as a decision tree, in which the selection process can be described as a sequence of binary selections corresponding to
the traversal of a tree structure and is discussed in Section 14.4. In this case, the
individual models are generally chosen to be very simple, and the overall flexibility
of the model arises from the input-dependent selection process. Decision trees can
be applied to both classification and regression problems.
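The hard, input-dependent selection described above can be sketched as a tiny hand-built tree; the thresholds and per-region predictions here are illustrative only, not taken from the text:

```python
def predict(x):
    """Hard-split model selection: a sequence of binary tests on the
    input x routes it to exactly one simple model (here a constant),
    so different models are responsible for different regions."""
    if x < 0.5:          # first binary selection
        if x < 0.2:      # second binary selection
            return 1.0   # model for region x < 0.2
        return 2.0       # model for region 0.2 <= x < 0.5
    return 3.0           # model for region x >= 0.5
```

Note that exactly one leaf model fires for any input; the softened, probabilistic alternative is discussed next.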
One limitation of decision trees is that the division of input space is based on
hard splits in which only one model is responsible for making predictions for any
given value of the input variables. The decision process can be softened by moving
to a probabilistic framework for combining models, as discussed in Section 14.5. For
example, if we have a set of K models for a conditional distribution p(t|x, k) where x is the input variable, t is the target variable, and k = 1, ..., K indexes the model,
then we can form a probabilistic mixture of the form

p(t|x) = \sum_{k=1}^{K} \pi_k(x) \, p(t|x, k)    (14.1)

in which \pi_k(x) = p(k|x) represent the input-dependent mixing coefficients. Such models can be viewed as mixture distributions in which the component densities, as well as the mixing coefficients, are conditioned on the input variables and are known as mixtures of experts. They are closely related to the mixture density network model
discussed in Section 5.6.
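Equation (14.1) can be sketched directly in code. The choices below are assumptions for illustration: a softmax gating network gives \pi_k(x), and each expert is a linear-Gaussian density p(t|x, k) = N(t | w_k x, \sigma^2); neither form is prescribed by the text.

```python
import math

def softmax(zs):
    # Numerically stable softmax: guarantees the mixing
    # coefficients pi_k(x) are positive and sum to one.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def gauss_pdf(t, mu, sigma):
    # Univariate Gaussian density N(t | mu, sigma^2).
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_of_experts_density(t, x, gate_weights, expert_weights, sigma=1.0):
    """Evaluate p(t|x) = sum_k pi_k(x) p(t|x, k), Eq. (14.1).

    gate_weights and expert_weights are hypothetical parameters:
    pi_k(x) = softmax over gate_weights[k] * x, and expert k is
    N(t | expert_weights[k] * x, sigma^2)."""
    pis = softmax([w * x for w in gate_weights])        # pi_k(x) = p(k|x)
    return sum(pi * gauss_pdf(t, w * x, sigma)
               for pi, w in zip(pis, expert_weights))
```

Because the gate depends on x, different experts dominate in different regions of input space, but the transition between them is smooth rather than a hard split.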

14.1 Bayesian Model Averaging


It is important to distinguish between model combination methods and Bayesian
model averaging, as the two are often confused. To understand the difference, consider the example of density estimation using a mixture of Gaussians (Section 9.2) in which several
Gaussian components are combined probabilistically. The model contains a binary
latent variable z that indicates which component of the mixture is responsible for
generating the corresponding data point. Thus the model is specified in terms of a
joint distribution
p(x,z) (14.2)
and the corresponding density over the observed variable x is obtained by marginalizing over the latent variable


p(x) = \sum_{z} p(x, z).    (14.3)
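The marginalization in (14.3) is a one-line sum in code. The sketch below uses a two-component Gaussian mixture with illustrative parameters (weights, means, and variances are assumptions, not from the text):

```python
import math

def gauss_pdf(x, mu, sigma):
    # Univariate Gaussian density N(x | mu, sigma^2).
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Illustrative two-component mixture: z in {0, 1} selects the component.
weights = [0.3, 0.7]            # p(z = k)
means = [0.0, 4.0]              # component means
sigmas = [1.0, 1.0]             # component standard deviations

def joint(x, z):
    # Joint distribution p(x, z) = p(z) p(x|z), Eq. (14.2).
    return weights[z] * gauss_pdf(x, means[z], sigmas[z])

def marginal(x):
    # p(x) = sum_z p(x, z), Eq. (14.3).
    return sum(joint(x, z) for z in range(len(weights)))
```

For each observed x the latent z is summed out, so the density p(x) blends contributions from every component; this is a single model with a latent variable, not an average over alternative models.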