
[Figure 14.2: six panels showing the ensemble after m = 1, 2, 3, 6, 10, and 150 base learners, each plotted over the same region (horizontal axis from −1 to 2, vertical axis from −2 to 2).]
Figure 14.2 Illustration of boosting in which the base learners consist of simple thresholds applied to one or other of the axes. Each figure shows the number m of base learners trained so far, along with the decision boundary of the most recent base learner (dashed black line) and the combined decision boundary of the ensemble (solid green line). Each data point is depicted by a circle whose radius indicates the weight assigned to that data point when training the most recently added base learner. Thus, for instance, we see that points that are misclassified by the m = 1 base learner are given greater weight when training the m = 2 base learner.
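The base learners shown in the figure are weighted decision stumps: axis-aligned thresholds chosen to minimize the weighted misclassification error. A minimal sketch of such a base learner, assuming NumPy and using the illustrative helper names fit_stump and stump_predict (not code from the text), is:

```python
import numpy as np

def fit_stump(X, t, w):
    """Fit an axis-aligned threshold classifier (a decision stump) to weighted data.
    X is (N, D); t contains class labels in {-1, +1}; w contains the non-negative
    weights attached to the data points.  Returns the axis, threshold and sign that
    minimize the weighted misclassification error."""
    best_err, best_axis, best_theta, best_sign = np.inf, 0, 0.0, 1
    for axis in range(X.shape[1]):            # threshold on one or other of the axes
        for theta in np.unique(X[:, axis]):   # candidate thresholds at the data values
            for sign in (+1, -1):
                pred = sign * np.where(X[:, axis] > theta, 1, -1)
                err = np.sum(w * (pred != t))           # weighted misclassification
                if err < best_err:
                    best_err, best_axis, best_theta, best_sign = err, axis, theta, sign
    return best_axis, best_theta, best_sign

def stump_predict(X, axis, theta, sign):
    """Return the stump's {-1, +1} predictions."""
    return sign * np.where(X[:, axis] > theta, 1, -1)
```

Boosting then increases the weights of the points the current stump misclassifies, which is what the growing circles in the figure depict.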
Instead of doing a global error function minimization, however, we shall suppose that the base classifiers y_1(x), . . . , y_{m−1}(x) are fixed, as are their coefficients α_1, . . . , α_{m−1}, and so we are minimizing only with respect to α_m and y_m(x). Separating off the contribution from base classifier y_m(x), we can then write the error function in the form


E = \sum_{n=1}^{N} \exp\left\{ -t_n f_{m-1}(x_n) - \frac{1}{2} t_n \alpha_m y_m(x_n) \right\}

  = \sum_{n=1}^{N} w_n^{(m)} \exp\left\{ -\frac{1}{2} t_n \alpha_m y_m(x_n) \right\}        (14.22)

where the coefficients w_n^{(m)} = exp{−t_n f_{m−1}(x_n)} can be viewed as constants because we are optimizing only α_m and y_m(x). If we denote by T_m the set of data points that are correctly classified by y_m(x), and if we denote the remaining misclassified points by M_m, then we can in turn rewrite the error function in the form

E = e^{-\alpha_m/2} \sum_{n \in T_m} w_n^{(m)} + e^{\alpha_m/2} \sum_{n \in M_m} w_n^{(m)}.
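A rough numerical check of this rewriting, assuming the previous ensemble f_{m−1} and a candidate base classifier y_m are available as functions (the name exponential_error and its signature are purely illustrative, not from the text), might look as follows:

```python
import numpy as np

def exponential_error(alpha_m, y_m, f_prev, X, t):
    """Evaluate the exponential error of (14.22) for a candidate coefficient alpha_m
    and base classifier y_m, with the previous ensemble f_prev = f_{m-1} held fixed.
    t holds targets in {-1, +1}; y_m(X) returns {-1, +1}; f_prev(X) is real-valued."""
    w = np.exp(-t * f_prev(X))          # w_n^{(m)}: constants w.r.t. alpha_m and y_m
    pred = y_m(X)
    # Direct evaluation of (14.22) ...
    E_direct = np.sum(w * np.exp(-0.5 * t * alpha_m * pred))
    # ... and the equivalent split over the correctly classified set T_m
    # (factor e^{-alpha_m/2}) and the misclassified set M_m (factor e^{+alpha_m/2}).
    correct = (pred == t)
    E_split = (np.exp(-alpha_m / 2) * w[correct].sum()
               + np.exp(+alpha_m / 2) * w[~correct].sum())
    assert np.isclose(E_direct, E_split)
    return E_direct
```

For a fixed y_m, the error therefore depends on α_m only through the two weighted sums over T_m and M_m.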