10.2 AdaBoost
a distribution over the examples in $S$, denoted $D^{(t)}$. That is, $D^{(t)} \in \mathbb{R}_+^m$ and $\sum_{i=1}^m D_i^{(t)} = 1$. Then, the booster passes the distribution $D^{(t)}$ and the sample $S$ to the weak learner. (That way, the weak learner can construct i.i.d. examples according to $D^{(t)}$ and $f$.) The weak learner is assumed to return a "weak" hypothesis, $h_t$, whose error,
\[
\epsilon_t \stackrel{\text{def}}{=} L_{D^{(t)}}(h_t) \stackrel{\text{def}}{=} \sum_{i=1}^{m} D_i^{(t)} \mathbb{1}_{[h_t(x_i) \neq y_i]},
\]
is at most $\frac{1}{2} - \gamma$ (of course, there is a probability of at most $\delta$ that the weak learner fails). Then, AdaBoost assigns a weight for $h_t$ as follows: $w_t = \frac{1}{2}\log\left(\frac{1}{\epsilon_t} - 1\right)$.
That is, the weight ofhtis inversely proportional to the error ofht. At the end
of the round, AdaBoost updates the distribution so that examples on whichht
errs will get a higher probability mass while examples on whichhtis correct will
get a lower probability mass. Intuitively, this will force the weak learner to focus
on the problematic examples in the next round. The output of the AdaBoost
algorithm is a “strong” classifier that is based on a weighted sum of all the weak
hypotheses. The pseudocode of AdaBoost is presented in the following.
AdaBoost

input:
  training set $S = (x_1, y_1), \ldots, (x_m, y_m)$
  weak learner WL
  number of rounds $T$
initialize $D^{(1)} = \left(\frac{1}{m}, \ldots, \frac{1}{m}\right)$.
for $t = 1, \ldots, T$:
  invoke weak learner $h_t = \mathrm{WL}(D^{(t)}, S)$
  compute $\epsilon_t = \sum_{i=1}^{m} D_i^{(t)} \mathbb{1}_{[y_i \neq h_t(x_i)]}$
  let $w_t = \frac{1}{2}\log\left(\frac{1}{\epsilon_t} - 1\right)$
  update $D_i^{(t+1)} = \dfrac{D_i^{(t)} \exp(-w_t y_i h_t(x_i))}{\sum_{j=1}^{m} D_j^{(t)} \exp(-w_t y_j h_t(x_j))}$ for all $i = 1, \ldots, m$
output the hypothesis $h_s(x) = \mathrm{sign}\left(\sum_{t=1}^{T} w_t h_t(x)\right)$.
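The pseudocode above can be sketched directly in NumPy. The decision-stump weak learner and the toy one-dimensional dataset used below are illustrative assumptions, not part of the text; any weak learner that accepts a distribution over the sample would do in its place.

```python
import numpy as np

def stump_learner(D, X, y):
    # Illustrative weak learner (not from the text): exhaustively pick the
    # decision stump minimizing the D-weighted 0-1 error on the sample.
    best, best_err = None, np.inf
    for j in range(X.shape[1]):              # feature to threshold on
        for theta in np.unique(X[:, j]):     # candidate thresholds
            for s in (1, -1):                # orientation of the stump
                pred = s * np.where(X[:, j] <= theta, 1, -1)
                err = D[pred != y].sum()     # weighted error under D
                if err < best_err:
                    best_err, best = err, (j, theta, s)
    j, theta, s = best
    return lambda Z: s * np.where(Z[:, j] <= theta, 1, -1)

def adaboost(X, y, T):
    # Assumes each round's weighted error satisfies 0 < eps_t < 1/2,
    # so that w_t is finite and positive.
    m = len(y)
    D = np.full(m, 1.0 / m)                  # D^(1): uniform distribution
    hyps, weights = [], []
    for t in range(T):
        h = stump_learner(D, X, y)           # h_t = WL(D^(t), S)
        pred = h(X)
        eps = D[pred != y].sum()             # epsilon_t
        w = 0.5 * np.log(1.0 / eps - 1.0)    # w_t = (1/2) log(1/eps_t - 1)
        D = D * np.exp(-w * y * pred)        # upweight the mistakes of h_t
        D = D / D.sum()                      # renormalize to a distribution
        hyps.append(h)
        weights.append(w)
    # strong classifier: sign of the weighted sum of weak hypotheses
    return lambda Z: np.sign(sum(w * h(Z) for w, h in zip(weights, hyps)))
```

On a toy labeling such as $y = (+,+,-,-,+,+)$ over the points $0, \ldots, 5$, no single stump is consistent, but a few boosting rounds drive the training error to zero, exactly as the update rule intends.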
The following theorem shows that the training error of the output hypothesis
decreases exponentially fast with the number of boosting rounds.
theorem 10.2 Let $S$ be a training set and assume that at each iteration of AdaBoost, the weak learner returns a hypothesis for which $\epsilon_t \le 1/2 - \gamma$. Then, the training error of the output hypothesis of AdaBoost is at most
\[
L_S(h_s) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{[h_s(x_i) \neq y_i]} \;\le\; \exp(-2\gamma^2 T).
\]
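To get a feel for the rate in the bound, a quick numeric check helps; the values of $\gamma$ and $T$ below are made up purely for illustration.

```python
import math

# Illustrative values (not from the text): every weak hypothesis has an
# edge of gamma = 0.1 over random guessing, and we boost for T = 100 rounds.
gamma, T = 0.1, 100

# The theorem's bound on the training error: exp(-2 * gamma^2 * T) = exp(-2)
bound = math.exp(-2 * gamma**2 * T)
print(bound)
```

Even a modest edge of $0.1$ per round bounds the training error by $e^{-2} \approx 0.135$ after $100$ rounds, and doubling $T$ squares this factor.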
Proof For each $t$, denote $f_t = \sum_{p \le t} w_p h_p$. Therefore, the output of AdaBoost