Understanding Machine Learning: From Theory to Algorithms


Boosting


Each hypothesis in $L(B,T)$ can thus be viewed as first mapping $x$ to the vector $\psi(x) = (h_1(x),\ldots,h_T(x)) \in \mathbb{R}^T$, and then applying the (homogeneous) halfspace defined by $w$ to $\psi(x)$.
In this section we analyze the estimation error of $L(B,T)$ by bounding the VC-dimension of $L(B,T)$ in terms of the VC-dimension of $B$ and $T$. We will show that, up to logarithmic factors, the VC-dimension of $L(B,T)$ is bounded by $T$ times the VC-dimension of $B$. It follows that the estimation error of AdaBoost grows linearly with $T$. On the other hand, the empirical risk of AdaBoost decreases with $T$. In fact, as we demonstrate later, $T$ can be used to decrease the approximation error of $L(B,T)$. Therefore, the parameter $T$ of AdaBoost enables us to control the bias-complexity tradeoff.
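To see this tradeoff numerically, here is a minimal sketch (not from the text): it runs AdaBoost over decision stumps for several values of $T$ on a synthetic dataset and prints the empirical risk next to a held-out test error. It assumes scikit-learn is available, and the dataset, the split, and the values of $T$ are arbitrary illustrative choices.

```python
# Illustrative sketch (not from the text), assuming scikit-learn is installed.
# AdaBoostClassifier's default base learner is a depth-1 tree, i.e. a decision stump.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for T in (1, 5, 20, 100, 400):
    clf = AdaBoostClassifier(n_estimators=T, random_state=0).fit(X_train, y_train)
    train_err = 1.0 - clf.score(X_train, y_train)  # empirical risk, typically shrinks as T grows
    test_err = 1.0 - clf.score(X_test, y_test)     # true-risk proxy; the gap reflects estimation error
    print(f"T={T:4d}  empirical risk={train_err:.3f}  test error={test_err:.3f}")
```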
To demonstrate how the expressive power of $L(B,T)$ increases with $T$, consider the simple example in which $\mathcal{X} = \mathbb{R}$ and the base class is Decision Stumps,

$$\mathcal{H}_{\mathrm{DS1}} = \{\, x \mapsto \mathrm{sign}(x-\theta)\cdot b \;:\; \theta \in \mathbb{R},\; b \in \{\pm 1\} \,\}.$$

Note that in this one-dimensional case, $\mathcal{H}_{\mathrm{DS1}}$ is in fact equivalent to (nonhomogeneous) halfspaces on $\mathbb{R}$.
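To make this equivalence concrete, here is a minimal sketch (not from the text): it evaluates a decision stump $x \mapsto \mathrm{sign}(x-\theta)\cdot b$ and the corresponding nonhomogeneous halfspace $x \mapsto \mathrm{sign}(bx - b\theta)$ at a few points and checks that they agree. The values of $\theta$, $b$, and the sample points are arbitrary choices, and the points avoid the threshold itself, where the $\mathrm{sign}(0)$ convention matters.

```python
import numpy as np

def stump(theta, b):
    # decision stump from H_DS1: x -> sign(x - theta) * b
    return lambda x: np.where(x - theta > 0, 1, -1) * b

def halfspace(w, c):
    # nonhomogeneous halfspace on R: x -> sign(w * x + c)
    return lambda x: np.where(w * x + c > 0, 1, -1)

theta, b = 1.5, -1                          # arbitrary illustrative parameters
xs = np.array([-3.0, 0.0, 1.4, 1.6, 4.0])   # sample points away from the threshold
# a stump with threshold theta and sign b equals the halfspace with w = b and c = -b * theta
assert np.array_equal(stump(theta, b)(xs), halfspace(b, -b * theta)(xs))
print("stump and halfspace agree at all sample points")
```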
Now, consider the rather complex class (compared to halfspaces on the line) of piece-wise constant functions. Let $g_r$ be a piece-wise constant function with at most $r$ pieces; that is, there exist thresholds $-\infty = \theta_0 < \theta_1 < \theta_2 < \cdots < \theta_r = \infty$ such that

$$g_r(x) \;=\; \sum_{i=1}^{r} \alpha_i\, \mathbb{1}_{[x \in (\theta_{i-1},\theta_i]]}, \qquad \alpha_i \in \{\pm 1\}\ \ \forall i.$$

Denote by $\mathcal{G}_r$ the class of all such piece-wise constant classifiers with at most $r$ pieces.
In the following we show that $\mathcal{G}_T \subseteq L(\mathcal{H}_{\mathrm{DS1}},T)$; namely, the class of halfspaces over $T$ decision stumps yields all the piece-wise constant classifiers with at most $T$ pieces.
Indeed, without loss of generality consider any $g \in \mathcal{G}_T$ with $\alpha_t = (-1)^t$. This implies that if $x$ is in the interval $(\theta_{t-1},\theta_t]$, then $g(x) = (-1)^t$. For example, for $T = 4$ such a $g$ equals $-1$ on $(\theta_0,\theta_1]$, $+1$ on $(\theta_1,\theta_2]$, $-1$ on $(\theta_2,\theta_3]$, and $+1$ on $(\theta_3,\theta_4]$.
Now, the function

$$h(x) \;=\; \mathrm{sign}\!\left(\sum_{t=1}^{T} w_t\, \mathrm{sign}(x-\theta_{t-1})\right), \qquad (10.5)$$

where $w_1 = 0.5$ and, for $t > 1$, $w_t = (-1)^t$, is in $L(\mathcal{H}_{\mathrm{DS1}},T)$ and is equal to $g$ (see Exercise 2).
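As a quick numerical sketch (not from the text), the following snippet evaluates both $g$ and the weighted vote $h$ of equation (10.5) for the concrete choice $T = 4$ with arbitrary interior thresholds, and checks that the two agree on a grid of points. The $\mathrm{sign}(0) = -1$ convention is an assumption, chosen so the stumps match the half-open pieces $(\theta_{t-1},\theta_t]$.

```python
import numpy as np

# Illustrative setup (not from the text): T = 4 pieces with arbitrary interior thresholds.
T = 4
theta = np.array([-np.inf, -1.0, 0.5, 2.0, np.inf])  # theta_0 < theta_1 < ... < theta_T

def sgn(z):
    # sign with sgn(0) = -1, matching the half-open pieces (theta_{t-1}, theta_t]
    return np.where(z > 0, 1, -1)

def g(x):
    # piece-wise constant classifier with alpha_t = (-1)^t on piece (theta_{t-1}, theta_t]
    t = np.searchsorted(theta[1:T], x, side="left") + 1  # index of the piece containing x
    return (-1.0) ** t

def h(x):
    # weighted vote over the T decision stumps sign(x - theta_{t-1}), as in (10.5)
    w = np.array([0.5] + [(-1.0) ** t for t in range(2, T + 1)])  # w_1 = 0.5, w_t = (-1)^t
    stumps = np.stack([sgn(x - theta[t - 1]) for t in range(1, T + 1)])
    return sgn(w @ stumps)

xs = np.linspace(-5.0, 5.0, 201)
assert np.array_equal(g(xs), h(xs)), "g and h should agree at every sample point"
print("g and h agree on all", xs.size, "sample points")
```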