Understanding Machine Learning: From Theory to Algorithms


Boosting


Each hypothesis in $L(B,T)$ can thus be viewed as first mapping $x$ to the vector $\psi(x) = (h_1(x),\ldots,h_T(x)) \in \mathbb{R}^T$, and then applying the (homogeneous) halfspace defined by $w$ to $\psi(x)$.
In this section we analyze the estimation error of $L(B,T)$ by bounding the VC-dimension of $L(B,T)$ in terms of the VC-dimension of $B$ and $T$. We will show that, up to logarithmic factors, the VC-dimension of $L(B,T)$ is bounded by $T$ times the VC-dimension of $B$. It follows that the estimation error of AdaBoost grows linearly with $T$. On the other hand, the empirical risk of AdaBoost decreases with $T$. In fact, as we demonstrate later, $T$ can be used to decrease the approximation error of $L(B,T)$. Therefore, the parameter $T$ of AdaBoost enables us to control the bias-complexity tradeoff.
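To see this tradeoff numerically, here is a minimal sketch (not from the text): it runs AdaBoost over decision stumps for several values of $T$ on a synthetic dataset and prints the empirical risk next to a held-out test error. It assumes scikit-learn is available, and the dataset, the split, and the values of $T$ are arbitrary illustrative choices.

```python
# Illustrative sketch (not from the text), assuming scikit-learn is installed.
# AdaBoostClassifier's default base learner is a depth-1 tree, i.e. a decision stump.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for T in (1, 5, 20, 100, 400):
    clf = AdaBoostClassifier(n_estimators=T, random_state=0).fit(X_train, y_train)
    train_err = 1.0 - clf.score(X_train, y_train)  # empirical risk, typically shrinks as T grows
    test_err = 1.0 - clf.score(X_test, y_test)     # true-risk proxy; the gap reflects estimation error
    print(f"T={T:4d}  empirical risk={train_err:.3f}  test error={test_err:.3f}")
```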
To demonstrate how the expressive power of $L(B,T)$ increases with $T$, consider the simple example in which $\mathcal{X} = \mathbb{R}$ and the base class is Decision Stumps,

$$\mathcal{H}_{\mathrm{DS1}} = \{\, x \mapsto \mathrm{sign}(x-\theta)\cdot b \;:\; \theta \in \mathbb{R},\; b \in \{\pm 1\} \,\}.$$

Note that in this one-dimensional case, $\mathcal{H}_{\mathrm{DS1}}$ is in fact equivalent to (nonhomogeneous) halfspaces on $\mathbb{R}$.
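To make this equivalence concrete, here is a minimal sketch (not from the text): it evaluates a decision stump $x \mapsto \mathrm{sign}(x-\theta)\cdot b$ and the corresponding nonhomogeneous halfspace $x \mapsto \mathrm{sign}(bx - b\theta)$ at a few points and checks that they agree. The values of $\theta$, $b$, and the sample points are arbitrary choices, and the points avoid the threshold itself, where the $\mathrm{sign}(0)$ convention matters.

```python
import numpy as np

def stump(theta, b):
    # decision stump from H_DS1: x -> sign(x - theta) * b
    return lambda x: np.where(x - theta > 0, 1, -1) * b

def halfspace(w, c):
    # nonhomogeneous halfspace on R: x -> sign(w * x + c)
    return lambda x: np.where(w * x + c > 0, 1, -1)

theta, b = 1.5, -1                          # arbitrary illustrative parameters
xs = np.array([-3.0, 0.0, 1.4, 1.6, 4.0])   # sample points away from the threshold
# a stump with threshold theta and sign b equals the halfspace with w = b and c = -b * theta
assert np.array_equal(stump(theta, b)(xs), halfspace(b, -b * theta)(xs))
print("stump and halfspace agree at all sample points")
```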
Now, consider the rather complex class (compared to halfspaces on the line) of piece-wise constant functions. Let $g_r$ be a piece-wise constant function with at most $r$ pieces; that is, there exist thresholds $-\infty = \theta_0 < \theta_1 < \theta_2 < \cdots < \theta_r = \infty$ such that

$$g_r(x) \;=\; \sum_{i=1}^{r} \alpha_i\, \mathbb{1}_{[x \in (\theta_{i-1},\theta_i]]}, \qquad \alpha_i \in \{\pm 1\}\ \ \forall i.$$

Denote by $\mathcal{G}_r$ the class of all such piece-wise constant classifiers with at most $r$ pieces.
In the following we show that $\mathcal{G}_T \subseteq L(\mathcal{H}_{\mathrm{DS1}},T)$; namely, the class of halfspaces over $T$ decision stumps yields all the piece-wise constant classifiers with at most $T$ pieces.
Indeed, without loss of generality consider any $g \in \mathcal{G}_T$ with $\alpha_t = (-1)^t$. This implies that if $x$ is in the interval $(\theta_{t-1},\theta_t]$, then $g(x) = (-1)^t$. For example, for $T = 4$ such a $g$ equals $-1$ on $(\theta_0,\theta_1]$, $+1$ on $(\theta_1,\theta_2]$, $-1$ on $(\theta_2,\theta_3]$, and $+1$ on $(\theta_3,\theta_4]$.
Now, the function

$$h(x) \;=\; \mathrm{sign}\!\left(\sum_{t=1}^{T} w_t\, \mathrm{sign}(x-\theta_{t-1})\right), \qquad (10.5)$$

where $w_1 = 0.5$ and, for $t > 1$, $w_t = (-1)^t$, is in $L(\mathcal{H}_{\mathrm{DS1}},T)$ and is equal to $g$ (see Exercise 2).
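As a quick numerical sketch (not from the text), the following snippet evaluates both $g$ and the weighted vote $h$ of equation (10.5) for the concrete choice $T = 4$ with arbitrary interior thresholds, and checks that the two agree on a grid of points. The $\mathrm{sign}(0) = -1$ convention is an assumption, chosen so the stumps match the half-open pieces $(\theta_{t-1},\theta_t]$.

```python
import numpy as np

# Illustrative setup (not from the text): T = 4 pieces with arbitrary interior thresholds.
T = 4
theta = np.array([-np.inf, -1.0, 0.5, 2.0, np.inf])  # theta_0 < theta_1 < ... < theta_T

def sgn(z):
    # sign with sgn(0) = -1, matching the half-open pieces (theta_{t-1}, theta_t]
    return np.where(z > 0, 1, -1)

def g(x):
    # piece-wise constant classifier with alpha_t = (-1)^t on piece (theta_{t-1}, theta_t]
    t = np.searchsorted(theta[1:T], x, side="left") + 1  # index of the piece containing x
    return (-1.0) ** t

def h(x):
    # weighted vote over the T decision stumps sign(x - theta_{t-1}), as in (10.5)
    w = np.array([0.5] + [(-1.0) ** t for t in range(2, T + 1)])  # w_1 = 0.5, w_t = (-1)^t
    stumps = np.stack([sgn(x - theta[t - 1]) for t in range(1, T + 1)])
    return sgn(w @ stumps)

xs = np.linspace(-5.0, 5.0, 201)
assert np.array_equal(g(xs), h(xs)), "g and h should agree at every sample point"
print("g and h agree on all", xs.size, "sample points")
```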