Understanding Machine Learning: From Theory to Algorithms

10.1 Weak Learnability

To see that, we first show that for every distribution that is consistent with $\mathcal{H}$, there exists a decision stump with $L_{\mathcal{D}}(h) \le 1/3$. Indeed, just note that every classifier in $\mathcal{H}$ consists of three regions (two unbounded rays and a center interval) with alternate labels. For any pair of such regions, there exists a decision stump that agrees with the labeling of these two components. Note that for every distribution $\mathcal{D}$ over $\mathbb{R}$ and every partitioning of the line into three such regions, one of these regions must have $\mathcal{D}$-weight of at most $1/3$. Let $h \in \mathcal{H}$ be a zero error hypothesis. A decision stump that disagrees with $h$ only on such a region has an error of at most $1/3$.
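In symbols, if $p_1, p_2, p_3$ denote the $\mathcal{D}$-weights of the three regions, the lightest-region claim is just the pigeonhole bound

$$
\min\{p_1, p_2, p_3\} \;\le\; \frac{p_1 + p_2 + p_3}{3} \;=\; \frac{1}{3}.
$$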
Finally, since the VC-dimension of decision stumps is 2, if the sample size is greater than $\Omega(\log(1/\delta)/\epsilon^2)$, then with probability of at least $1-\delta$, the $\mathrm{ERM}_B$ rule returns a hypothesis with an error of at most $1/3 + \epsilon$. Setting $\epsilon = 1/12$ we obtain that the error of $\mathrm{ERM}_B$ is at most $1/3 + 1/12 = 1/2 - 1/12$.
We see that $\mathrm{ERM}_B$ is a $\gamma$-weak learner for $\mathcal{H}$. We next show how to implement the ERM rule efficiently for decision stumps.

10.1.1 Efficient Implementation of ERM for Decision Stumps


Let $\mathcal{X} = \mathbb{R}^d$ and consider the base hypothesis class of decision stumps over $\mathbb{R}^d$, namely,

$$
\mathcal{H}_{DS} = \{\, x \mapsto \mathrm{sign}(\theta - x_i) \cdot b \;:\; \theta \in \mathbb{R},\; i \in [d],\; b \in \{\pm 1\} \,\}.
$$
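For concreteness, here is a minimal Python sketch of how a single hypothesis in $\mathcal{H}_{DS}$ acts on a point; the function name and the tie-breaking convention at $\theta - x_i = 0$ are illustrative choices, not part of the definition above.

```python
import numpy as np

def stump_predict(x, theta, i, b):
    """Evaluate the decision stump x -> sign(theta - x_i) * b.

    x     : length-d array (a single example)
    theta : threshold in R
    i     : coordinate index (0-based here, [d] in the text)
    b     : orientation, +1 or -1
    """
    # Tie-break sign(0) as +1; any fixed convention works for the analysis.
    s = 1 if theta - x[i] >= 0 else -1
    return b * s

# Example: threshold 0.5 on coordinate 0, orientation b = +1.
x = np.array([0.2, 1.7])
print(stump_predict(x, theta=0.5, i=0, b=1))   # prints 1, since x_0 <= 0.5
```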

For simplicity, assume that $b = 1$; that is, we focus on all the hypotheses in $\mathcal{H}_{DS}$ of the form $\mathrm{sign}(\theta - x_i)$. Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ be a training set. We will show how to implement an ERM rule, namely, how to find a decision stump that minimizes $L_S(h)$. Furthermore, since in the next section we will show that AdaBoost requires finding a hypothesis with a small risk relative to some distribution over $S$, we will show here how to minimize such risk functions. Concretely, let $D$ be a probability vector in $\mathbb{R}^m$ (that is, all elements of $D$ are nonnegative and $\sum_i D_i = 1$). The weak learner we describe later receives $D$ and $S$ and outputs a decision stump $h : \mathcal{X} \to \mathcal{Y}$ that minimizes the risk w.r.t. $D$,
$$
L_D(h) = \sum_{i=1}^{m} D_i \, \mathbf{1}_{[h(x_i) \ne y_i]}.
$$

Note that if $D = (1/m, \ldots, 1/m)$ then $L_D(h) = L_S(h)$.
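A minimal sketch of this weighted risk, assuming NumPy arrays: `X` holds the $m$ examples as rows, `y` holds labels in $\{\pm 1\}$, and `D` is the probability vector; the name `weighted_stump_risk` is illustrative and only the $b = 1$ case is shown.

```python
import numpy as np

def weighted_stump_risk(X, y, D, j, theta):
    """Return L_D(h) = sum_i D_i * 1[h(x_i) != y_i] for the stump
    h(x) = sign(theta - x_j)  (the b = 1 case)."""
    preds = np.where(theta - X[:, j] >= 0, 1, -1)
    return float(np.sum(D * (preds != y)))

# With the uniform vector D = (1/m, ..., 1/m) this is exactly L_S(h).
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([1, 1, -1, -1])
D = np.full(4, 0.25)
print(weighted_stump_risk(X, y, D, j=0, theta=0.5))   # 0.0: this stump separates S
```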
Recall that each decision stump is parameterized by an index $j \in [d]$ and a threshold $\theta$. Therefore, minimizing $L_D(h)$ amounts to solving the problem

$$
\min_{j \in [d]} \; \min_{\theta \in \mathbb{R}} \; \left( \sum_{i : y_i = 1} D_i \, \mathbf{1}_{[x_{i,j} > \theta]} \;+\; \sum_{i : y_i = -1} D_i \, \mathbf{1}_{[x_{i,j} \le \theta]} \right). \tag{10.1}
$$

Fix $j \in [d]$ and let us sort the examples so that $x_{1,j} \le x_{2,j} \le \cdots \le x_{m,j}$. Define $\Theta_j = \left\{ \frac{x_{i,j} + x_{i+1,j}}{2} : i \in [m-1] \right\} \cup \{\, x_{1,j} - 1,\; x_{m,j} + 1 \,\}$. Note that for any $\theta \in \mathbb{R}$
there exists $\theta' \in \Theta_j$ that yields the same predictions for the sample $S$ as the threshold $\theta$.