Understanding Machine Learning: From Theory to Algorithms

10.1 Weak Learnability

To see that, we first show that for every distribution that is consistent with $\mathcal{H}$, there exists a decision stump with $L_{\mathcal{D}}(h) \le 1/3$. Indeed, just note that every classifier in $\mathcal{H}$ consists of three regions (two unbounded rays and a center interval) with alternate labels. For any pair of such regions, there exists a decision stump that agrees with the labeling of these two components. Note that for every distribution $\mathcal{D}$ over $\mathbb{R}$ and every partitioning of the line into three such regions, one of these regions must have $\mathcal{D}$-weight of at most $1/3$. Let $h \in \mathcal{H}$ be a zero error hypothesis. A decision stump that disagrees with $h$ only on such a region has an error of at most $1/3$.
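In symbols, if $p_1, p_2, p_3$ denote the $\mathcal{D}$-weights of the three regions, the lightest-region claim is just the pigeonhole bound

$$
\min\{p_1, p_2, p_3\} \;\le\; \frac{p_1 + p_2 + p_3}{3} \;=\; \frac{1}{3}.
$$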
Finally, since the VC-dimension of decision stumps is 2, if the sample size is greater than $\Omega(\log(1/\delta)/\epsilon^2)$, then with probability of at least $1-\delta$, the $\mathrm{ERM}_B$ rule returns a hypothesis with an error of at most $1/3 + \epsilon$. Setting $\epsilon = 1/12$ we obtain that the error of $\mathrm{ERM}_B$ is at most $1/3 + 1/12 = 1/2 - 1/12$.
We see that $\mathrm{ERM}_B$ is a $\gamma$-weak learner for $\mathcal{H}$. We next show how to implement the ERM rule efficiently for decision stumps.

10.1.1 Efficient Implementation of ERM for Decision Stumps


Let $\mathcal{X} = \mathbb{R}^d$ and consider the base hypothesis class of decision stumps over $\mathbb{R}^d$, namely,

$$
\mathcal{H}_{DS} = \{\, x \mapsto \mathrm{sign}(\theta - x_i) \cdot b \;:\; \theta \in \mathbb{R},\; i \in [d],\; b \in \{\pm 1\} \,\}.
$$
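For concreteness, here is a minimal Python sketch of how a single hypothesis in $\mathcal{H}_{DS}$ acts on a point; the function name and the tie-breaking convention at $\theta - x_i = 0$ are illustrative choices, not part of the definition above.

```python
import numpy as np

def stump_predict(x, theta, i, b):
    """Evaluate the decision stump x -> sign(theta - x_i) * b.

    x     : length-d array (a single example)
    theta : threshold in R
    i     : coordinate index (0-based here, [d] in the text)
    b     : orientation, +1 or -1
    """
    # Tie-break sign(0) as +1; any fixed convention works for the analysis.
    s = 1 if theta - x[i] >= 0 else -1
    return b * s

# Example: threshold 0.5 on coordinate 0, orientation b = +1.
x = np.array([0.2, 1.7])
print(stump_predict(x, theta=0.5, i=0, b=1))   # prints 1, since x_0 <= 0.5
```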

For simplicity, assume that $b = 1$; that is, we focus on all the hypotheses in $\mathcal{H}_{DS}$ of the form $\mathrm{sign}(\theta - x_i)$. Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ be a training set. We will show how to implement an ERM rule, namely, how to find a decision stump that minimizes $L_S(h)$. Furthermore, since in the next section we will show that AdaBoost requires finding a hypothesis with a small risk relative to some distribution over $S$, we will show here how to minimize such risk functions. Concretely, let $D$ be a probability vector in $\mathbb{R}^m$ (that is, all elements of $D$ are nonnegative and $\sum_i D_i = 1$). The weak learner we describe later receives $D$ and $S$ and outputs a decision stump $h : \mathcal{X} \to \mathcal{Y}$ that minimizes the risk w.r.t. $D$,
$$
L_D(h) = \sum_{i=1}^{m} D_i \, \mathbf{1}_{[h(x_i) \ne y_i]}.
$$

Note that if $D = (1/m, \ldots, 1/m)$ then $L_D(h) = L_S(h)$.
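A minimal sketch of this weighted risk, assuming NumPy arrays: `X` holds the $m$ examples as rows, `y` holds labels in $\{\pm 1\}$, and `D` is the probability vector; the name `weighted_stump_risk` is illustrative and only the $b = 1$ case is shown.

```python
import numpy as np

def weighted_stump_risk(X, y, D, j, theta):
    """Return L_D(h) = sum_i D_i * 1[h(x_i) != y_i] for the stump
    h(x) = sign(theta - x_j)  (the b = 1 case)."""
    preds = np.where(theta - X[:, j] >= 0, 1, -1)
    return float(np.sum(D * (preds != y)))

# With the uniform vector D = (1/m, ..., 1/m) this is exactly L_S(h).
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([1, 1, -1, -1])
D = np.full(4, 0.25)
print(weighted_stump_risk(X, y, D, j=0, theta=0.5))   # 0.0: this stump separates S
```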
Recall that each decision stump is parameterized by an index $j \in [d]$ and a threshold $\theta$. Therefore, minimizing $L_D(h)$ amounts to solving the problem

$$
\min_{j \in [d]} \; \min_{\theta \in \mathbb{R}} \; \left( \sum_{i : y_i = 1} D_i \, \mathbf{1}_{[x_{i,j} > \theta]} \;+\; \sum_{i : y_i = -1} D_i \, \mathbf{1}_{[x_{i,j} \le \theta]} \right). \tag{10.1}
$$

Fix $j \in [d]$ and let us sort the examples so that $x_{1,j} \le x_{2,j} \le \cdots \le x_{m,j}$. Define $\Theta_j = \left\{ \frac{x_{i,j} + x_{i+1,j}}{2} : i \in [m-1] \right\} \cup \{\, x_{1,j} - 1,\; x_{m,j} + 1 \,\}$. Note that for any $\theta \in \mathbb{R}$
there exists $\theta' \in \Theta_j$ that yields the same predictions for the sample $S$ as the threshold $\theta$.