Understanding Machine Learning: From Theory to Algorithms

$x = j$. Then, it is easy to show that $\mathrm{Ldim}(\mathcal{H}) = 1$ while $|\mathcal{H}| = d$ can be arbitrarily large. Therefore, this example shows that $\mathrm{Ldim}(\mathcal{H})$ can be significantly smaller than $\log_2(|\mathcal{H}|)$.
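
To make this concrete, the Littlestone dimension of a small finite class can be computed by brute force from its recursive structure: $\mathrm{Ldim}(V) \ge d + 1$ iff some instance $x$ splits $V$ into two nonempty subclasses, each with $\mathrm{Ldim}$ at least $d$. The Python sketch below (a hypothetical helper, exponential time, suitable for toy classes only) implements this recursion and verifies the example above:

    def ldim(V, X):
        """Littlestone dimension of a finite class V over a finite domain X,
        via brute-force recursion (exponential; for illustration only)."""
        if not V:
            return -1  # convention: the empty class shatters no tree
        best = 0
        for x in X:
            V0 = [h for h in V if h(x) == 0]
            V1 = [h for h in V if h(x) == 1]
            if V0 and V1:  # x can serve as the root of a shattered tree
                best = max(best, 1 + min(ldim(V0, X), ldim(V1, X)))
        return best

    # The class {h_j : h_j(x) = 1 iff x = j} over X = {0, ..., d-1}:
    d = 5
    X = list(range(d))
    H = [(lambda j: (lambda x: int(x == j)))(j) for j in range(d)]
    print(ldim(H, X))  # prints 1, even though |H| = d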

Example 21.4 Let $\mathcal{X} = [0,1]$ and $\mathcal{H} = \{x \mapsto \mathbb{1}_{[x < a]} : a \in [0,1]\}$; namely, $\mathcal{H}$ is the class of thresholds on the interval $[0,1]$. Then, $\mathrm{Ldim}(\mathcal{H}) = \infty$. To see this, consider the tree
[Figure: a complete binary tree with root labeled 1/2, its children labeled 1/4 and 3/4, and its grandchildren labeled 1/8, 3/8, 5/8, 7/8; the labels at depth $k$ are the odd multiples of $2^{-(k+1)}$.]

This tree is shattered by $\mathcal{H}$: every root-to-leaf sequence of labels is realized by some threshold $a \in [0,1]$, since each answer narrows the set of consistent thresholds to a subinterval of positive length. And, because of the density of the reals, this construction can be made arbitrarily deep.
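
Concretely, the tree encodes an adversary that defeats any deterministic online learner: it presents the midpoint of the interval of still-consistent thresholds and answers with the opposite of the learner's prediction. A hypothetical Python sketch (the learner interface predict/update is assumed for illustration):

    def force_mistakes(learner, T):
        """Force T mistakes on the threshold class h_a(x) = 1[x < a]."""
        lo, hi = 0.0, 1.0          # thresholds in (lo, hi] stay consistent
        for _ in range(T):
            x = (lo + hi) / 2      # midpoint: the next node of the tree
            p = learner.predict(x)
            y = 1 - p              # answer opposite to the prediction
            if y == 1:
                lo = x             # label 1 means x < a, so a > x
            else:
                hi = x             # label 0 means a <= x
            learner.update(x, y)
        return T                   # the learner erred on every round

Since the surviving interval always has positive length, some threshold in $\mathcal{H}$ is consistent with all the labels, so no finite mistake bound is possible.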
Lemma 21.6 states that $\mathrm{Ldim}(\mathcal{H})$ lower bounds the mistake bound of any algorithm. Interestingly, there is a standard algorithm whose mistake bound matches this lower bound. The algorithm is similar to the Halving algorithm. Recall that the prediction of Halving is made according to a majority vote of the hypotheses which are consistent with previous examples. We denoted this set by $V_t$. Put another way, Halving partitions $V_t$ into two sets: $V_t^+ = \{h \in V_t : h(x_t) = 1\}$ and $V_t^- = \{h \in V_t : h(x_t) = 0\}$. It then predicts according to the larger of the two groups. The rationale behind this prediction is that whenever Halving makes a mistake it ends up with $|V_{t+1}| \le \frac{1}{2}|V_t|$.
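
For comparison, here is a minimal Python sketch of Halving over a finite class, assuming hypotheses are represented as callables mapping instances to $\{0,1\}$ (a hypothetical representation, not from the text):

    def halving_predict(V, x):
        """Majority vote of the current version space V on instance x."""
        ones = sum(h(x) for h in V)
        return 1 if 2 * ones >= len(V) else 0   # break ties toward 1

    def halving_update(V, x, y):
        """Keep only the hypotheses consistent with the revealed label."""
        return [h for h in V if h(x) == y]

On every mistake the discarded side is the (weak) majority, so $|V|$ at least halves, which gives the familiar $\log_2(|\mathcal{H}|)$ mistake bound in the realizable case.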
The optimal algorithm we present in the following uses the same idea, but instead of predicting according to the larger class, it predicts according to the class with the larger $\mathrm{Ldim}$.

Standard Optimal Algorithm (SOA)

input: A hypothesis class $\mathcal{H}$
initialize: $V_1 = \mathcal{H}$
for $t = 1, 2, \ldots$
  receive $x_t$
  for $r \in \{0, 1\}$ let $V_t^{(r)} = \{h \in V_t : h(x_t) = r\}$
  predict $p_t = \operatorname{argmax}_{r \in \{0,1\}} \mathrm{Ldim}(V_t^{(r)})$
    (in case of a tie predict $p_t = 1$)
  receive true label $y_t$
  update $V_{t+1} = \{h \in V_t : h(x_t) = y_t\}$
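
A minimal executable sketch of SOA for a finite class, reusing the brute-force ldim helper from the earlier sketch (exponential time; an illustration under those assumptions, not a practical implementation):

    class SOA:
        """Standard Optimal Algorithm over a finite hypothesis class."""
        def __init__(self, hypotheses, domain):
            self.V = list(hypotheses)   # current version space V_t
            self.X = list(domain)
        def predict(self, x):
            V1 = [h for h in self.V if h(x) == 1]
            V0 = [h for h in self.V if h(x) == 0]
            # predict the label whose restriction has the larger Ldim,
            # breaking ties toward 1, exactly as in the pseudocode
            return 1 if ldim(V1, self.X) >= ldim(V0, self.X) else 0
        def update(self, x, y):
            self.V = [h for h in self.V if h(x) == y]

On every mistake the surviving side is the one with the (weakly) smaller $\mathrm{Ldim}$, so the $\mathrm{Ldim}$ of the version space strictly decreases; hence SOA makes at most $\mathrm{Ldim}(\mathcal{H})$ mistakes in the realizable case.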

The following lemma formally establishes the optimality of the preceding algorithm.