Understanding Machine Learning: From Theory to Algorithms




size. Nevertheless, the following lemma shows that $\mathcal{H}$ is learnable in the PAC
model using the ERM algorithm.

Lemma 6.1 Let $\mathcal{H}$ be the class of thresholds as defined earlier. Then, $\mathcal{H}$ is
PAC learnable, using the ERM rule, with sample complexity of $m_{\mathcal{H}}(\epsilon,\delta) \le \lceil \log(2/\delta)/\epsilon \rceil$.
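
As a quick numerical illustration of this bound (with $\log$ denoting the natural logarithm, matching the use of $e^{-\epsilon m}$ in the proof below): taking $\epsilon = 0.1$ and $\delta = 0.1$ gives $m_{\mathcal{H}}(0.1, 0.1) \le \lceil \log(20)/0.1 \rceil = \lceil 29.96 \rceil = 30$, so thirty examples already suffice for the ERM rule to return a $0.1$-accurate threshold with probability at least $0.9$.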

Proof Let $a^*$ be a threshold such that the hypothesis $h^*(x) = \mathbb{1}_{[x < a^*]}$ achieves
$L_{\mathcal{D}}(h^*) = 0$. Let $\mathcal{D}_x$ be the marginal distribution over the domain $\mathcal{X}$ and let
$a_0 < a^* < a_1$ be such that

$$\mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a_0, a^*)] = \mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a^*, a_1)] = \epsilon.$$
(Figure: the points $a_0 < a^* < a_1$ on the real line, with each of the intervals $(a_0, a^*)$ and $(a^*, a_1)$ carrying probability mass $\epsilon$.)
(If $\mathcal{D}_x((-\infty, a^*)) \le \epsilon$ we set $a_0 = -\infty$ and similarly for $a_1$.) Given a training set
$S$, let $b_0 = \max\{x : (x,1) \in S\}$ and $b_1 = \min\{x : (x,0) \in S\}$ (if no example in $S$
is positive we set $b_0 = -\infty$, and if no example in $S$ is negative we set $b_1 = \infty$).
Let $b_S$ be a threshold corresponding to an ERM hypothesis, $h_S$, which implies
that $b_S \in (b_0, b_1)$. Therefore, a sufficient condition for $L_{\mathcal{D}}(h_S) \le \epsilon$ is that both
$b_0 \ge a_0$ and $b_1 \le a_1$. In other words,

$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \le \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0 \,\vee\, b_1 > a_1],$$

and using the union bound we can bound the preceding by
$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \le \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] + \mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1]. \tag{6.1}$$
The event $b_0 < a_0$ happens if and only if all examples in $S$ are not in the interval
$(a_0, a^*)$, whose probability mass is defined to be $\epsilon$, namely,

$$\mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] = \mathbb{P}_{S \sim \mathcal{D}^m}[\forall (x,y) \in S,\ x \notin (a_0, a^*)] = (1-\epsilon)^m \le e^{-\epsilon m}.$$
Since we assume $m > \log(2/\delta)/\epsilon$, it follows that the right-hand side is at most $\delta/2$.
In the same way it is easy to see that $\mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1] \le \delta/2$. Combining with
Equation (6.1) we conclude our proof.
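
The argument also lends itself to a direct numerical check. The following is a minimal simulation sketch, not taken from the text: it assumes a marginal distribution Uniform[0, 1] with true threshold $a^* = 0.5$ (so $L_{\mathcal{D}}(h_b) = |b - a^*|$ for $b \in [0,1]$), implements the ERM rule by picking a threshold between $b_0$ and $b_1$, and estimates how often $L_{\mathcal{D}}(h_S) > \epsilon$ when $m = \lceil \log(2/\delta)/\epsilon \rceil$. The helper names (`erm_threshold`, `true_risk`) are ad hoc.

```python
# Simulation sketch for Lemma 6.1 (illustrative assumptions: D_x = Uniform[0, 1],
# true threshold a_star = 0.5, labels y = 1[x < a_star]).
import math
import random

def erm_threshold(sample):
    """Return an ERM threshold b_S for labeled points (x, y) with y = 1[x < a_star]."""
    positives = [x for x, y in sample if y == 1]
    negatives = [x for x, y in sample if y == 0]
    b0 = max(positives) if positives else float("-inf")
    b1 = min(negatives) if negatives else float("inf")
    # Every threshold a in (b0, b1] attains zero empirical risk.
    if math.isinf(b0) and math.isinf(b1):
        return 0.0                               # empty sample: any threshold is an ERM
    if math.isinf(b1):
        return math.nextafter(b0, math.inf)      # no negatives: any a > b0 works
    if math.isinf(b0):
        return b1                                # no positives: any a <= b1 works
    return (b0 + b1) / 2                         # both present: midpoint of (b0, b1)

def true_risk(b, a_star):
    """L_D(h_b) under Uniform[0, 1]: mass of points where 1[x < b] != 1[x < a_star]."""
    return abs(min(max(b, 0.0), 1.0) - a_star)

if __name__ == "__main__":
    eps, delta, a_star = 0.1, 0.1, 0.5
    m = math.ceil(math.log(2 / delta) / eps)     # sample size suggested by Lemma 6.1
    trials, failures = 2000, 0
    for _ in range(trials):
        xs = [random.random() for _ in range(m)]
        sample = [(x, int(x < a_star)) for x in xs]
        if true_risk(erm_threshold(sample), a_star) > eps:
            failures += 1
    print(f"m = {m}, empirical P[L_D(h_S) > eps] ~= {failures / trials:.3f} (bound: {delta})")
```

The bound of the lemma holds for every ERM threshold in $(b_0, b_1]$; the midpoint is used here only for concreteness, and with $\epsilon = \delta = 0.1$ the proof's guarantee $2e^{-\epsilon m} \approx 0.0996$ sits just below $\delta$.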


6.2 The VC-Dimension


We see, therefore, that while finiteness of $\mathcal{H}$ is a sufficient condition for learnability,
it is not a necessary condition. As we will show, a property called the
VC-dimension of a hypothesis class gives the correct characterization of its learnability.
To motivate the definition of the VC-dimension, let us recall the No-Free-Lunch
theorem (Theorem 5.1) and its proof. There, we have shown that without