Understanding Machine Learning: From Theory to Algorithms

6.2 The VC-Dimension

restricting the hypothesis class, for any learning algorithm, an adversary can
construct a distribution for which the learning algorithm will perform poorly,
while there is another learning algorithm that will succeed on the same
distribution. To do so, the adversary used a finite set C ⊂ X and considered a family
of distributions that are concentrated on elements of C. Each distribution was
derived from a "true" target function from C to {0, 1}. To make any algorithm
fail, the adversary used the power of choosing a target function from the set of
all possible functions from C to {0, 1}.
When considering PAC learnability of a hypothesis class H, the adversary
is restricted to constructing distributions for which some hypothesis h ∈ H
achieves a zero risk. Since we are considering distributions that are concentrated
on elements of C, we should study how H behaves on C, which leads to the
following definition.


Definition 6.2 (Restriction of H to C) Let H be a class of functions from X
to {0, 1} and let C = {c_1, ..., c_m} ⊂ X. The restriction of H to C is the set of
functions from C to {0, 1} that can be derived from H. That is,

H_C = {(h(c_1), ..., h(c_m)) : h ∈ H},

where we represent each function from C to {0, 1} as a vector in {0, 1}^|C|.


If the restriction of H to C is the set of all functions from C to {0, 1}, then
we say that H shatters the set C. Formally:


Definition 6.3 (Shattering) A hypothesis class H shatters a finite set C ⊂ X
if the restriction of H to C is the set of all functions from C to {0, 1}. That is,
|H_C| = 2^|C|.
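For a finite pool of hypotheses, the two definitions above translate directly into code. A minimal sketch (the names `restriction` and `shatters` are illustrative, not from the text; with an infinite class one would first pick finitely many hypotheses realizing all distinct behaviors on C):

```python
def restriction(hypotheses, C):
    # H_C: the set of label vectors (h(c_1), ..., h(c_m)) realized by some h in H.
    return {tuple(h(c) for c in C) for h in hypotheses}

def shatters(hypotheses, C):
    # H shatters C iff H_C contains all labelings of C, i.e. |H_C| = 2^|C|.
    return len(restriction(hypotheses, C)) == 2 ** len(C)

# Toy class: the two constant functions.
H = [lambda x: 0, lambda x: 1]
print(shatters(H, [3]))     # True: both labelings of a single point appear
print(shatters(H, [3, 7]))  # False: the labelings (0, 1) and (1, 0) are missing
```

Representing each function in H_C as a tuple mirrors the vector representation in Definition 6.2, and collecting the tuples in a set automatically discards hypotheses that agree on all of C.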


Example 6.2 Let H be the class of threshold functions over R. Take a set
C = {c_1}. Now, if we take a = c_1 + 1, then we have h_a(c_1) = 1, and if we take
a = c_1 − 1, then we have h_a(c_1) = 0. Therefore, H_C is the set of all functions
from C to {0, 1}, and H shatters C. Now take a set C = {c_1, c_2}, where c_1 ≤ c_2.
No h ∈ H can account for the labeling (0, 1), because any threshold that assigns
the label 0 to c_1 must assign the label 0 to c_2 as well. Therefore not all functions
from C to {0, 1} are included in H_C; hence C is not shattered by H.
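The example can be checked numerically. A sketch, assuming the convention that a threshold hypothesis h_a labels x with 1 iff x < a (the concrete values for c_1, c_2, and the thresholds are illustrative):

```python
def threshold(a):
    # h_a(x) = 1 if x < a, else 0.
    return lambda x: 1 if x < a else 0

def labelings(hypotheses, C):
    # The restriction of the class to C, as a set of label vectors.
    return {tuple(h(c) for c in C) for h in hypotheses}

c1, c2 = 0.0, 1.0  # c1 <= c2

# One point: thresholds on either side of c1 realize both labels.
H1 = [threshold(c1 - 1), threshold(c1 + 1)]
print(sorted(labelings(H1, [c1])))  # [(0,), (1,)] -- C = {c1} is shattered

# Two points: only three labelings are realizable. (0, 1) would require
# c1 >= a and c2 < a simultaneously, impossible since c2 >= c1.
H2 = [threshold(c1 - 1), threshold((c1 + c2) / 2), threshold(c2 + 1)]
print(sorted(labelings(H2, [c1, c2])))  # [(0, 0), (1, 0), (1, 1)]
```

The three thresholds in H2 (below c1, between c1 and c2, above c2) cover every distinct behavior a threshold can have on {c1, c2}, so enlarging H2 cannot produce the missing labeling (0, 1).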


Getting back to the construction of an adversarial distribution as in the proof
of the No-Free-Lunch theorem (Theorem 5.1), we see that whenever some set C
is shattered by H, the adversary is not restricted by H, as they can construct
a distribution over C based on any target function from C to {0, 1}, while still
maintaining the realizability assumption. This immediately yields:


Corollary 6.4 Let H be a hypothesis class of functions from X to {0, 1}. Let
m be a training set size. Assume that there exists a set C ⊂ X of size 2m that is
shattered by H. Then, for any learning algorithm, A, there exist a distribution D
over X × {0, 1} and a predictor h ∈ H such that L_D(h) = 0 but with probability
of at least 1/7 over the choice of S ∼ D^m we have that L_D(A(S)) ≥ 1/8.
