Understanding Machine Learning: From Theory to Algorithms

(Jeff_L) #1
6.5 Proof of Theorem 6.7 73

1.Hhas the uniform convergence property with sample complexity

C 1 d+ log(1/δ)
^2

≤mUCH(,δ)≤C 2 d+ log(1/δ)
^2
2.His agnostic PAC learnable with sample complexity

C 1

d+ log(1/δ)
^2

≤ mH(,δ)≤C 2

d+ log(1/δ)
^2
3.His PAC learnable with sample complexity

C 1 d+ log(1/δ)


≤mH(,δ)≤C 2 dlog(1/) + log(1/δ)

The proof of this theorem is given in Chapter 28.
Remark 6.3 We stated the fundamental theorem for binary classification tasks.
A similar result holds for some other learning problems such as regression with
the absolute loss or the squared loss. However, the theorem does not hold for
all learning tasks. In particular, learnability is sometimes possible even though
the uniform convergence property does not hold (we will see an example in
Chapter 13, Exercise 2). Furthermore, in some situations, the ERM rule fails
but learnability is possible with other learning rules.

6.5 Proof of Theorem 6.7


We have already seen that 1→2 in Chapter 4. The implications 2→3 and
3 →4 are trivial and so is 2→5. The implications 4→6 and 5→6 follow from
the No-Free-Lunch theorem. The difficult part is to show that 6→1. The proof
is based on two main claims:


  • If VCdim(H) =d, then even thoughHmight be infinite, when restricting it
    to a finite setC⊂ X, its “effective” size,|HC|, is onlyO(|C|d). That is,
    the size ofHCgrows polynomially rather than exponentially with|C|. This
    claim is often referred to asSauer’s lemma, but it has also been stated and
    proved independently by Shelah and by Perles. The formal statement is
    given in Section 6.5.1 later.

  • In Section 4 we have shown that finite hypothesis classes enjoy the uniform
    convergence property. In Section 6.5.2 later we generalize this result and
    show that uniform convergence holds whenever the hypothesis class has a
    “small effective size.” By “small effective size” we mean classes for which
    |HC|grows polynomially with|C|.


6.5.1 Sauer’s Lemma and the Growth Function


We defined the notion ofshattering, by considering the restriction ofHto a finite
set of instances. The growth function measures the maximal “effective” size of
Hon a set ofmexamples. Formally:
Free download pdf