Understanding Machine Learning: From Theory to Algorithms


4 Learning via Uniform Convergence


classes is bounded by $\frac{128d + 2\log(2/\delta)}{\epsilon^2}$. This upper bound on the sample complexity has the deficiency of being dependent on the specific representation of real numbers used by our machine. In Chapter 6 we will introduce a rigorous way to analyze the sample complexity of infinite size hypothesis classes. Nevertheless, the discretization trick can be used to get a rough estimate of the sample complexity in many practical situations.
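
For a rough sense of scale, suppose the hypotheses are parameterized by $d = 100$ floating point numbers and take $\epsilon = 0.1$, $\delta = 0.01$; these values are illustrative and not prescribed by the text. The bound then evaluates to
$$\frac{128 \cdot 100 + 2\log(2/0.01)}{(0.1)^2} \;\approx\; 1.3 \times 10^{6},$$
where the $128d$ term dominates and the $2\log(2/\delta)$ term is negligible. Roughly a million examples for a hundred parameters illustrates why this figure should be treated as a crude upper bound rather than a tight estimate.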

4.3 Summary


If the uniform convergence property holds for a hypothesis class $\mathcal{H}$, then in most cases the empirical risks of hypotheses in $\mathcal{H}$ will faithfully represent their true risks. Uniform convergence suffices for agnostic PAC learnability using the ERM rule. We have shown that finite hypothesis classes enjoy the uniform convergence property and are hence agnostic PAC learnable.
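
In a bit more detail, the argument of this chapter compresses into a single chain of inequalities. If the sample $S$ is $\epsilon/2$-representative, meaning that $|L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon/2$ for every $h \in \mathcal{H}$, then for any output $h_S$ of the ERM rule and every $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h_S) \;\le\; L_S(h_S) + \frac{\epsilon}{2} \;\le\; L_S(h) + \frac{\epsilon}{2} \;\le\; L_{\mathcal{D}}(h) + \epsilon,$$
and therefore $L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$. Uniform convergence is exactly the guarantee that, with probability at least $1-\delta$ over the choice of $S$, the sample is indeed $\epsilon/2$-representative.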

4.4 Bibliographic Remarks


Classes of functions for which the uniform convergence property holds are also called Glivenko-Cantelli classes, named after Valery Ivanovich Glivenko and Francesco Paolo Cantelli, who proved the first uniform convergence result in the 1930s. See (Dudley, Giné & Zinn 1991). The relation between uniform convergence and learnability was thoroughly studied by Vapnik; see (Vapnik 1992, Vapnik 1995, Vapnik 1998). In fact, as we will see later in Chapter 6, the fundamental theorem of learning theory states that in binary classification problems, uniform convergence is not only a sufficient condition for learnability but is also a necessary condition. This is not the case for more general learning problems (see (Shalev-Shwartz, Shamir, Srebro & Sridharan 2010)).

4.5 Exercises



  1. In this exercise, we show that the $(\epsilon,\delta)$ requirement on the convergence of errors in our definitions of PAC learning is, in fact, quite close to a simpler-looking requirement about averages (or expectations). Prove that the following two statements are equivalent (for any learning algorithm $A$, any probability distribution $\mathcal{D}$, and any loss function whose range is $[0,1]$); a brief hint follows the two statements:

    1. For every $\epsilon, \delta > 0$, there exists $m(\epsilon, \delta)$ such that for all $m \ge m(\epsilon, \delta)$,
       $$\mathbb{P}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S)) > \epsilon\big] < \delta.$$
    2. $$\lim_{m \to \infty} \mathbb{E}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S))\big] = 0.$$
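
    Hint (one possible route, not necessarily the only one): for the direction from statement 2 to statement 1, apply Markov's inequality to the nonnegative random variable $L_{\mathcal{D}}(A(S))$:
    $$\mathbb{P}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S)) > \epsilon\big] \;\le\; \frac{\mathbb{E}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S))\big]}{\epsilon},$$
    which drops below $\delta$ once $m$ is large enough that the expectation is smaller than $\epsilon\delta$. For the direction from statement 1 to statement 2, split the expectation according to whether $L_{\mathcal{D}}(A(S))$ exceeds $\epsilon$ and use the fact that the loss is bounded by $1$, giving $\mathbb{E}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S))\big] \le \epsilon + \mathbb{P}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S)) > \epsilon\big] \le \epsilon + \delta$ whenever $m \ge m(\epsilon,\delta)$.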