Understanding Machine Learning: From Theory to Algorithms


4 Learning via Uniform Convergence


classes is bounded by $\frac{128d + 2\log(2/\delta)}{\epsilon^2}$. This upper bound on the sample complexity has the deficiency of being dependent on the specific representation of real numbers used by our machine. In Chapter 6 we will introduce a rigorous way to analyze the sample complexity of infinite size hypothesis classes. Nevertheless, the discretization trick can be used to get a rough estimate of the sample complexity in many practical situations.
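
For a rough sense of scale, suppose the hypotheses are parameterized by $d = 100$ floating point numbers and take $\epsilon = 0.1$, $\delta = 0.01$; these values are illustrative and not prescribed by the text. The bound then evaluates to
$$\frac{128 \cdot 100 + 2\log(2/0.01)}{(0.1)^2} \;\approx\; 1.3 \times 10^{6},$$
where the $128d$ term dominates and the $2\log(2/\delta)$ term is negligible. Roughly a million examples for a hundred parameters illustrates why this figure should be treated as a crude upper bound rather than a tight estimate.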

4.3 Summary


If the uniform convergence property holds for a hypothesis class $\mathcal{H}$, then in most cases the empirical risks of hypotheses in $\mathcal{H}$ will faithfully represent their true risks. Uniform convergence suffices for agnostic PAC learnability using the ERM rule. We have shown that finite hypothesis classes enjoy the uniform convergence property and are hence agnostic PAC learnable.
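
In a bit more detail, the argument of this chapter compresses into a single chain of inequalities. If the sample $S$ is $\epsilon/2$-representative, meaning that $|L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon/2$ for every $h \in \mathcal{H}$, then for any output $h_S$ of the ERM rule and every $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h_S) \;\le\; L_S(h_S) + \frac{\epsilon}{2} \;\le\; L_S(h) + \frac{\epsilon}{2} \;\le\; L_{\mathcal{D}}(h) + \epsilon,$$
and therefore $L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$. Uniform convergence is exactly the guarantee that, with probability at least $1-\delta$ over the choice of $S$, the sample is indeed $\epsilon/2$-representative.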

4.4 Bibliographic Remarks


Classes of functions for which the uniform convergence property holds are also called Glivenko-Cantelli classes, named after Valery Ivanovich Glivenko and Francesco Paolo Cantelli, who proved the first uniform convergence result in the 1930s. See (Dudley, Giné & Zinn 1991). The relation between uniform convergence and learnability was thoroughly studied by Vapnik; see (Vapnik 1992, Vapnik 1995, Vapnik 1998). In fact, as we will see later in Chapter 6, the fundamental theorem of learning theory states that in binary classification problems, uniform convergence is not only a sufficient condition for learnability but is also a necessary condition. This is not the case for more general learning problems (see (Shalev-Shwartz, Shamir, Srebro & Sridharan 2010)).

4.5 Exercises



  1. In this exercise, we show that the $(\epsilon,\delta)$ requirement on the convergence of errors in our definitions of PAC learning is, in fact, quite close to a simpler-looking requirement about averages (or expectations). Prove that the following two statements are equivalent (for any learning algorithm $A$, any probability distribution $\mathcal{D}$, and any loss function whose range is $[0,1]$); a brief hint follows the two statements:

    1. For every $\epsilon, \delta > 0$, there exists $m(\epsilon, \delta)$ such that for all $m \ge m(\epsilon, \delta)$,
       $$\mathbb{P}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S)) > \epsilon\big] < \delta.$$
    2. $$\lim_{m \to \infty} \mathbb{E}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S))\big] = 0.$$
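
    Hint (one possible route, not necessarily the only one): for the direction from statement 2 to statement 1, apply Markov's inequality to the nonnegative random variable $L_{\mathcal{D}}(A(S))$:
    $$\mathbb{P}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S)) > \epsilon\big] \;\le\; \frac{\mathbb{E}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S))\big]}{\epsilon},$$
    which drops below $\delta$ once $m$ is large enough that the expectation is smaller than $\epsilon\delta$. For the direction from statement 1 to statement 2, split the expectation according to whether $L_{\mathcal{D}}(A(S))$ exceeds $\epsilon$ and use the fact that the loss is bounded by $1$, giving $\mathbb{E}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S))\big] \le \epsilon + \mathbb{P}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S)) > \epsilon\big] \le \epsilon + \delta$ whenever $m \ge m(\epsilon,\delta)$.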