Understanding Machine Learning: From Theory to Algorithms


Nonuniform Learnability


Concretely, let $H$ be a hypothesis class that can be written as $H = \bigcup_{n \in \mathbb{N}} H_n$. For example, $H$ may be the class of all polynomial classifiers, where each $H_n$ is the class of polynomial classifiers of degree $n$ (see Example 7.1). Assume that for each $n$, the class $H_n$ enjoys the uniform convergence property (see Definition 4.3 in Chapter 4) with a sample complexity function $m^{\mathrm{UC}}_{H_n}(\epsilon, \delta)$. Let us also define the function $\epsilon_n : \mathbb{N} \times (0,1) \to (0,1)$ by

$$\epsilon_n(m, \delta) = \min\{\epsilon \in (0,1) : m^{\mathrm{UC}}_{H_n}(\epsilon, \delta) \le m\}. \qquad (7.1)$$

In words, we have a fixed sample size $m$, and we are interested in the lowest possible upper bound on the gap between empirical and true risks achievable by using a sample of $m$ examples.

From the definitions of uniform convergence and $\epsilon_n$, it follows that for every $m$ and $\delta$, with probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$ we have that

$$\forall h \in H_n, \quad |L_{\mathcal{D}}(h) - L_S(h)| \le \epsilon_n(m, \delta). \qquad (7.2)$$
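To see what $\epsilon_n$ looks like in a concrete case, recall from Chapter 4 that a finite class $H_n$ satisfies uniform convergence with $m^{\mathrm{UC}}_{H_n}(\epsilon, \delta) = \lceil \log(2|H_n|/\delta)/(2\epsilon^2) \rceil$; inverting this bound at a fixed $m$ gives $\epsilon_n(m, \delta) = \sqrt{\log(2|H_n|/\delta)/(2m)}$. The following Python sketch computes this inversion (the function name and example numbers are illustrative, not from the text):

```python
import math

def eps_n(m, delta, class_size):
    """Smallest eps achievable with m examples for a finite class H_n,
    obtained by inverting m_UC(eps, delta) = log(2*|H_n|/delta) / (2*eps^2).
    Illustrative choice of class and bound; any sample complexity
    function m_UC could be inverted the same way, per Equation (7.1)."""
    return math.sqrt(math.log(2 * class_size / delta) / (2 * m))

# With m = 10,000 examples, delta = 0.05, and |H_n| = 1,000, Equation (7.2)
# says that with probability >= 0.95 every h in H_n satisfies
# |L_D(h) - L_S(h)| <= eps_n(m, delta):
print(eps_n(10_000, 0.05, 1_000))  # ~0.023
```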

Let $w : \mathbb{N} \to [0,1]$ be a function such that $\sum_{n=1}^{\infty} w(n) \le 1$. We refer to $w$ as a weight function over the hypothesis classes $H_1, H_2, \ldots$. Such a weight function can reflect the importance that the learner attributes to each hypothesis class, or some measure of the complexity of different hypothesis classes. If $H$ is a finite union of $N$ hypothesis classes, one can simply assign the same weight of $1/N$ to all hypothesis classes. This equal weighting corresponds to no a priori preference for any hypothesis class. Of course, if one believes (as prior knowledge) that a certain hypothesis class is more likely to contain the correct target function, then it should be assigned a larger weight, reflecting this prior knowledge. When $H$ is a countably infinite union of hypothesis classes, a uniform weighting is not possible, but many other weighting schemes work. For example, one can choose $w(n) = \frac{6}{\pi^2 n^2}$ or $w(n) = 2^{-n}$. Later in this chapter we will provide another convenient way to define weighting functions using description languages.
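Both examples in fact sum to exactly 1, since $\sum_{n=1}^{\infty} 1/n^2 = \pi^2/6$ and $\sum_{n=1}^{\infty} 2^{-n} = 1$. A quick numerical sanity check (illustrative code, not from the text):

```python
import math

# Two valid weighting schemes over countably many classes; each
# satisfies sum_n w(n) <= 1 (in fact, both series sum to exactly 1).
def w_poly(n):
    return 6 / (math.pi ** 2 * n ** 2)

def w_exp(n):
    return 2.0 ** (-n)

# Partial sums approach 1 from below:
print(sum(w_poly(n) for n in range(1, 1_000_000)))  # ~0.999999
print(sum(w_exp(n) for n in range(1, 60)))          # ~1.0
```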
The SRM rule follows a “bound minimization” approach. This means that
the goal of the paradigm is to find a hypothesis that minimizes a certain upper
bound on the true risk. The bound that the SRM rule wishes to minimize is
given in the following theorem.
Theorem 7.4  Let $w : \mathbb{N} \to [0,1]$ be a function such that $\sum_{n=1}^{\infty} w(n) \le 1$. Let $H$ be a hypothesis class that can be written as $H = \bigcup_{n \in \mathbb{N}} H_n$, where for each $n$, $H_n$ satisfies the uniform convergence property with a sample complexity function $m^{\mathrm{UC}}_{H_n}$. Let $\epsilon_n$ be as defined in Equation (7.1). Then, for every $\delta \in (0,1)$ and distribution $\mathcal{D}$, with probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$, the following bound holds (simultaneously) for every $n \in \mathbb{N}$ and $h \in H_n$:

$$|L_{\mathcal{D}}(h) - L_S(h)| \le \epsilon_n(m, w(n) \cdot \delta).$$

Therefore, for every $\delta \in (0,1)$ and distribution $\mathcal{D}$, with probability of at least $1 - \delta$, it holds that for every $h \in H$,

$$L_{\mathcal{D}}(h) \le L_S(h) + \min_{n : h \in H_n} \epsilon_n(m, w(n) \cdot \delta).$$
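This last bound is exactly what the SRM rule minimizes: given a sample, output a hypothesis $h$ achieving the smallest value of $L_S(h) + \min_{n : h \in H_n} \epsilon_n(m, w(n) \cdot \delta)$. Below is a minimal Python sketch of this selection, assuming finite classes (so that $\epsilon_n$ has the closed form used earlier) and hypothetical inputs `classes`, `emp_risk`, and `w`; it is an illustration of the rule, not pseudocode from the text:

```python
import math

def srm_select(classes, emp_risk, m, delta, w):
    """SRM sketch: over finite classes H_1, ..., H_k (each a list of
    hypotheses), return the hypothesis minimizing the Theorem 7.4
    bound  L_S(h) + eps_n(m, w(n) * delta).

    classes  -- list of hypothesis classes, classes[n-1] = H_n
    emp_risk -- callable returning the empirical risk L_S(h)
    w        -- weight function with sum_n w(n) <= 1
    """
    def eps(n, H_n):
        # Finite-class inversion of the sample complexity, with the
        # confidence parameter discounted by the class weight w(n).
        return math.sqrt(math.log(2 * len(H_n) / (w(n) * delta)) / (2 * m))

    best_bound, best_h = float("inf"), None
    for n, H_n in enumerate(classes, start=1):
        penalty = eps(n, H_n)  # complexity term: same for all h in H_n
        for h in H_n:
            bound = emp_risk(h) + penalty  # high-prob upper bound on L_D(h)
            if bound < best_bound:
                best_bound, best_h = bound, h
    return best_h, best_bound
```

Note that the penalty term depends only on the class index $n$, so the rule trades empirical fit against class complexity, preferring a simpler class whenever the improvement in empirical risk does not justify the larger $\epsilon_n$.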