Understanding Machine Learning: From Theory to Algorithms


Nonuniform Learnability


Concretely, let $H$ be a hypothesis class that can be written as $H = \bigcup_{n \in \mathbb{N}} H_n$. For example, $H$ may be the class of all polynomial classifiers, where each $H_n$ is the class of polynomial classifiers of degree $n$ (see Example 7.1). Assume that for each $n$, the class $H_n$ enjoys the uniform convergence property (see Definition 4.3 in Chapter 4) with a sample complexity function $m^{\mathrm{UC}}_{H_n}(\epsilon, \delta)$. Let us also define the function $\epsilon_n : \mathbb{N} \times (0,1) \to (0,1)$ by

$$\epsilon_n(m, \delta) = \min\{\epsilon \in (0,1) : m^{\mathrm{UC}}_{H_n}(\epsilon, \delta) \le m\}. \qquad (7.1)$$

In words, we have a fixed sample size $m$, and we are interested in the lowest possible upper bound on the gap between empirical and true risks achievable by using a sample of $m$ examples.

From the definitions of uniform convergence and $\epsilon_n$, it follows that for every $m$ and $\delta$, with probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$ we have that

$$\forall h \in H_n, \quad |L_{\mathcal{D}}(h) - L_S(h)| \le \epsilon_n(m, \delta). \qquad (7.2)$$
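To see what $\epsilon_n$ looks like in a concrete case, recall from Chapter 4 that a finite class $H_n$ satisfies uniform convergence with $m^{\mathrm{UC}}_{H_n}(\epsilon, \delta) = \lceil \log(2|H_n|/\delta)/(2\epsilon^2) \rceil$; inverting this bound at a fixed $m$ gives $\epsilon_n(m, \delta) = \sqrt{\log(2|H_n|/\delta)/(2m)}$. The following Python sketch computes this inversion (the function name and example numbers are illustrative, not from the text):

```python
import math

def eps_n(m, delta, class_size):
    """Smallest eps achievable with m examples for a finite class H_n,
    obtained by inverting m_UC(eps, delta) = log(2*|H_n|/delta) / (2*eps^2).
    Illustrative choice of class and bound; any sample complexity
    function m_UC could be inverted the same way, per Equation (7.1)."""
    return math.sqrt(math.log(2 * class_size / delta) / (2 * m))

# With m = 10,000 examples, delta = 0.05, and |H_n| = 1,000, Equation (7.2)
# says that with probability >= 0.95 every h in H_n satisfies
# |L_D(h) - L_S(h)| <= eps_n(m, delta):
print(eps_n(10_000, 0.05, 1_000))  # ~0.023
```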

Let $w : \mathbb{N} \to [0,1]$ be a function such that $\sum_{n=1}^{\infty} w(n) \le 1$. We refer to $w$ as a weight function over the hypothesis classes $H_1, H_2, \ldots$. Such a weight function can reflect the importance that the learner attributes to each hypothesis class, or some measure of the complexity of different hypothesis classes. If $H$ is a finite union of $N$ hypothesis classes, one can simply assign the same weight of $1/N$ to all hypothesis classes. This equal weighting corresponds to no a priori preference for any hypothesis class. Of course, if one believes (as prior knowledge) that a certain hypothesis class is more likely to contain the correct target function, then it should be assigned a larger weight, reflecting this prior knowledge. When $H$ is a countably infinite union of hypothesis classes, a uniform weighting is not possible, but many other weighting schemes work. For example, one can choose $w(n) = \frac{6}{\pi^2 n^2}$ or $w(n) = 2^{-n}$. Later in this chapter we will provide another convenient way to define weighting functions using description languages.
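Both examples in fact sum to exactly 1, since $\sum_{n=1}^{\infty} 1/n^2 = \pi^2/6$ and $\sum_{n=1}^{\infty} 2^{-n} = 1$. A quick numerical sanity check (illustrative code, not from the text):

```python
import math

# Two valid weighting schemes over countably many classes; each
# satisfies sum_n w(n) <= 1 (in fact, both series sum to exactly 1).
def w_poly(n):
    return 6 / (math.pi ** 2 * n ** 2)

def w_exp(n):
    return 2.0 ** (-n)

# Partial sums approach 1 from below:
print(sum(w_poly(n) for n in range(1, 1_000_000)))  # ~0.999999
print(sum(w_exp(n) for n in range(1, 60)))          # ~1.0
```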
The SRM rule follows a “bound minimization” approach. This means that
the goal of the paradigm is to find a hypothesis that minimizes a certain upper
bound on the true risk. The bound that the SRM rule wishes to minimize is
given in the following theorem.
Theorem 7.4  Let $w : \mathbb{N} \to [0,1]$ be a function such that $\sum_{n=1}^{\infty} w(n) \le 1$. Let $H$ be a hypothesis class that can be written as $H = \bigcup_{n \in \mathbb{N}} H_n$, where for each $n$, $H_n$ satisfies the uniform convergence property with a sample complexity function $m^{\mathrm{UC}}_{H_n}$. Let $\epsilon_n$ be as defined in Equation (7.1). Then, for every $\delta \in (0,1)$ and distribution $\mathcal{D}$, with probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$, the following bound holds (simultaneously) for every $n \in \mathbb{N}$ and $h \in H_n$:

$$|L_{\mathcal{D}}(h) - L_S(h)| \le \epsilon_n(m, w(n) \cdot \delta).$$

Therefore, for every $\delta \in (0,1)$ and distribution $\mathcal{D}$, with probability of at least $1 - \delta$, it holds that for every $h \in H$,

$$L_{\mathcal{D}}(h) \le L_S(h) + \min_{n : h \in H_n} \epsilon_n(m, w(n) \cdot \delta).$$
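This last bound is exactly what the SRM rule minimizes: given a sample, output a hypothesis $h$ achieving the smallest value of $L_S(h) + \min_{n : h \in H_n} \epsilon_n(m, w(n) \cdot \delta)$. Below is a minimal Python sketch of this selection, assuming finite classes (so that $\epsilon_n$ has the closed form used earlier) and hypothetical inputs `classes`, `emp_risk`, and `w`; it is an illustration of the rule, not pseudocode from the text:

```python
import math

def srm_select(classes, emp_risk, m, delta, w):
    """SRM sketch: over finite classes H_1, ..., H_k (each a list of
    hypotheses), return the hypothesis minimizing the Theorem 7.4
    bound  L_S(h) + eps_n(m, w(n) * delta).

    classes  -- list of hypothesis classes, classes[n-1] = H_n
    emp_risk -- callable returning the empirical risk L_S(h)
    w        -- weight function with sum_n w(n) <= 1
    """
    def eps(n, H_n):
        # Finite-class inversion of the sample complexity, with the
        # confidence parameter discounted by the class weight w(n).
        return math.sqrt(math.log(2 * len(H_n) / (w(n) * delta)) / (2 * m))

    best_bound, best_h = float("inf"), None
    for n, H_n in enumerate(classes, start=1):
        penalty = eps(n, H_n)  # complexity term: same for all h in H_n
        for h in H_n:
            bound = emp_risk(h) + penalty  # high-prob upper bound on L_D(h)
            if bound < best_bound:
                best_bound, best_h = bound, h
    return best_h, best_bound
```

Note that the penalty term depends only on the class index $n$, so the rule trades empirical fit against class complexity, preferring a simpler class whenever the improvement in empirical risk does not justify the larger $\epsilon_n$.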