
homogeneous case). Let A be any deterministic algorithm.^1 Assume, by way of contradiction, that A is a successful PAC learner for this problem. That is, there exists a function m(·,·) such that for every distribution D and for every ε, δ, if A receives a training set of size m ≥ m(ε, δ), it should output, with probability of at least 1 − δ, a hypothesis ŵ = A(S) such that L_D(ŵ) − min_w L_D(w) ≤ ε.

Choose ε = 1/100, δ = 1/2, let m ≥ m(ε, δ), and set μ = log(100/99)/(2m). We will define two distributions, and will show that A is likely to fail on at least one of them. The first distribution, D_1, is supported on two examples, z_1 = (1, 0) and z_2 = (μ, −1), where the probability mass of the first example is μ while the probability mass of the second example is 1 − μ. The second distribution, D_2, is supported entirely on z_2.
Observe that for both distributions, the probability that all examples of the training set will be of the second type is at least 99%. This is trivially true for D_2, whereas for D_1 the probability of this event is

(1 − μ)^m ≥ e^{−2μm} = 0.99.
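As a quick numeric sanity check (not part of the original argument), the following sketch verifies that with μ = log(100/99)/(2m) the probability (1 − μ)^m of drawing only copies of z_2 is indeed at least e^{−2μm} = 0.99; the sample sizes used are arbitrary illustrations.

```python
import math

# Sanity check of the displayed inequality: with mu = log(100/99) / (2m),
# (1 - mu)^m >= exp(-2*mu*m) = 99/100 for any sample size m.
for m in [1, 10, 100, 10_000]:            # arbitrary sample sizes for illustration
    mu = math.log(100 / 99) / (2 * m)     # the choice of mu made in the proof
    p_all_z2 = (1 - mu) ** m              # probability that every example equals z2 = (mu, -1)
    lower = math.exp(-2 * mu * m)         # equals 99/100 by the choice of mu
    assert p_all_z2 >= lower and abs(lower - 0.99) < 1e-12
    print(f"m={m:6d}  mu={mu:.3e}  (1-mu)^m={p_all_z2:.6f}")
```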

Since we assume that A is a deterministic algorithm, upon receiving a training set of m examples, each of which is (μ, −1), the algorithm will output some ŵ. Now, if ŵ < −1/(2μ), we will set the distribution to be D_1. Hence,

L_{D_1}(ŵ) ≥ μ ŵ^2 ≥ 1/(4μ).

However,

min_w L_{D_1}(w) ≤ L_{D_1}(0) = 1 − μ.

It follows that

L_{D_1}(ŵ) − min_w L_{D_1}(w) ≥ 1/(4μ) − (1 − μ) > ε.
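As an illustrative check (again not from the text), one can confirm that this lower bound dwarfs ε = 1/100: since 1/(4μ) = m/(2 log(100/99)), it is already about 50 for m = 1 and grows linearly in m.

```python
import math

# Numeric check (illustration only): the lower bound 1/(4*mu) - (1 - mu) on the
# excess risk under D1 is far larger than eps = 1/100 for every sample size m.
eps = 1 / 100
for m in [1, 10, 100, 10_000]:            # arbitrary sample sizes
    mu = math.log(100 / 99) / (2 * m)
    gap = 1 / (4 * mu) - (1 - mu)         # lower bound on L_D1(w_hat) - min_w L_D1(w)
    assert gap > eps
    print(f"m={m:6d}  gap={gap:.2f}  >  eps={eps}")
```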

Therefore, such an algorithm A fails on D_1. On the other hand, if ŵ ≥ −1/(2μ) then we will set the distribution to be D_2. Then we have that L_{D_2}(ŵ) ≥ 1/4 (since ŵμ ≥ −1/2 implies (ŵμ + 1)^2 ≥ 1/4) while min_w L_{D_2}(w) = 0, so A fails on D_2. In summary, we have shown that for every A there exists a distribution on which A fails, which implies that the problem is not PAC learnable.
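To make the adversarial construction concrete, here is a minimal simulation sketch (not from the text; the function name, variable names, and sample size are illustrative). It uses ERM as a stand-in for the deterministic algorithm A: on a sample consisting only of copies of z_2 = (μ, −1), ERM returns ŵ = −1/μ; the adversary then picks D_1 or D_2 according to the rule above, and the resulting excess risk exceeds ε.

```python
import math

def true_risk(w, dist, mu):
    """Expected squared loss of the homogeneous predictor x -> w*x under D1 or D2."""
    z1, z2 = (1.0, 0.0), (mu, -1.0)
    pairs = [(mu, z1), (1 - mu, z2)] if dist == "D1" else [(1.0, z2)]
    return sum(p * (w * x - y) ** 2 for p, (x, y) in pairs)

eps, m = 1 / 100, 50                       # eps from the proof; m is an arbitrary demo value
mu = math.log(100 / 99) / (2 * m)

# ERM (our stand-in for A) on a sample of m copies of z2 = (mu, -1) returns w_hat = -1/mu.
w_hat = -1 / mu
dist = "D1" if w_hat < -1 / (2 * mu) else "D2"   # the adversary's choice of distribution

# Upper bound on min_w L_D(w): evaluate two candidate minimizers, so the printed
# gap is a *lower* bound on the true excess risk.
best = min(true_risk(w, dist, mu) for w in (0.0, -1 / mu))
print(dist, true_risk(w_hat, dist, mu) - best > eps)   # prints: D1 True
```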


A possible solution to this problem is to add another constraint on the hypothesis class. In addition to the convexity requirement, we require that H be bounded; namely, we assume that for some predefined scalar B, every hypothesis w ∈ H satisfies ‖w‖ ≤ B.
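In practice such a constraint is typically enforced by projecting onto the ball of radius B; the following minimal sketch (the function name and values are illustrative, not from the text) computes the Euclidean projection.

```python
import numpy as np

def project_onto_ball(w, B):
    """Euclidean projection of w onto the set {v : ||v|| <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else (B / norm) * w

w = np.array([3.0, 4.0])                 # ||w|| = 5
print(project_onto_ball(w, B=1.0))       # rescaled onto the unit ball: [0.6 0.8]
```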
Boundedness and convexity alone are still not sufficient for ensuring that the
problem is learnable, as the following example demonstrates.


Example 12.9 As in Example 12.8, consider a regression problem with the squared loss. However, this time let H = {w : |w| ≤ 1} ⊂ R be a bounded


^1 Namely, given S the output of A is determined. This requirement is for the sake of simplicity. A slightly more involved argument will show that nondeterministic algorithms will also fail to learn the problem.
