12.2 Convex Learning Problems 165
homogenous case). LetAbe any deterministic algorithm.^1 Assume, by way of
contradiction, thatAis a successful PAC learner for this problem. That is, there
exists a functionm(·,·), such that for every distributionDand for every,δif
Areceives a training set of sizem≥m(,δ), it should output, with probability
of at least 1−δ, a hypothesis ˆw=A(S), such thatLD( ˆw)−minwLD(w)≤.
Choose= 1/ 100 ,δ= 1/2, letm≥m(,δ), and setμ=log(100 2 m/99). We will
define two distributions, and will show thatAis likely to fail on at least one
of them. The first distribution,D 1 , is supported on two examples,z 1 = (1,0)
andz 2 = (μ,−1), where the probability mass of the first example isμwhile the
probability mass of the second example is 1−μ. The second distribution,D 2 , is
supported entirely onz 2.
Observe that for both distributions, the probability that all examples of the
training set will be of the second type is at least 99%. This is trivially true for
D 2 , whereas forD 1 , the probability of this event is
(1−μ)m≥e−^2 μm= 0. 99.
Since we assume thatAis a deterministic algorithm, upon receiving a training
set ofmexamples, each of which is (μ,−1), the algorithm will output some ˆw.
Now, if ˆw <− 1 /(2μ), we will set the distribution to beD 1. Hence,
LD 1 ( ˆw)≥μ( ˆw)^2 ≥ 1 /(4μ).
However,
minw LD 1 (w)≤LD 1 (0) = (1−μ).
It follows that
LD 1 ( ˆw)−minw LD 1 (w)≥
1
4 μ
−(1−μ)> .
Therefore, such algorithmAfails onD 1. On the other hand, if ˆw≥ − 1 /(2μ)
then we’ll set the distribution to beD 2. Then we have thatLD 2 ( ˆw)≥ 1 /4 while
minwLD 2 (w) = 0, soAfails onD 2. In summary, we have shown that for every
Athere exists a distribution on whichAfails, which implies that the problem is
not PAC learnable.
A possible solution to this problem is to add another constraint on the hypoth-
esis class. In addition to the convexity requirement, we require thatHwill be
bounded; namely, we assume that for some predefined scalarB, every hypothesis
w∈Hsatisfies‖w‖≤B.
Boundedness and convexity alone are still not sufficient for ensuring that the
problem is learnable, as the following example demonstrates.
Example 12.9 As in Example 12.8, consider a regression problem with the
squared loss. However, this time letH={w:|w| ≤ 1 } ⊂Rbe a bounded
(^1) Namely, givenSthe output ofAis determined. This requirement is for the sake of
simplicity. A slightly more involved argument will show that nondeterministic algorithms
will also fail to learn the problem.