Understanding Machine Learning: From Theory to Algorithms

19.2 Analysis

Since the number of boxes is $r = (2/\epsilon)^d$ we get that

\[
\mathbb{E}_{S,x}\!\left[\|x - x_{\pi_1(x)}\|\right] \le \sqrt{d}\left(\frac{(2/\epsilon)^d}{m e} + \epsilon\right).
\]

Combining the preceding with Lemma 19.1 we obtain that


\[
\mathbb{E}_S\!\left[L_{\mathcal{D}}(h_S)\right] \le 2\,L_{\mathcal{D}}(h^\star) + c\sqrt{d}\left(\frac{(2/\epsilon)^d}{m e} + \epsilon\right).
\]

Finally, setting $\epsilon = 2m^{-1/(d+1)}$ and noting that

\[
\frac{(2/\epsilon)^d}{m e} + \epsilon
= \frac{2^d\, 2^{-d}\, m^{d/(d+1)}}{m e} + 2m^{-1/(d+1)}
= m^{-1/(d+1)}\left(\frac{1}{e} + 2\right)
\le 4\, m^{-1/(d+1)}
\]

we conclude our proof.
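Spelling out the conclusion, the two displays above combine to give the bound explicitly:

\[
\mathbb{E}_S\!\left[L_{\mathcal{D}}(h_S)\right] \le 2\,L_{\mathcal{D}}(h^\star) + 4c\sqrt{d}\, m^{-1/(d+1)},
\]

so the excess-error term decays at the rate $m^{-1/(d+1)}$.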


The theorem implies that if we first fix the data-generating distribution and
then let $m$ go to infinity, then the error of the 1-NN rule converges to twice the
Bayes error. The analysis can be generalized to larger values of $k$, showing that
the expected error of the $k$-NN rule converges to $\left(1 + \sqrt{8/k}\right)$ times the error of
the Bayes classifier. This is formalized in Theorem 19.5, whose proof is left as a
guided exercise.
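To make the rule under discussion concrete, the following is a minimal sketch of the $k$-NN predictor (1-NN being the case $k = 1$), assuming Euclidean distance and a majority vote over labels in $\{0,1\}$; the function and variable names are illustrative and not taken from the text.

    import numpy as np

    def knn_predict(X_train, y_train, x, k=1):
        """Predict a {0,1} label for x by a majority vote among the k
        training points nearest to x in Euclidean distance."""
        dists = np.linalg.norm(X_train - x, axis=1)   # distances to all m training points
        nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
        return int(np.mean(y_train[nearest]) >= 0.5)  # majority vote (ties go to label 1)

The theorems above bound the expected true error of this predictor when it is applied with a training sample drawn i.i.d. from the underlying distribution.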


19.2.2 The “Curse of Dimensionality”


The upper bound given in Theorem 19.3 grows with $c$ (the Lipschitz coefficient
of $\eta$) and with $d$, the Euclidean dimension of the domain set $\mathcal{X}$. In fact, it is easy
to see that a necessary condition for the last term in Theorem 19.3 to be smaller
than $\epsilon$ is that $m \ge (4c\sqrt{d}/\epsilon)^{d+1}$. That is, the size of the training set should
increase exponentially with the dimension. The following theorem tells us that
this is not just an artifact of our upper bound, but, for some distributions, this
amount of examples is indeed necessary for learning with the NN rule.
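As a rough numerical illustration of this requirement (with illustrative values $c = 1$ and $\epsilon = 0.1$, which are not taken from the text), the threshold $(4c\sqrt{d}/\epsilon)^{d+1}$ grows from $1600$ at $d = 1$ to roughly $10^{23}$ at $d = 10$:

    import math

    # Sketch of the sample-size threshold from the discussion above:
    # m >= (4 * c * sqrt(d) / eps)^(d + 1) is needed for the last term of
    # Theorem 19.3 to drop below eps. Values of c and eps are illustrative.
    def required_m(d, c=1.0, eps=0.1):
        return (4.0 * c * math.sqrt(d) / eps) ** (d + 1)

    for d in (1, 2, 5, 10):
        print(f"d = {d:2d}: m >= {required_m(d):.2e}")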


Theorem 19.4 For any $c > 1$, and every learning rule, $L$, there exists a
distribution over $[0,1]^d \times \{0,1\}$, such that $\eta(x)$ is $c$-Lipschitz, the Bayes error of
the distribution is $0$, but for sample sizes $m \le (c+1)^d/2$, the true error of the
rule $L$ is greater than $1/4$.


Proof Fix any values of $c$ and $d$. Let $G_c^d$ be the grid on $[0,1]^d$ with distance of
$1/c$ between points on the grid. That is, each point on the grid is of the form
$(a_1/c, \ldots, a_d/c)$ where each $a_i$ is in $\{0, 1, \ldots, c\}$. Note that, since any two distinct
points on this grid are at least $1/c$ apart, any function $\eta : G_c^d \to [0,1]$ is a
$c$-Lipschitz function. It follows that the set of all $c$-Lipschitz functions over $G_c^d$
contains the set of all binary valued functions over that domain. We can therefore
invoke the No-Free-Lunch result (Theorem 5.1) to obtain a lower bound on the
needed sample sizes for learning that class. The number of points on the grid is
$(c+1)^d$; hence, if $m < (c+1)^d/2$, Theorem 5.1 implies the lower bound we are
after.
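To spell out the Lipschitz claim used in the proof (a one-line check that is not in the original text): any two distinct grid points $x, x' \in G_c^d$ satisfy $\|x - x'\| \ge 1/c$, while any $\eta$ taking values in $[0,1]$ satisfies $|\eta(x) - \eta(x')| \le 1$, hence

\[
|\eta(x) - \eta(x')| \le 1 = c \cdot \frac{1}{c} \le c\,\|x - x'\|,
\]

which is exactly the $c$-Lipschitz condition on $G_c^d$.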
