Understanding Machine Learning: From Theory to Algorithms

19.2 Analysis

Proof  Since $L_{\mathcal{D}}(h_S) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbf{1}_{[h_S(x)\neq y]}\big]$, we obtain that $\mathbb{E}_S[L_{\mathcal{D}}(h_S)]$ is the probability of sampling a training set $S$ and an additional example $(x,y)$ such that the label of $\pi_1(x)$ differs from $y$. In other words, we can first sample $m$ unlabeled examples, $S_x = (x_1,\ldots,x_m)$, according to $\mathcal{D}_{\mathcal{X}}$, and an additional unlabeled example $x \sim \mathcal{D}_{\mathcal{X}}$, then let $\pi_1(x)$ denote the nearest neighbor of $x$ in $S_x$, and finally sample $y \sim \eta(x)$ and $y' \sim \eta(\pi_1(x))$. It follows that
\[
\mathbb{E}_S[L_{\mathcal{D}}(h_S)]
= \mathbb{E}_{S_x\sim\mathcal{D}_{\mathcal{X}}^m,\;x\sim\mathcal{D}_{\mathcal{X}},\;y\sim\eta(x),\;y'\sim\eta(\pi_1(x))}\!\left[\mathbf{1}_{[y\neq y']}\right]
= \mathbb{E}_{S_x\sim\mathcal{D}_{\mathcal{X}}^m,\;x\sim\mathcal{D}_{\mathcal{X}}}\!\left[\;\mathbb{P}_{y\sim\eta(x),\;y'\sim\eta(\pi_1(x))}[y\neq y']\;\right].
\tag{19.2}
\]
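The sampling description above can also be read as a simulation recipe. The following Python sketch is a minimal illustration, not part of the text: it assumes a hypothetical setup with $\mathcal{D}_{\mathcal{X}}$ uniform on $[0,1]$, a hypothetical Lipschitz conditional probability $\eta(x) = 0.3 + 0.4x$, and $m = 50$, and it Monte-Carlo estimates the left-hand side of Equation (19.2) by sampling $S_x$, then $x$, then the two labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical choices for illustration only: X = [0, 1], D_X uniform,
# and eta(x) = P[y = 1 | x] a c-Lipschitz function (here c = 0.4).
def eta(x):
    return 0.3 + 0.4 * x

m, trials = 50, 20000
disagree = 0
for _ in range(trials):
    Sx = rng.random(m)                    # m unlabeled points, S_x ~ D_X^m
    x = rng.random()                      # an additional point x ~ D_X
    nn = Sx[np.argmin(np.abs(Sx - x))]    # x_{pi_1(x)}: nearest neighbor of x in S_x
    y = rng.random() < eta(x)             # y  ~ eta(x)
    y_nn = rng.random() < eta(nn)         # y' ~ eta(pi_1(x))
    disagree += (y != y_nn)
print("Monte Carlo estimate of E_S[L_D(h_S)]:", disagree / trials)
```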

We next upper bound $\mathbb{P}_{y\sim\eta(x),\,y'\sim\eta(x')}[y\neq y']$ for any two domain points $x, x'$:


\begin{align*}
\mathbb{P}_{y\sim\eta(x),\,y'\sim\eta(x')}[y\neq y']
&= \eta(x')(1-\eta(x)) + (1-\eta(x'))\,\eta(x) \\
&= \big(\eta(x)-\eta(x)+\eta(x')\big)(1-\eta(x)) + \big(1-\eta(x)+\eta(x)-\eta(x')\big)\,\eta(x) \\
&= 2\,\eta(x)(1-\eta(x)) + \big(\eta(x)-\eta(x')\big)\big(2\eta(x)-1\big).
\end{align*}

Using $|2\eta(x)-1|\le 1$ and the assumption that $\eta$ is $c$-Lipschitz, we obtain that the probability is at most:


\[
\mathbb{P}_{y\sim\eta(x),\,y'\sim\eta(x')}[y\neq y'] \;\le\; 2\,\eta(x)(1-\eta(x)) + c\,\|x-x'\|.
\]
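As a quick numerical sanity check of this inequality, one can evaluate both sides on random pairs $x, x'$. The snippet below assumes the same illustrative linear, $c$-Lipschitz $\eta$ as before (the specific $\eta$ and the constant $c = 0.4$ are assumptions, not from the text).

```python
import numpy as np

rng = np.random.default_rng(1)

c = 0.4
def eta(x):
    return 0.3 + c * x        # an assumed c-Lipschitz eta on [0, 1]

# Check that  eta(x')(1 - eta(x)) + (1 - eta(x'))*eta(x)
#             <=  2*eta(x)*(1 - eta(x)) + c*|x - x'|  on random pairs.
for _ in range(100_000):
    x, xp = rng.random(2)
    p_disagree = eta(xp) * (1 - eta(x)) + (1 - eta(xp)) * eta(x)
    bound = 2 * eta(x) * (1 - eta(x)) + c * abs(x - xp)
    assert p_disagree <= bound + 1e-12
print("inequality held on all sampled pairs")
```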

Plugging this into Equation (19.2), we conclude that


\[
\mathbb{E}_S[L_{\mathcal{D}}(h_S)] \;\le\; \mathbb{E}_x\big[2\,\eta(x)(1-\eta(x))\big] + c\,\mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big].
\]

Finally, the error of the Bayes optimal classifier is


\[
L_{\mathcal{D}}(h^\star) = \mathbb{E}_x\big[\min\{\eta(x),\,1-\eta(x)\}\big] \;\ge\; \mathbb{E}_x\big[\eta(x)(1-\eta(x))\big],
\]
where the inequality holds because $\eta(x)(1-\eta(x)) \le \min\{\eta(x),\,1-\eta(x)\}$ for every $\eta(x)\in[0,1]$ (each factor is at most $1$).

Combining the preceding two inequalities concludes our proof.
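Spelled out, that combination is simply the chain obtained by putting the two displayed bounds together:
\begin{align*}
\mathbb{E}_S[L_{\mathcal{D}}(h_S)]
&\le \mathbb{E}_x\big[2\,\eta(x)(1-\eta(x))\big] + c\,\mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big] \\
&\le 2\,\mathbb{E}_x\big[\min\{\eta(x),\,1-\eta(x)\}\big] + c\,\mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big] \\
&= 2\,L_{\mathcal{D}}(h^\star) + c\,\mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big].
\end{align*}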


The next step is to bound the expected distance between a random $x$ and its closest element in $S$. We first need the following general probability lemma. The lemma bounds the probability weight of subsets that are not hit by a random sample, as a function of the size of that sample.


Lemma 19.2  Let $C_1,\ldots,C_r$ be a collection of subsets of some domain set, $\mathcal{X}$. Let $S$ be a sequence of $m$ points sampled i.i.d. according to some probability distribution, $\mathcal{D}$, over $\mathcal{X}$. Then,


\[
\mathbb{E}_{S\sim\mathcal{D}^m}\Bigg[\sum_{i:\,C_i\cap S=\emptyset} \mathbb{P}[C_i]\Bigg] \;\le\; \frac{r}{m e}.
\]
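For a rough numerical illustration of the lemma's bound, one can simulate a hypothetical instance in which $\mathcal{D}$ is uniform on $[0,1)$ and $C_1,\ldots,C_r$ are the $r$ equal-length cells of that interval; the instance, $r$, $m$, and the trial count below are all assumptions chosen only to make the check concrete.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical instance: D uniform on [0, 1), C_1,...,C_r the r equal cells.
r, m, trials = 20, 100, 5000
missed = 0.0
for _ in range(trials):
    S = rng.random(m)                          # m i.i.d. samples from D
    hit = np.zeros(r, dtype=bool)
    hit[np.floor(S * r).astype(int)] = True    # mark cells containing a sample
    missed += (~hit).sum() / r                 # sum of P[C_i] over cells missed by S
print("estimate of E[ sum over unhit cells of P[C_i] ]:", missed / trials)
print("bound r/(m e):", r / (m * np.e))
```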