Understanding Machine Learning: From Theory to Algorithms

19.2 Analysis

Proof  Since $L_{\mathcal{D}}(h_S) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbf{1}_{[h_S(x)\neq y]}\big]$, we obtain that $\mathbb{E}_S[L_{\mathcal{D}}(h_S)]$ is the probability of sampling a training set $S$ and an additional example $(x,y)$ such that the label of $\pi_1(x)$ differs from $y$. In other words, we can first sample $m$ unlabeled examples, $S_x = (x_1,\ldots,x_m)$, according to $\mathcal{D}_{\mathcal{X}}$, and an additional unlabeled example $x \sim \mathcal{D}_{\mathcal{X}}$, then let $\pi_1(x)$ denote the nearest neighbor of $x$ in $S_x$, and finally sample $y \sim \eta(x)$ and $y' \sim \eta(\pi_1(x))$. It follows that
\[
\mathbb{E}_S[L_{\mathcal{D}}(h_S)]
= \mathbb{E}_{S_x\sim\mathcal{D}_{\mathcal{X}}^m,\;x\sim\mathcal{D}_{\mathcal{X}},\;y\sim\eta(x),\;y'\sim\eta(\pi_1(x))}\!\left[\mathbf{1}_{[y\neq y']}\right]
= \mathbb{E}_{S_x\sim\mathcal{D}_{\mathcal{X}}^m,\;x\sim\mathcal{D}_{\mathcal{X}}}\!\left[\;\mathbb{P}_{y\sim\eta(x),\;y'\sim\eta(\pi_1(x))}[y\neq y']\;\right].
\tag{19.2}
\]
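The sampling description above can also be read as a simulation recipe. The following Python sketch is a minimal illustration, not part of the text: it assumes a hypothetical setup with $\mathcal{D}_{\mathcal{X}}$ uniform on $[0,1]$, a hypothetical Lipschitz conditional probability $\eta(x) = 0.3 + 0.4x$, and $m = 50$, and it Monte-Carlo estimates the left-hand side of Equation (19.2) by sampling $S_x$, then $x$, then the two labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical choices for illustration only: X = [0, 1], D_X uniform,
# and eta(x) = P[y = 1 | x] a c-Lipschitz function (here c = 0.4).
def eta(x):
    return 0.3 + 0.4 * x

m, trials = 50, 20000
disagree = 0
for _ in range(trials):
    Sx = rng.random(m)                    # m unlabeled points, S_x ~ D_X^m
    x = rng.random()                      # an additional point x ~ D_X
    nn = Sx[np.argmin(np.abs(Sx - x))]    # x_{pi_1(x)}: nearest neighbor of x in S_x
    y = rng.random() < eta(x)             # y  ~ eta(x)
    y_nn = rng.random() < eta(nn)         # y' ~ eta(pi_1(x))
    disagree += (y != y_nn)
print("Monte Carlo estimate of E_S[L_D(h_S)]:", disagree / trials)
```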

We next upper bound $\mathbb{P}_{y\sim\eta(x),\,y'\sim\eta(x')}[y\neq y']$ for any two domain points $x, x'$:


\begin{align*}
\mathbb{P}_{y\sim\eta(x),\,y'\sim\eta(x')}[y\neq y']
&= \eta(x')(1-\eta(x)) + (1-\eta(x'))\,\eta(x) \\
&= \big(\eta(x)-\eta(x)+\eta(x')\big)(1-\eta(x)) + \big(1-\eta(x)+\eta(x)-\eta(x')\big)\,\eta(x) \\
&= 2\,\eta(x)(1-\eta(x)) + \big(\eta(x)-\eta(x')\big)\big(2\eta(x)-1\big).
\end{align*}

Using $|2\eta(x)-1|\le 1$ and the assumption that $\eta$ is $c$-Lipschitz, we obtain that the probability is at most:


\[
\mathbb{P}_{y\sim\eta(x),\,y'\sim\eta(x')}[y\neq y'] \;\le\; 2\,\eta(x)(1-\eta(x)) + c\,\|x-x'\|.
\]
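As a quick numerical sanity check of this inequality, one can evaluate both sides on random pairs $x, x'$. The snippet below assumes the same illustrative linear, $c$-Lipschitz $\eta$ as before (the specific $\eta$ and the constant $c = 0.4$ are assumptions, not from the text).

```python
import numpy as np

rng = np.random.default_rng(1)

c = 0.4
def eta(x):
    return 0.3 + c * x        # an assumed c-Lipschitz eta on [0, 1]

# Check that  eta(x')(1 - eta(x)) + (1 - eta(x'))*eta(x)
#             <=  2*eta(x)*(1 - eta(x)) + c*|x - x'|  on random pairs.
for _ in range(100_000):
    x, xp = rng.random(2)
    p_disagree = eta(xp) * (1 - eta(x)) + (1 - eta(xp)) * eta(x)
    bound = 2 * eta(x) * (1 - eta(x)) + c * abs(x - xp)
    assert p_disagree <= bound + 1e-12
print("inequality held on all sampled pairs")
```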

Plugging this into Equation (19.2), we conclude that


\[
\mathbb{E}_S[L_{\mathcal{D}}(h_S)] \;\le\; \mathbb{E}_x\big[2\,\eta(x)(1-\eta(x))\big] + c\,\mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big].
\]

Finally, the error of the Bayes optimal classifier is


\[
L_{\mathcal{D}}(h^\star) = \mathbb{E}_x\big[\min\{\eta(x),\,1-\eta(x)\}\big] \;\ge\; \mathbb{E}_x\big[\eta(x)(1-\eta(x))\big],
\]
where the inequality holds because $\eta(x)(1-\eta(x)) \le \min\{\eta(x),\,1-\eta(x)\}$ for every $\eta(x)\in[0,1]$ (each factor is at most $1$).

Combining the preceding two inequalities concludes our proof.
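Spelled out, that combination is simply the chain obtained by putting the two displayed bounds together:
\begin{align*}
\mathbb{E}_S[L_{\mathcal{D}}(h_S)]
&\le \mathbb{E}_x\big[2\,\eta(x)(1-\eta(x))\big] + c\,\mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big] \\
&\le 2\,\mathbb{E}_x\big[\min\{\eta(x),\,1-\eta(x)\}\big] + c\,\mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big] \\
&= 2\,L_{\mathcal{D}}(h^\star) + c\,\mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big].
\end{align*}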


The next step is to bound the expected distance between a random $x$ and its closest element in $S$. We first need the following general probability lemma. The lemma bounds the probability weight of subsets that are not hit by a random sample, as a function of the size of that sample.


Lemma 19.2  Let $C_1,\ldots,C_r$ be a collection of subsets of some domain set, $\mathcal{X}$. Let $S$ be a sequence of $m$ points sampled i.i.d. according to some probability distribution, $\mathcal{D}$, over $\mathcal{X}$. Then,


\[
\mathbb{E}_{S\sim\mathcal{D}^m}\Bigg[\sum_{i:\,C_i\cap S=\emptyset} \mathbb{P}[C_i]\Bigg] \;\le\; \frac{r}{m e}.
\]
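For a rough numerical illustration of the lemma's bound, one can simulate a hypothetical instance in which $\mathcal{D}$ is uniform on $[0,1)$ and $C_1,\ldots,C_r$ are the $r$ equal-length cells of that interval; the instance, $r$, $m$, and the trial count below are all assumptions chosen only to make the check concrete.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical instance: D uniform on [0, 1), C_1,...,C_r the r equal cells.
r, m, trials = 20, 100, 5000
missed = 0.0
for _ in range(trials):
    S = rng.random(m)                          # m i.i.d. samples from D
    hit = np.zeros(r, dtype=bool)
    hit[np.floor(S * r).astype(int)] = True    # mark cells containing a sample
    missed += (~hit).sum() / r                 # sum of P[C_i] over cells missed by S
print("estimate of E[ sum over unhit cells of P[C_i] ]:", missed / trials)
print("bound r/(m e):", r / (m * np.e))
```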