19.2 Analysis
Proof Since $L_{\mathcal{D}}(h_S) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbb{1}_{[h_S(x)\neq y]}\big]$, we obtain that $\mathbb{E}_S[L_{\mathcal{D}}(h_S)]$ is the probability of sampling a training set $S$ and an additional example $(x,y)$ such that the label of $\pi_1(x)$ is different from $y$. In other words, we can first sample $m$ unlabeled examples, $S_x = (x_1,\ldots,x_m)$, according to $\mathcal{D}_X$, and an additional unlabeled example $x \sim \mathcal{D}_X$, then find $\pi_1(x)$, the nearest neighbor of $x$ in $S_x$, and finally sample $y \sim \eta(x)$ and $y' \sim \eta(\pi_1(x))$. It follows that
\[
\begin{aligned}
\mathbb{E}_S[L_{\mathcal{D}}(h_S)]
&= \mathbb{E}_{S_x\sim\mathcal{D}_X^m,\; x\sim\mathcal{D}_X,\; y\sim\eta(x),\; y'\sim\eta(\pi_1(x))}\big[\mathbb{1}_{[y\neq y']}\big] \\
&= \mathbb{E}_{S_x\sim\mathcal{D}_X^m,\; x\sim\mathcal{D}_X}\Big[\,\mathbb{P}_{y\sim\eta(x),\; y'\sim\eta(\pi_1(x))}[y\neq y']\,\Big]. \qquad (19.2)
\end{aligned}
\]
We next upper bound $\mathbb{P}_{y\sim\eta(x),\,y'\sim\eta(x')}[y\neq y']$ for any two domain points $x, x'$:
\[
\begin{aligned}
\mathbb{P}_{y\sim\eta(x),\,y'\sim\eta(x')}[y\neq y']
&= \eta(x')(1-\eta(x)) + (1-\eta(x'))\,\eta(x) \\
&= \big(\eta(x) - \eta(x) + \eta(x')\big)(1-\eta(x)) + \big(1 - \eta(x) + \eta(x) - \eta(x')\big)\eta(x) \\
&= 2\eta(x)(1-\eta(x)) + \big(\eta(x)-\eta(x')\big)\big(2\eta(x)-1\big).
\end{aligned}
\]
Using $|2\eta(x)-1| \le 1$ and the assumption that $\eta$ is $c$-Lipschitz, we obtain that the probability is at most:
\[
\mathbb{P}_{y\sim\eta(x),\,y'\sim\eta(x')}[y\neq y'] \;\le\; 2\eta(x)(1-\eta(x)) + c\,\|x-x'\|.
\]
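As a quick numerical sanity check (a sketch, not part of the proof), the rewriting above and the resulting bound can be spot-checked for random values $a = \eta(x)$, $b = \eta(x')$ in $[0,1]$:

```python
import random

random.seed(0)

# Spot-check the algebraic identity behind the bound:
#   b(1-a) + (1-b)a  ==  2a(1-a) + (a-b)(2a-1)
# where a = eta(x) and b = eta(x') are arbitrary values in [0, 1].
max_err = 0.0
bound_ok = True
for _ in range(1000):
    a, b = random.random(), random.random()
    lhs = b * (1 - a) + (1 - b) * a                # P[y != y']
    rhs = 2 * a * (1 - a) + (a - b) * (2 * a - 1)  # rewritten form
    max_err = max(max_err, abs(lhs - rhs))
    # Using |2a - 1| <= 1 gives lhs <= 2a(1-a) + |a - b|:
    bound_ok = bound_ok and lhs <= 2 * a * (1 - a) + abs(a - b) + 1e-12

print(max_err, bound_ok)
```

The $c$-Lipschitz assumption then replaces $|a-b| = |\eta(x)-\eta(x')|$ with $c\,\|x-x'\|$.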
Plugging this into Equation (19.2) we conclude that
\[
\mathbb{E}_S[L_{\mathcal{D}}(h_S)] \;\le\; \mathbb{E}_x\big[2\eta(x)(1-\eta(x))\big] + c\,\mathbb{E}_{S_x,x}\big[\|x - \pi_1(x)\|\big].
\]
Finally, the error of the Bayes optimal classifier is
\[
L_{\mathcal{D}}(h^\star) = \mathbb{E}_x\big[\min\{\eta(x),\,1-\eta(x)\}\big] \;\ge\; \mathbb{E}_x\big[\eta(x)(1-\eta(x))\big],
\]
where the inequality holds because $\min\{a,1-a\} \ge a(1-a)$ for every $a \in [0,1]$ (since $\max\{a,1-a\} \le 1$). Combining the preceding two inequalities concludes our proof.
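The combined bound, $\mathbb{E}_S[L_{\mathcal{D}}(h_S)] \le 2L_{\mathcal{D}}(h^\star) + c\,\mathbb{E}[\|x-\pi_1(x)\|]$, can be sanity-checked by simulation. The setup below is a hypothetical illustration, not from the text: $\mathcal{D}_X$ uniform on $[0,1]$ and $\eta(x) = 0.5 + 0.4\sin(3x)$, which is $c$-Lipschitz with $c = 1.2$.

```python
import math
import random

random.seed(0)

# Hypothetical illustration: D_X uniform on [0, 1] and
# eta(x) = 0.5 + 0.4*sin(3x), c-Lipschitz with c = 1.2.
def eta(x):
    return 0.5 + 0.4 * math.sin(3 * x)

c, m, trials = 1.2, 50, 20000
lhs = 0.0      # Monte Carlo estimate of E_S[L_D(h_S)] for the 1-NN rule
nn_dist = 0.0  # Monte Carlo estimate of E[|x - pi_1(x)|]
bayes = 0.0    # Monte Carlo estimate of L_D(h*) = E[min(eta, 1 - eta)]
for _ in range(trials):
    S = [random.random() for _ in range(m)]  # unlabeled sample S_x
    x = random.random()                      # fresh test point
    nn = min(S, key=lambda s: abs(s - x))    # nearest neighbor pi_1(x)
    y = random.random() < eta(x)             # y  ~ eta(x)
    y_nn = random.random() < eta(nn)         # y' ~ eta(pi_1(x))
    lhs += (y != y_nn) / trials
    nn_dist += abs(x - nn) / trials
    bayes += min(eta(x), 1 - eta(x)) / trials

rhs = 2 * bayes + c * nn_dist
print(lhs, rhs)  # the theorem predicts lhs <= rhs
```

With $m = 50$ training points, the nearest-neighbor distance term $c\,\mathbb{E}[\|x-\pi_1(x)\|]$ is already small, so the estimate of the left-hand side sits well below the right-hand side.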
The next step is to bound the expected distance between a random $x$ and its closest element in $S$. We first need the following general probability lemma. The lemma bounds the probability weight of subsets that are not hit by a random sample, as a function of the size of that sample.
Lemma 19.2 Let $C_1,\ldots,C_r$ be a collection of subsets of some domain set, $\mathcal{X}$. Let $S$ be a sequence of $m$ points sampled i.i.d. according to some probability distribution, $\mathcal{D}$ over $\mathcal{X}$. Then,
\[
\mathbb{E}_{S\sim\mathcal{D}^m}\Bigg[\sum_{i\,:\,C_i\cap S=\emptyset} \mathbb{P}[C_i]\Bigg] \;\le\; \frac{r}{me}.
\]
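The lemma can be checked exactly in a simple special case (an illustrative sketch, not the general proof): take $\mathcal{X} = [0,1]$ with the uniform distribution and let $C_1,\ldots,C_r$ partition it into $r$ equal intervals. A cell of mass $p$ is missed by all $m$ i.i.d. samples with probability $(1-p)^m$, so the expectation has the closed form $\sum_i p_i(1-p_i)^m$:

```python
import math

# Exact expectation of the missed probability mass for cells with the
# given masses: each cell of mass p is missed with probability (1-p)^m.
def expected_missed_mass(masses, m):
    return sum(p * (1 - p) ** m for p in masses)

r, m = 10, 20
masses = [1.0 / r] * r  # r equal cells partitioning [0, 1]
e_missed = expected_missed_mass(masses, m)
bound = r / (m * math.e)
print(e_missed, bound)  # lemma predicts e_missed <= bound
```

In this equal-partition case the expectation equals $(1-1/r)^m$, which indeed stays below $r/(me)$; the proof of the lemma uses the pointwise bound $p(1-p)^m \le p\,e^{-pm} \le \tfrac{1}{me}$ for each cell.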