Understanding Machine Learning: From Theory to Algorithms

(Jeff_L) #1
28.2 The Lower Bound for the Agnostic Case 395

Therefore,


max
b∈{± 1 }

P[LDb(A(y))−LDb(hb) =]

= max
b∈{± 1 }


y

Pb[y] (^1) [A(y) 6 =b]



1

2


y

P+[y] (^1) [A(y) 6 =+]+


1

2


y

P−[y] (^1) [A(y) 6 =−]


=

1

2


y∈N+

(P+[y] (^1) [A(y) 6 =+]+P−[y] (^1) [A(y) 6 =−]) +


1

2


y∈N−

(P+[y] (^1) [A(y) 6 =+]+P−[y] (^1) [A(y) 6 =−])



1

2


y∈N+

(P−[y] (^1) [A(y) 6 =+]+P−[y] (^1) [A(y) 6 =−]) +


1

2


y∈N−

(P+[y] (^1) [A(y) 6 =+]+P+[y] (^1) [A(y) 6 =−])


=

1

2


y∈N+

P−[y] +

1

2


y∈N−

P+[y].

Next note that



y∈N+P−[y] =


y∈N−P+[y], and both values are the prob-
ability that a Binomial (m,(1−)/2) random variable will have value greater
thanm/2. Using Lemma B.11, this probability is lower bounded by


1
2

(

1 −


1 −exp(−m^2 /(1−^2 ))

)

≥^1

2

(

1 −


1 −exp(− 2 m^2 )

)

,

where we used the assumption that^2 ≤ 1 /2. It follows that ifm≤ 0 .5 log(1/(4δ))/^2
then there existsbsuch that


P[LDb(A(y))−LDb(hb) =]

≥^1
2

(

1 −


1 −


4 δ

)

≥δ,

where the last inequality follows by standard algebraic manipulations. This con-
cludes our proof.


28.2.2 Showing Thatm(, 1 /8)≥ 8 d/


We shall now prove that for every < 1 /(8



2) we have thatm(,δ)≥^8 d 2.
Letρ= 8and note thatρ∈(0, 1 /


2). We will construct a family of distri-
butions as follows. First, letC={c 1 ,...,cd}be a set ofdinstances which are
shattered byH. Second, for each vector (b 1 ,...,bd)∈ {± 1 }d, define a distribu-
tionDbsuch that


Db({(x,y)}) =

{ 1


1+ybiρ
2 if∃i:x=ci
0 otherwise.

That is, to sample an example according toDb, we first sample an elementci∈C
uniformly at random, and then set the label to bebiwith probability (1 +ρ)/ 2
or−biwith probability (1−ρ)/2.
It is easy to verify that the Bayes optimal predictor forDbis the hypothesis

Free download pdf