$C \times \{0,1\}$ defined by
$$
\mathcal{D}_i(\{(x,y)\}) =
\begin{cases}
1/|C| & \text{if } y = f_i(x) \\
0 & \text{otherwise.}
\end{cases}
$$
That is, the probability to choose a pair $(x,y)$ is $1/|C|$ if the label $y$ is indeed the true label according to $f_i$, and the probability is $0$ if $y \neq f_i(x)$. Clearly, $L_{\mathcal{D}_i}(f_i) = 0$.
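To make the construction concrete, the following Python sketch (illustrative only, not part of the text; the toy domain of size $2m = 4$ is an assumption) enumerates all labeling functions $f_i : C \to \{0,1\}$ over a small $C$, builds each $\mathcal{D}_i$, and checks that $L_{\mathcal{D}_i}(f_i) = 0$.

```python
import itertools

# Toy domain: C has 2m = 4 instances (m = 2); labels are in {0, 1}.
C = [0, 1, 2, 3]

def all_labelings(C):
    """Enumerate every function f_i : C -> {0, 1} as a dict."""
    for bits in itertools.product([0, 1], repeat=len(C)):
        yield dict(zip(C, bits))

def make_Di(f_i, C):
    """D_i over C x {0,1}: mass 1/|C| on (x, f_i(x)), zero elsewhere."""
    return {(x, y): (1 / len(C) if y == f_i[x] else 0.0)
            for x in C for y in (0, 1)}

def true_risk(h, D):
    """L_D(h): probability of drawing a pair (x, y) with h(x) != y."""
    return sum(p for (x, y), p in D.items() if h(x) != y)

# As claimed in the text, L_{D_i}(f_i) = 0 for every labeling f_i.
for f_i in all_labelings(C):
    D_i = make_Di(f_i, C)
    assert true_risk(lambda x: f_i[x], D_i) == 0
```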
We will show that for every algorithm, $A$, that receives a training set of $m$ examples from $C \times \{0,1\}$ and returns a function $A(S) : C \to \{0,1\}$, it holds that
$$\max_{i \in [T]} \; \mathbb{E}_{S \sim \mathcal{D}_i^m} \left[ L_{\mathcal{D}_i}(A(S)) \right] \geq 1/4. \qquad (5.1)$$
Clearly, this means that for every algorithm, $A'$, that receives a training set of $m$ examples from $\mathcal{X} \times \{0,1\}$, there exists a function $f : \mathcal{X} \to \{0,1\}$ and a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$, such that $L_{\mathcal{D}}(f) = 0$ and
$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[ L_{\mathcal{D}}(A'(S)) \right] \geq 1/4. \qquad (5.2)$$
(To see this, take an index $i$ that attains the maximum in Equation (5.1) and let $f = f_i$ and $\mathcal{D} = \mathcal{D}_i$, viewing $\mathcal{D}_i$ as a distribution over $\mathcal{X} \times \{0,1\}$ that assigns zero mass outside $C \times \{0,1\}$.)
It is easy to verify that the preceding suffices for showing that $\mathbb{P}[L_{\mathcal{D}}(A'(S)) \geq 1/8] \geq 1/7$, which is what we need to prove (see Exercise 1).
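One short way to verify this implication (a standard argument for bounded random variables, and essentially the content of that exercise): since the 0-1 loss takes values in $[0,1]$, write $Z = L_{\mathcal{D}}(A'(S))$, so that Equation (5.2) gives $\mathbb{E}[Z] \geq 1/4$ and
$$\frac{1}{4} \;\leq\; \mathbb{E}[Z] \;\leq\; \frac{1}{8}\,\mathbb{P}\!\left[Z < \tfrac{1}{8}\right] + 1 \cdot \mathbb{P}\!\left[Z \geq \tfrac{1}{8}\right] \;=\; \frac{1}{8} + \frac{7}{8}\,\mathbb{P}\!\left[Z \geq \tfrac{1}{8}\right],$$
which rearranges to $\mathbb{P}[Z \geq 1/8] \geq \frac{1/4 - 1/8}{7/8} = 1/7$.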
We now turn to proving that Equation (5.1) holds. There are $k = (2m)^m$ possible sequences of $m$ examples from $C$. Denote these sequences by $S_1, \ldots, S_k$. Also, if $S_j = (x_1, \ldots, x_m)$ we denote by $S_j^i$ the sequence containing the instances in $S_j$ labeled by the function $f_i$, namely, $S_j^i = ((x_1, f_i(x_1)), \ldots, (x_m, f_i(x_m)))$. If the distribution is $\mathcal{D}_i$ then the possible training sets $A$ can receive are $S_1^i, \ldots, S_k^i$, and all these training sets have the same probability of being sampled. Therefore,
$$\mathbb{E}_{S \sim \mathcal{D}_i^m}\left[ L_{\mathcal{D}_i}(A(S)) \right] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i)). \qquad (5.3)$$
Using the facts that “maximum” is larger than “average” and that “average” is larger than “minimum,” we have
$$
\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i))
\;\geq\; \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i))
\;=\; \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S_j^i))
\;\geq\; \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S_j^i)). \qquad (5.4)
$$
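Because every quantity in Equations (5.3) and (5.4) is a finite average, the chain can be checked exactly for small $m$. Below is a minimal Python sketch (illustrative only; the "memorize the sample, predict 0 elsewhere" learner is an arbitrary choice of $A$, not one made by the text) that enumerates all $T = 2^{2m}$ labeling functions and all $k = (2m)^m$ training sequences and evaluates both sides.

```python
import itertools

m = 2
C = list(range(2 * m))                              # |C| = 2m instances
sequences = list(itertools.product(C, repeat=m))    # k = (2m)^m instance sequences
k = len(sequences)
labelings = [dict(zip(C, bits))                     # T = 2^{2m} functions f_i : C -> {0, 1}
             for bits in itertools.product([0, 1], repeat=len(C))]

def A(S):
    """An arbitrary learner: memorize the training labels, predict 0 elsewhere."""
    memory = dict(S)
    return lambda x: memory.get(x, 0)

def risk(h, f_i):
    """L_{D_i}(h): each x in C carries mass 1/|C| and has true label f_i(x)."""
    return sum(h(x) != f_i[x] for x in C) / len(C)

def expected_risk(f_i):
    """Right-hand side of (5.3): average loss of A over the k labeled sequences S_j^i."""
    return sum(risk(A([(x, f_i[x]) for x in S_j]), f_i) for S_j in sequences) / k

worst = max(expected_risk(f_i) for f_i in labelings)                    # max over i in (5.4)
average = sum(expected_risk(f_i) for f_i in labelings) / len(labelings) # average over i

print(worst, average)
assert worst >= average >= 0.25     # matches (5.1) and the first inequality of (5.4)
```

The enumeration grows as $(2m)^m \cdot 2^{2m}$, so this only serves as a sanity check for very small $m$; the point is simply that the averages in (5.3) and (5.4) are finite sums that can be computed exactly.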
Next, fix some $j \in [k]$. Denote $S_j = (x_1, \ldots, x_m)$ and let $v_1, \ldots, v_p$ be the examples in $C$ that do not appear in $S_j$. Clearly, $p \geq m$. Therefore, for every