62 The Bias-Complexity Tradeoff
$C \times \{0,1\}$ defined by
\[
D_i(\{(x,y)\}) =
\begin{cases}
1/|C| & \text{if } y = f_i(x) \\
0 & \text{otherwise.}
\end{cases}
\]
That is, the probability to choose a pair $(x,y)$ is $1/|C|$ if the label $y$ is indeed the true label according to $f_i$, and the probability is $0$ if $y \neq f_i(x)$. Clearly, $L_{D_i}(f_i) = 0$.
We will show that for every algorithm, $A$, that receives a training set of $m$ examples from $C \times \{0,1\}$ and returns a function $A(S) : C \to \{0,1\}$, it holds that
\[
\max_{i \in [T]} \ \mathbb{E}_{S \sim D_i^m} \left[ L_{D_i}(A(S)) \right] \ge 1/4. \tag{5.1}
\]
Clearly, this means that for every algorithm, $A'$, that receives a training set of $m$ examples from $\mathcal{X} \times \{0,1\}$ there exists a function $f : \mathcal{X} \to \{0,1\}$ and a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$, such that $L_{\mathcal{D}}(f) = 0$ and
\[
\mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}(A'(S)) \right] \ge 1/4. \tag{5.2}
\]
It is easy to verify that the preceding suffices for showing that $\mathbb{P}[L_{\mathcal{D}}(A'(S)) \ge 1/8] \ge 1/7$, which is what we need to prove (see Exercise 1).
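For intuition (this is essentially the content of Exercise 1, sketched here without spelling out all details): since the random variable $Z = L_{\mathcal{D}}(A'(S))$ takes values in $[0,1]$,
\[
\tfrac{1}{4} \le \mathbb{E}[Z] \le \mathbb{P}\!\left[Z \ge \tfrac{1}{8}\right] \cdot 1 + \left(1 - \mathbb{P}\!\left[Z \ge \tfrac{1}{8}\right]\right) \cdot \tfrac{1}{8},
\]
which rearranges to $\mathbb{P}[Z \ge 1/8] \ge (1/4 - 1/8)/(1 - 1/8) = 1/7$.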
We now turn to proving that Equation (5.1) holds. There are $k = (2m)^m$ possible sequences of $m$ examples from $C$. Denote these sequences by $S_1, \ldots, S_k$. Also, if $S_j = (x_1, \ldots, x_m)$ we denote by $S_j^i$ the sequence containing the instances in $S_j$ labeled by the function $f_i$, namely, $S_j^i = ((x_1, f_i(x_1)), \ldots, (x_m, f_i(x_m)))$. If the distribution is $D_i$ then the possible training sets $A$ can receive are $S_1^i, \ldots, S_k^i$, and all these training sets have the same probability of being sampled. Therefore,
\[
\mathbb{E}_{S \sim D_i^m} \left[ L_{D_i}(A(S)) \right] = \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)). \tag{5.3}
\]
Using the facts that "maximum" is larger than "average" and that "average" is larger than "minimum," we have
\[
\begin{aligned}
\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i))
&\ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \\
&= \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \\
&\ge \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)). \tag{5.4}
\end{aligned}
\]
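As a sanity check, the quantities in Equations (5.1) and (5.3) can be computed exactly for a tiny domain. The sketch below is illustrative only: the memorizing learner `A` and the choice $m = 2$ are assumptions, not part of the text. It enumerates all $T = 2^{2m}$ labeling functions and all $k = (2m)^m$ instance sequences, and confirms that the worst-case expected error is at least $1/4$:

```python
from itertools import product

m = 2
C = list(range(2 * m))                         # domain of size |C| = 2m
funcs = list(product([0, 1], repeat=len(C)))   # all T = 2^{2m} functions f_i : C -> {0,1}
seqs = list(product(C, repeat=m))              # all k = (2m)^m instance sequences S_j

def true_error(h, f):
    # L_{D_i}(h): D_i is uniform over C with labels given by f
    return sum(h[x] != f[x] for x in C) / len(C)

def A(sample):
    # a hypothetical learner: memorize seen labels, predict 0 on unseen points
    h = [0] * len(C)
    for x, y in sample:
        h[x] = y
    return h

def avg_error(f):
    # (1/k) * sum_j L_{D_i}(A(S_j^i)), as in Equation (5.3)
    return sum(true_error(A([(x, f[x]) for x in S]), f) for S in seqs) / len(seqs)

worst = max(avg_error(f) for f in funcs)       # left-hand side of Equation (5.1)
print(worst)                                   # → 0.5625, comfortably above 1/4
```

Here the maximum is attained by the all-ones labeling, for which the memorizer errs on every unseen point; the theorem's bound of $1/4$ holds with room to spare.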
Next, fix some $j \in [k]$. Denote $S_j = (x_1, \ldots, x_m)$ and let $v_1, \ldots, v_p$ be the examples in $C$ that do not appear in $S_j$. Clearly, $p \ge m$. Therefore, for every