Understanding Machine Learning: From Theory to Algorithms


The Bias-Complexity Tradeoff


For each $i \in [T]$, let $\mathcal{D}_i$ be a distribution over $C \times \{0,1\}$ defined by
\[
\mathcal{D}_i(\{(x,y)\}) =
\begin{cases}
1/|C| & \text{if } y = f_i(x), \\
0 & \text{otherwise.}
\end{cases}
\]
That is, the probability to choose a pair $(x,y)$ is $1/|C|$ if the label $y$ is indeed
the true label according to $f_i$, and the probability is $0$ if $y \neq f_i(x)$. Clearly,
$L_{\mathcal{D}_i}(f_i) = 0$.
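As a concrete illustration, this construction can be written as a short Python sketch; the helper name make_D_i is hypothetical, and $C$ is assumed to be given as a finite list of instances:

def make_D_i(C, f_i):
    # D_i puts probability mass 1/|C| on every correctly labeled pair (x, f_i(x))
    # and mass 0 on every incorrectly labeled pair.
    p = 1.0 / len(C)
    return {(x, y): (p if y == f_i(x) else 0.0) for x in C for y in (0, 1)}

Sampling from this distribution amounts to drawing $x$ uniformly from $C$ and pairing it with $f_i(x)$, which is why $L_{\mathcal{D}_i}(f_i) = 0$.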
We will show that for every algorithm, $A$, that receives a training set of $m$
examples from $C \times \{0,1\}$ and returns a function $A(S) : C \to \{0,1\}$, it holds that
\[
\max_{i \in [T]} \; \mathop{\mathbb{E}}_{S \sim \mathcal{D}_i^m} \left[ L_{\mathcal{D}_i}(A(S)) \right] \;\ge\; 1/4. \tag{5.1}
\]

Clearly, this means that for every algorithm, $A'$, that receives a training set of $m$
examples from $\mathcal{X} \times \{0,1\}$ there exist a function $f : \mathcal{X} \to \{0,1\}$ and a distribution
$\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$, such that $L_{\mathcal{D}}(f) = 0$ and
\[
\mathop{\mathbb{E}}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}(A'(S)) \right] \;\ge\; 1/4. \tag{5.2}
\]
It is easy to verify that the preceding suffices for showing that
$\mathbb{P}\big[L_{\mathcal{D}}(A'(S)) \ge 1/8\big] \ge 1/7$, which is what we need to prove (see Exercise 1).
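A sketch of the verification deferred to Exercise 1, assuming only that the 0-1 loss satisfies $0 \le L_{\mathcal{D}}(A'(S)) \le 1$: write $Z = L_{\mathcal{D}}(A'(S))$ and bound the expectation by splitting on the event $\{Z \ge 1/8\}$,
\[
\tfrac14 \;\le\; \mathbb{E}[Z] \;\le\; \mathbb{P}\big[Z \ge \tfrac18\big] \cdot 1 + \mathbb{P}\big[Z < \tfrac18\big] \cdot \tfrac18 \;=\; \tfrac18 + \tfrac78\, \mathbb{P}\big[Z \ge \tfrac18\big],
\]
so rearranging gives $\mathbb{P}\big[Z \ge \tfrac18\big] \ge \big(\tfrac14 - \tfrac18\big)\big/\tfrac78 = \tfrac17$.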
We now turn to proving that Equation (5.1) holds. There are $k = (2m)^m$
possible sequences of $m$ examples from $C$. Denote these sequences by $S_1, \ldots, S_k$.
Also, if $S_j = (x_1, \ldots, x_m)$ we denote by $S_j^i$ the sequence containing the instances
in $S_j$ labeled by the function $f_i$, namely, $S_j^i = ((x_1, f_i(x_1)), \ldots, (x_m, f_i(x_m)))$. If
the distribution is $\mathcal{D}_i$ then the possible training sets $A$ can receive are $S_1^i, \ldots, S_k^i$,
and all these training sets have the same probability of being sampled. Therefore,

\[
\mathop{\mathbb{E}}_{S \sim \mathcal{D}_i^m} \left[ L_{\mathcal{D}_i}(A(S)) \right] \;=\; \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}\big(A(S_j^i)\big). \tag{5.3}
\]

Using the facts that “maximum” is larger than “average” and that “average” is
larger than “minimum,” we have

\[
\begin{aligned}
\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}\big(A(S_j^i)\big)
&\;\ge\; \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}\big(A(S_j^i)\big) \\
&\;=\; \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}\big(A(S_j^i)\big) \\
&\;\ge\; \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}\big(A(S_j^i)\big).
\end{aligned}
\tag{5.4}
\]
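To see the averaging argument in Equations (5.3) and (5.4) play out numerically, here is a minimal Python sketch for an assumed toy setting ($m = 2$, so $|C| = 2m = 4$, $T = 2^{2m} = 16$, and $k = (2m)^m = 16$) and an illustrative learner that memorizes its training labels and predicts $0$ elsewhere; the learner, helper names, and sizes are assumptions made for illustration, since the argument itself holds for every learner:

import itertools

m = 2
C = list(range(2 * m))                                        # the domain C, with |C| = 2m points
functions = list(itertools.product([0, 1], repeat=len(C)))    # all T = 2^{2m} labelings f_1, ..., f_T
sequences = list(itertools.product(C, repeat=m))              # all k = (2m)^m sequences S_1, ..., S_k

def A(labeled_sample):
    # Illustrative learner: memorize the training labels, predict 0 on unseen points.
    memory = {x: y for x, y in labeled_sample}
    return lambda x: memory.get(x, 0)

def true_error(h, f):
    # L_{D_i}(h) when D_i is uniform over C with labels given by f.
    return sum(h(x) != f[x] for x in C) / len(C)

expected_errors = []
for f in functions:
    # Equation (5.3): average the loss over the k equally likely training sets S_j^i.
    losses = [true_error(A([(x, f[x]) for x in S]), f) for S in sequences]
    expected_errors.append(sum(losses) / len(losses))

# Equation (5.1): the worst labeling forces an expected error of at least 1/4.
print(max(expected_errors), sum(expected_errors) / len(expected_errors))

Running it shows that even the average over the $T$ labelings exceeds $1/4$, so the maximum in Equation (5.1) certainly does.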

Next, fix some $j \in [k]$. Denote $S_j = (x_1, \ldots, x_m)$ and let $v_1, \ldots, v_p$ be the
examples in $C$ that do not appear in $S_j$. Clearly, $p \ge m$. Therefore, for every