Understanding Machine Learning: From Theory to Algorithms


Proof of the Fundamental Theorem of Learning Theory


that there are h+, h− ∈ H for which h+(c) = 1 and h−(c) = −1. Define two
distributions, D+ and D−, such that for b ∈ {±1} we have

D_b({(x, y)}) = (1 + y·b·ε)/2 if x = c, and 0 otherwise.

That is, all the distribution mass is concentrated on two examples, (c, 1) and
(c, −1), where the probability of (c, b) is (1 + ε)/2 and the probability of
(c, −b) is (1 − ε)/2.
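As a quick check of these two probabilities (spelling out arithmetic that the text leaves implicit): substituting y = b and y = −b into the definition, and using b·b = 1,

D_b({(c, b)}) = (1 + b·b·ε)/2 = (1 + ε)/2,   D_b({(c, −b)}) = (1 + (−b)·b·ε)/2 = (1 − ε)/2,

and the two masses sum to 1, so D_b is indeed a probability distribution supported on these two examples.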
Let A be an arbitrary algorithm. Any training set sampled from D_b has the
form S = (c, y_1), ..., (c, y_m). Therefore, it is fully characterized by the vector
y = (y_1, ..., y_m) ∈ {±1}^m. Upon receiving a training set S, the algorithm A
returns a hypothesis h : X → {±1}. Since the error of A w.r.t. D_b only depends
on h(c), we can think of A as a mapping from {±1}^m into {±1}. Therefore,
we denote by A(y) the value in {±1} corresponding to the prediction of h(c),
where h is the hypothesis that A outputs upon receiving the training set S =
(c, y_1), ..., (c, y_m).
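For instance (an illustration, not from the original text), the ERM rule over the pair {h+, h−} corresponds to a majority vote over the labels: it returns h+, and hence A(y) = 1, exactly when the majority of y_1, ..., y_m equal 1 (with ties broken arbitrarily).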
Note that for any hypothesis h we have

L_{D_b}(h) = (1 − h(c)·b·ε)/2.
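To spell out where this comes from (a step the text takes for granted): h misclassifies exactly the example (c, −h(c)), so

L_{D_b}(h) = D_b({(c, −h(c))}) = (1 + (−h(c))·b·ε)/2 = (1 − h(c)·b·ε)/2,

which equals (1 − ε)/2 when h(c) = b and (1 + ε)/2 when h(c) = −b.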

In particular, the Bayes optimal hypothesis is h_b and

L_{D_b}(A(y)) − L_{D_b}(h_b) = (1 − A(y)·b·ε)/2 − (1 − ε)/2 = ε if A(y) ≠ b, and 0 otherwise.
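Concretely (a quick check of the two cases): if A(y) = −b then A(y)·b = −1 and the difference is (1 + ε)/2 − (1 − ε)/2 = ε, while if A(y) = b the two terms coincide and the difference is 0.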

Fix A. For b ∈ {±1}, let Y_b = {y ∈ {±1}^m : A(y) ≠ b}. The distribution D_b
induces a probability P_b over {±1}^m. Hence,

P[L_{D_b}(A(y)) − L_{D_b}(h_b) = ε] = D_b(Y_b) = Σ_y P_b[y] · 1_[A(y) ≠ b].
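This is the quantity the remainder of the argument lower-bounds. As a purely illustrative aid (not part of the book's proof; the values of eps and m and the majority-vote learner below are arbitrary choices made here), a small simulation can estimate D_b(Y_b) for one concrete algorithm A:

```python
import numpy as np

# Illustrative simulation of the construction above. The values of eps and m
# and the majority-vote learner are arbitrary choices for this sketch; none
# of them come from the text.
rng = np.random.default_rng(0)
eps, m, trials = 0.1, 20, 100_000

def sample_labels(b, size):
    # Each label equals b with probability (1 + eps)/2 and -b otherwise,
    # i.e. the label coordinates y_1, ..., y_m of a training set drawn from D_b.
    return np.where(rng.random(size) < (1 + eps) / 2, b, -b)

def A(y):
    # One concrete learner: predict the majority training label
    # (ties broken in favour of +1). This plays the role of A(y) in the proof.
    return 1 if y.sum() >= 0 else -1

for b in (+1, -1):
    # Estimate D_b(Y_b) = P[A(y) != b], i.e. the probability that the
    # learner's excess risk equals eps rather than 0.
    labels = sample_labels(b, (trials, m))
    err_rate = np.mean([A(y) != b for y in labels])
    print(f"b = {b:+d}: estimated D_b(Y_b) = {err_rate:.3f}")
```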
Denote N+ = {y : |{i : y_i = 1}| ≥ m/2} and N− = {±1}^m \ N+. Note that for
any y ∈ N+ we have P_+[y] ≥ P_−[y] and for any y ∈ N− we have P_−[y] ≥ P_+[y].
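To verify this claim (a short calculation not spelled out in the text): since the labels are i.i.d., P_b[y] = Π_i (1 + y_i·b·ε)/2, so writing k = |{i : y_i = 1}| we get

P_+[y] / P_−[y] = ((1 + ε)/(1 − ε))^(2k − m),

which is at least 1 exactly when k ≥ m/2, that is, when y ∈ N+; the case y ∈ N− is symmetric.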
