Understanding Machine Learning: From Theory to Algorithms


Proof of the Fundamental Theorem of Learning Theory


that there are h+, h− ∈ H for which h+(c) = 1 and h−(c) = −1. Define two
distributions, D+ and D−, such that for b ∈ {±1} we have

D_b({(x, y)}) = (1 + y·b·ε)/2 if x = c, and 0 otherwise.

That is, all the distribution mass is concentrated on two examples, (c, 1) and
(c, −1), where the probability of (c, b) is (1 + ε)/2 and the probability of
(c, −b) is (1 − ε)/2.
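As a quick check of these two probabilities (spelling out arithmetic that the text leaves implicit): substituting y = b and y = −b into the definition, and using b·b = 1,

D_b({(c, b)}) = (1 + b·b·ε)/2 = (1 + ε)/2,   D_b({(c, −b)}) = (1 + (−b)·b·ε)/2 = (1 − ε)/2,

and the two masses sum to 1, so D_b is indeed a probability distribution supported on these two examples.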
Let A be an arbitrary algorithm. Any training set sampled from D_b has the
form S = (c, y_1), ..., (c, y_m). Therefore, it is fully characterized by the vector
y = (y_1, ..., y_m) ∈ {±1}^m. Upon receiving a training set S, the algorithm A
returns a hypothesis h : X → {±1}. Since the error of A w.r.t. D_b only depends
on h(c), we can think of A as a mapping from {±1}^m into {±1}. Therefore,
we denote by A(y) the value in {±1} corresponding to the prediction of h(c),
where h is the hypothesis that A outputs upon receiving the training set S =
(c, y_1), ..., (c, y_m).
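For instance (an illustration, not from the original text), the ERM rule over the pair {h+, h−} corresponds to a majority vote over the labels: it returns h+, and hence A(y) = 1, exactly when the majority of y_1, ..., y_m equal 1 (with ties broken arbitrarily).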
Note that for any hypothesis h we have

L_{D_b}(h) = (1 − h(c)·b·ε)/2.
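To spell out where this comes from (a step the text takes for granted): h misclassifies exactly the example (c, −h(c)), so

L_{D_b}(h) = D_b({(c, −h(c))}) = (1 + (−h(c))·b·ε)/2 = (1 − h(c)·b·ε)/2,

which equals (1 − ε)/2 when h(c) = b and (1 + ε)/2 when h(c) = −b.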

In particular, the Bayes optimal hypothesis is h_b and

L_{D_b}(A(y)) − L_{D_b}(h_b) = (1 − A(y)·b·ε)/2 − (1 − ε)/2 = ε if A(y) ≠ b, and 0 otherwise.
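Concretely (a quick check of the two cases): if A(y) = −b then A(y)·b = −1 and the difference is (1 + ε)/2 − (1 − ε)/2 = ε, while if A(y) = b the two terms coincide and the difference is 0.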

Fix A. For b ∈ {±1}, let Y_b = {y ∈ {±1}^m : A(y) ≠ b}. The distribution D_b
induces a probability P_b over {±1}^m. Hence,

P[L_{D_b}(A(y)) − L_{D_b}(h_b) = ε] = D_b(Y_b) = Σ_y P_b[y] · 1_[A(y) ≠ b].
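This is the quantity the remainder of the argument lower-bounds. As a purely illustrative aid (not part of the book's proof; the values of eps and m and the majority-vote learner below are arbitrary choices made here), a small simulation can estimate D_b(Y_b) for one concrete algorithm A:

```python
import numpy as np

# Illustrative simulation of the construction above. The values of eps and m
# and the majority-vote learner are arbitrary choices for this sketch; none
# of them come from the text.
rng = np.random.default_rng(0)
eps, m, trials = 0.1, 20, 100_000

def sample_labels(b, size):
    # Each label equals b with probability (1 + eps)/2 and -b otherwise,
    # i.e. the label coordinates y_1, ..., y_m of a training set drawn from D_b.
    return np.where(rng.random(size) < (1 + eps) / 2, b, -b)

def A(y):
    # One concrete learner: predict the majority training label
    # (ties broken in favour of +1). This plays the role of A(y) in the proof.
    return 1 if y.sum() >= 0 else -1

for b in (+1, -1):
    # Estimate D_b(Y_b) = P[A(y) != b], i.e. the probability that the
    # learner's excess risk equals eps rather than 0.
    labels = sample_labels(b, (trials, m))
    err_rate = np.mean([A(y) != b for y in labels])
    print(f"b = {b:+d}: estimated D_b(Y_b) = {err_rate:.3f}")
```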
Denote N+ = {y : |{i : y_i = 1}| ≥ m/2} and N− = {±1}^m \ N+. Note that for
any y ∈ N+ we have P_+[y] ≥ P_−[y] and for any y ∈ N− we have P_−[y] ≥ P_+[y].
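To verify this claim (a short calculation not spelled out in the text): since the labels are i.i.d., P_b[y] = Π_i (1 + y_i·b·ε)/2, so writing k = |{i : y_i = 1}| we get

P_+[y] / P_−[y] = ((1 + ε)/(1 − ε))^(2k − m),

which is at least 1 exactly when k ≥ m/2, that is, when y ∈ N+; the case y ∈ N− is symmetric.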
