$h \in \mathcal{H}$ such that $h(c_i) = b_i$ for all $i \in [d]$, and its error is $\frac{1-\rho}{2}$. In addition, for
any other function $f : \mathcal{X} \to \{\pm 1\}$, it is easy to verify that
$$
L_{\mathcal{D}_b}(f) = \frac{1+\rho}{2} \cdot \frac{|\{i \in [d] : f(c_i) \neq b_i\}|}{d} + \frac{1-\rho}{2} \cdot \frac{|\{i \in [d] : f(c_i) = b_i\}|}{d}.
$$
Therefore,
$$
L_{\mathcal{D}_b}(f) - \min_{h \in \mathcal{H}} L_{\mathcal{D}_b}(h) = \rho \cdot \frac{|\{i \in [d] : f(c_i) \neq b_i\}|}{d}. \tag{28.2}
$$
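As a quick sanity check of Equation (28.2), the following sketch (a hypothetical numerical check, not part of the original proof) evaluates the two-term expression for $L_{\mathcal{D}_b}(f)$ above and confirms that the excess error equals $\rho$ times the disagreement fraction:

```python
import numpy as np

# Hypothetical sanity check of Equation (28.2) for a random b and f.
rng = np.random.default_rng(0)
d, rho = 10, 0.3
b = rng.choice([-1, 1], size=d)   # target labels b_i for the points c_1,...,c_d
f = rng.choice([-1, 1], size=d)   # an arbitrary predictor's values f(c_i)

disagree = np.mean(f != b)        # |{i in [d] : f(c_i) != b_i}| / d
# L_{D_b}(f) as in the displayed two-term expression
loss_f = (1 + rho) / 2 * disagree + (1 - rho) / 2 * (1 - disagree)
loss_best = (1 - rho) / 2         # error of the h in H with h(c_i) = b_i for all i

assert np.isclose(loss_f - loss_best, rho * disagree)  # Equation (28.2)
```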
Next, fix some learning algorithm $A$. As in the proof of the No-Free-Lunch theorem, we have that
\begin{align*}
& \max_{\mathcal{D}_b : b \in \{\pm 1\}^d} \; \mathop{\mathbb{E}}_{S \sim \mathcal{D}_b^m} \Big[ L_{\mathcal{D}_b}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}_b}(h) \Big] \tag{28.3} \\
&\geq \mathop{\mathbb{E}}_{\mathcal{D}_b : b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{S \sim \mathcal{D}_b^m} \Big[ L_{\mathcal{D}_b}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}_b}(h) \Big] \tag{28.4} \\
&= \mathop{\mathbb{E}}_{\mathcal{D}_b : b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{S \sim \mathcal{D}_b^m} \Big[ \rho \cdot \frac{|\{i \in [d] : A(S)(c_i) \neq b_i\}|}{d} \Big] \tag{28.5} \\
&= \frac{\rho}{d} \sum_{i=1}^{d} \mathop{\mathbb{E}}_{\mathcal{D}_b : b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{S \sim \mathcal{D}_b^m} \mathbb{1}_{[A(S)(c_i) \neq b_i]}, \tag{28.6}
\end{align*}
where the first equality follows from Equation (28.2). In addition, using the definition of $\mathcal{D}_b$, to sample $S \sim \mathcal{D}_b^m$ we can first sample $(j_1, \ldots, j_m) \sim U([d])^m$, set $x_r = c_{j_r}$, and finally sample $y_r$ such that $\mathbb{P}[y_r = b_{j_r}] = (1+\rho)/2$. Let us simplify the notation and use $y \sim b$ to denote sampling according to $\mathbb{P}[y = b] = (1+\rho)/2$. Therefore, the right-hand side of Equation (28.6) equals
$$
\frac{\rho}{d} \sum_{i=1}^{d} \mathop{\mathbb{E}}_{j \sim U([d])^m} \; \mathop{\mathbb{E}}_{b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{\forall r,\, y_r \sim b_{j_r}} \mathbb{1}_{[A(S)(c_i) \neq b_i]}. \tag{28.7}
$$
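The sampling procedure just described translates directly into a simulation. The sketch below (hypothetical code, not from the text; the interface of the learning rule `A` is an assumption made for this sketch) draws $b$, $j$, and the labels $y_r$ exactly as above and estimates the quantity in Equation (28.7) for a given rule by Monte Carlo:

```python
import numpy as np

def estimate_28_7(A, d=5, m=20, rho=0.2, n_trials=2000, seed=0):
    """Monte Carlo estimate of the quantity in Equation (28.7).

    `A` is any learning rule mapping (j, y, d) -- the sample expressed via the
    indices j_r and labels y_r -- to a length-d vector of predictions on
    c_1, ..., c_d (a hypothetical interface chosen for this sketch).
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_trials):
        b = rng.choice([-1, 1], size=d)       # b ~ U({+-1}^d)
        j = rng.integers(0, d, size=m)        # (j_1,...,j_m) ~ U([d])^m
        # y_r equals b_{j_r} with probability (1+rho)/2, and -b_{j_r} otherwise
        flip = rng.random(m) < (1 - rho) / 2
        y = np.where(flip, -b[j], b[j])
        preds = A(j, y, d)                    # predictions on c_1, ..., c_d
        total += np.mean(preds != b)          # (1/d) * sum_i 1[A(S)(c_i) != b_i]
    return rho * total / n_trials
```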
We now proceed in two steps. First, we show that among all learning algorithms $A$, the one which minimizes Equation (28.7) (and hence also Equation (28.4)) is the Maximum-Likelihood learning rule, denoted $A_{\mathrm{ML}}$. Formally, for each $i$, $A_{\mathrm{ML}}(S)(c_i)$ is the majority vote among the set $\{y_r : r \in [m],\, x_r = c_i\}$. Second, we lower bound Equation (28.7) for $A_{\mathrm{ML}}$.
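As a concrete illustration (a minimal sketch under the same hypothetical interface as the sampling code above, not the book's own code), the majority-vote rule can be implemented as follows; ties and unseen points are resolved in favor of $+1$, a choice not specified by the text:

```python
import numpy as np

def A_ML(j, y, d):
    """Maximum-Likelihood (majority-vote) rule: for each point c_i, predict
    the sign of the sum of the labels y_r observed with x_r = c_i."""
    votes = np.zeros(d)
    np.add.at(votes, j, y)             # votes[i] = sum over {r : j_r = i} of y_r
    return np.where(votes >= 0, 1, -1)
```

Combined with the earlier sketch, `estimate_28_7(A_ML)` then gives a numerical estimate of Equation (28.7) for the Maximum-Likelihood rule.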
Lemma 28.1  Among all algorithms, Equation (28.4) is minimized for $A$ being the Maximum-Likelihood algorithm, $A_{\mathrm{ML}}$, defined as
$$
\forall i, \qquad A_{\mathrm{ML}}(S)(c_i) = \operatorname{sign}\!\Big( \sum_{r : x_r = c_i} y_r \Big).
$$
Proof  Fix some $j \in [d]^m$. Note that given $j$ and $y \in \{\pm 1\}^m$, the training set $S$ is fully determined. Therefore, we can write $A(j, y)$ instead of $A(S)$. Let us also fix $i \in [d]$. Denote by $b^{\neg i}$ the sequence $(b_1, \ldots, b_{i-1}, b_{i+1}, \ldots, b_d)$. Also, for any