$h \in \mathcal{H}$ such that $h(c_i) = b_i$ for all $i \in [d]$, and its error is $\frac{1-\rho}{2}$. In addition, for
any other function $f : \mathcal{X} \to \{\pm 1\}$, it is easy to verify that
\[
L_{\mathcal{D}_b}(f) \;=\; \frac{1+\rho}{2} \cdot \frac{|\{i \in [d] : f(c_i) \neq b_i\}|}{d}
\;+\; \frac{1-\rho}{2} \cdot \frac{|\{i \in [d] : f(c_i) = b_i\}|}{d}.
\]
Therefore,
\[
L_{\mathcal{D}_b}(f) - \min_{h \in \mathcal{H}} L_{\mathcal{D}_b}(h) \;=\; \rho \cdot \frac{|\{i \in [d] : f(c_i) \neq b_i\}|}{d}. \tag{28.2}
\]
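One way to verify Equation (28.2), using only the quantities already defined: the two index sets partition $[d]$, and the best hypothesis in $\mathcal{H}$ attains error $\frac{1-\rho}{2}$, so
\[
L_{\mathcal{D}_b}(f) - \frac{1-\rho}{2}
= \frac{1+\rho}{2} \cdot \frac{|\{i : f(c_i) \neq b_i\}|}{d}
- \frac{1-\rho}{2} \cdot \frac{|\{i : f(c_i) \neq b_i\}|}{d}
= \rho \cdot \frac{|\{i : f(c_i) \neq b_i\}|}{d},
\]
where we used $\frac{|\{i : f(c_i) = b_i\}|}{d} - 1 = -\frac{|\{i : f(c_i) \neq b_i\}|}{d}$.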
Next, fix some learning algorithm $A$. As in the proof of the No-Free-Lunch
theorem, we have that
\begin{align}
& \max_{\mathcal{D}_b : b \in \{\pm 1\}^d} \; \mathop{\mathbb{E}}_{S \sim \mathcal{D}_b^m} \left[ L_{\mathcal{D}_b}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}_b}(h) \right] \tag{28.3} \\
& \geq \mathop{\mathbb{E}}_{\mathcal{D}_b : b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{S \sim \mathcal{D}_b^m} \left[ L_{\mathcal{D}_b}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}_b}(h) \right] \tag{28.4} \\
& = \mathop{\mathbb{E}}_{\mathcal{D}_b : b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{S \sim \mathcal{D}_b^m} \left[ \rho \cdot \frac{|\{i \in [d] : A(S)(c_i) \neq b_i\}|}{d} \right] \tag{28.5} \\
& = \frac{\rho}{d} \sum_{i=1}^{d} \; \mathop{\mathbb{E}}_{\mathcal{D}_b : b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{S \sim \mathcal{D}_b^m} \mathbf{1}_{[A(S)(c_i) \neq b_i]}, \tag{28.6}
\end{align}
where the first equality follows from Equation (28.2) and the second from the linearity
of expectation. In addition, using the definition of $\mathcal{D}_b$, to sample $S \sim \mathcal{D}_b^m$ we can
first sample $(j_1,\ldots,j_m) \sim U([d])^m$, set $x_r = c_{j_r}$, and finally sample $y_r$ such that
$\mathbb{P}[y_r = b_{j_r}] = (1+\rho)/2$. Let us simplify the notation and use $y \sim b$ to denote
sampling according to $\mathbb{P}[y = b] = (1+\rho)/2$.
Therefore, the right-hand side of Equation (28.6) equals
\[
\frac{\rho}{d} \sum_{i=1}^{d} \; \mathop{\mathbb{E}}_{j \sim U([d])^m} \; \mathop{\mathbb{E}}_{b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{\forall r,\, y_r \sim b_{j_r}} \mathbf{1}_{[A(S)(c_i) \neq b_i]}. \tag{28.7}
\]
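The two-stage sampling procedure just described can also be simulated directly. The following is a minimal sketch, assuming NumPy; the function name sample_S and the representation of each point $c_{j_r}$ by its index $j_r$ are illustrative choices, not part of the text.

```python
import numpy as np

def sample_S(b, rho, m, rng):
    """Sample S ~ D_b^m via the two-stage procedure in the text:
    draw (j_1, ..., j_m) ~ U([d])^m, set x_r = c_{j_r} (encoded here by
    the index j_r), and set y_r = b_{j_r} with probability (1 + rho)/2."""
    d = len(b)
    j = rng.integers(0, d, size=m)             # indices j_r, uniform over [d]
    agree = rng.random(m) < (1 + rho) / 2      # P[y_r = b_{j_r}] = (1 + rho)/2
    y = np.where(agree, b[j], -b[j])
    return j, y

# Example: d = 5 points, a fixed labeling b, noise level rho = 0.2.
rng = np.random.default_rng(0)
b = np.array([1, -1, 1, 1, -1])
j, y = sample_S(b, rho=0.2, m=10, rng=rng)
```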
We now proceed in two steps. First, we show that among all learning algorithms
$A$, the one which minimizes Equation (28.7) (and hence also Equation (28.4))
is the Maximum-Likelihood learning rule, denoted $A_{ML}$. Formally, for each $i$,
$A_{ML}(S)(c_i)$ is the majority vote among the set $\{y_r : r \in [m],\, x_r = c_i\}$. Second,
we lower bound Equation (28.7) for $A_{ML}$.
lemma 28.1 \quad Among all algorithms, Equation (28.4) is minimized for $A$ being
the Maximum-Likelihood algorithm, $A_{ML}$, defined as
\[
\forall i, \quad A_{ML}(S)(c_i) = \operatorname{sign}\!\left( \sum_{r : x_r = c_i} y_r \right).
\]
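For concreteness, the rule of Lemma 28.1 admits a short sketch in the same illustrative representation used above (points $c_i$ encoded by their indices); the tie-breaking for points that never appear in $S$ is an arbitrary choice here, since the lemma does not prescribe it.

```python
import numpy as np

def A_ML(j, y, d):
    """Majority-vote (Maximum-Likelihood) rule: predict the sign of the
    sum of labels observed at each point c_i. Ties and unseen points are
    assigned +1 arbitrarily."""
    votes = np.zeros(d)
    np.add.at(votes, j, y)          # votes[i] = sum over {r : x_r = c_i} of y_r
    pred = np.sign(votes)
    pred[pred == 0] = 1             # arbitrary label for ties / unseen points
    return pred                     # pred[i] plays the role of A_ML(S)(c_i)
```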
Proof \quad Fix some $j \in [d]^m$. Note that given $j$ and $y \in \{\pm 1\}^m$, the training set
$S$ is fully determined. Therefore, we can write $A(j,y)$ instead of $A(S)$. Let us
also fix $i \in [d]$. Denote by $b_{\neg i}$ the sequence $(b_1,\ldots,b_{i-1},b_{i+1},\ldots,b_d)$. Also, for any