Understanding Machine Learning: From Theory to Algorithms


Proof of the Fundamental Theorem of Learning Theory


h ∈ H such that h(c_i) = b_i for all i ∈ [d], and its error is (1 − ρ)/2. In addition, for any other function f : X → {±1}, it is easy to verify that

L_{D_b}(f) = \frac{1+\rho}{2} \cdot \frac{|\{i \in [d] : f(c_i) \neq b_i\}|}{d} + \frac{1-\rho}{2} \cdot \frac{|\{i \in [d] : f(c_i) = b_i\}|}{d}.

Therefore,

L_{D_b}(f) - \min_{h \in H} L_{D_b}(h) = \rho \cdot \frac{|\{i \in [d] : f(c_i) \neq b_i\}|}{d}.    (28.2)
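To make the step to Equation (28.2) explicit (the shorthand k below is introduced here and is not part of the original text): the hypothesis h with h(c_i) = b_i for all i attains the minimal error (1 − ρ)/2, so writing k = |{i ∈ [d] : f(c_i) ≠ b_i}| gives

L_{D_b}(f) = \frac{1+\rho}{2} \cdot \frac{k}{d} + \frac{1-\rho}{2} \cdot \frac{d-k}{d} = \frac{1-\rho}{2} + \rho \cdot \frac{k}{d},

and subtracting \min_{h \in H} L_{D_b}(h) = (1-\rho)/2 leaves exactly \rho \cdot k/d, which is Equation (28.2).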

Next, fix some learning algorithm A. As in the proof of the No-Free-Lunch theorem, we have that

\max_{D_b : b \in \{\pm 1\}^d} \; \mathbb{E}_{S \sim D_b^m} \Big[ L_{D_b}(A(S)) - \min_{h \in H} L_{D_b}(h) \Big]    (28.3)

\geq \; \mathbb{E}_{b \sim U(\{\pm 1\}^d)} \; \mathbb{E}_{S \sim D_b^m} \Big[ L_{D_b}(A(S)) - \min_{h \in H} L_{D_b}(h) \Big]    (28.4)

= \; \mathbb{E}_{b \sim U(\{\pm 1\}^d)} \; \mathbb{E}_{S \sim D_b^m} \left[ \rho \cdot \frac{|\{i \in [d] : A(S)(c_i) \neq b_i\}|}{d} \right]    (28.5)

= \; \frac{\rho}{d} \sum_{i=1}^{d} \; \mathbb{E}_{b \sim U(\{\pm 1\}^d)} \; \mathbb{E}_{S \sim D_b^m} \; \mathbb{1}_{[A(S)(c_i) \neq b_i]},    (28.6)
where the first equality follows from Equation (28.2). In addition, using the definition of D_b, to sample S ∼ D_b we can first sample (j_1, ..., j_m) ∼ U([d])^m, set x_r = c_{j_r}, and finally sample y_r such that P[y_r = b_{j_r}] = (1 + ρ)/2. Let us simplify the notation and use y ∼ b to denote sampling according to P[y = b] = (1 + ρ)/2. Therefore, the right-hand side of Equation (28.6) equals
\frac{\rho}{d} \sum_{i=1}^{d} \; \mathbb{E}_{j \sim U([d])^m} \; \mathbb{E}_{b \sim U(\{\pm 1\}^d)} \; \mathbb{E}_{\forall r,\, y_r \sim b_{j_r}} \; \mathbb{1}_{[A(S)(c_i) \neq b_i]}.    (28.7)
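As a concrete illustration of the sampling procedure just described, here is a minimal Python sketch; representing each point c_i by its index i, and the name sample_from_Db, are choices of this sketch rather than anything from the book.

```python
import numpy as np

def sample_from_Db(b, rho, m, rng):
    """Draw S ~ D_b^m: pick j_r ~ U([d]), set x_r = c_{j_r} (represented here
    by the index j_r), and draw y_r = b_{j_r} with probability (1 + rho)/2."""
    d = len(b)
    j = rng.integers(0, d, size=m)            # (j_1, ..., j_m) ~ U([d])^m
    agree = rng.random(m) < (1 + rho) / 2     # does y_r agree with b_{j_r}?
    y = np.where(agree, b[j], -b[j])
    return j, y                               # (j, y) fully determines S
```

For example, sample_from_Db(np.array([1, -1, 1]), 0.1, 5, np.random.default_rng(0)) draws a sample of size m = 5 over d = 3 points, returning the index vector j and the label vector y.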
We now proceed in two steps. First, we show that among all learning algorithms A, the one which minimizes Equation (28.7) (and hence also Equation (28.4)) is the Maximum-Likelihood learning rule, denoted A_ML. Formally, for each i, A_ML(S)(c_i) is the majority vote among the set {y_r : r ∈ [m], x_r = c_i}. Second, we lower bound Equation (28.7) for A_ML.
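On the (j, y) representation of the previous sketch, the majority-vote rule can be written as follows; this assumes the encoding above, and breaking a tie (an empty or evenly split vote) in favour of +1 is an arbitrary convention of this sketch that the excerpt does not specify.

```python
import numpy as np

def A_ML(j, y, d):
    """Majority vote at each point c_i: predict the sign of the sum of the
    labels observed at c_i. Ties and empty votes default to +1 (a convention
    chosen only for this sketch)."""
    votes = np.zeros(d)
    np.add.at(votes, j, y)                    # votes[i] = sum of y_r with x_r = c_i
    return np.where(votes >= 0, 1, -1)
```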
Lemma 28.1  Among all algorithms, Equation (28.4) is minimized for A being the Maximum-Likelihood algorithm, A_ML, defined as

\forall i, \quad A_{\mathrm{ML}}(S)(c_i) = \mathrm{sign}\!\left( \sum_{r : x_r = c_i} y_r \right).

Proof  Fix some j ∈ [d]^m. Note that given j and y ∈ {±1}^m, the training set S is fully determined. Therefore, we can write A(j, y) instead of A(S). Let us also fix i ∈ [d]. Denote by b_{¬i} the sequence (b_1, ..., b_{i−1}, b_{i+1}, ..., b_d). Also, for any