B.5 Bennett's and Bernstein's Inequalities
Then for all $\epsilon > 0$,
\[
P\left[\,\sum_{i=1}^{m} Z_i > \epsilon\right] \;\le\; e^{-m\sigma^2\, h\left(\frac{\epsilon}{m\sigma^2}\right)},
\]
where
\[
h(a) = (1+a)\log(1+a) - a.
\]
By using the inequality $h(a) \ge \frac{a^2}{2 + 2a/3}$ it is possible to derive the following:
Lemma B.9 (Bernstein's inequality) Let $Z_1,\ldots,Z_m$ be i.i.d. random variables with a zero mean. If for all $i$, $P(|Z_i| < M) = 1$, then for all $t > 0$:
\[
P\left[\,\sum_{i=1}^{m} Z_i > t\right] \;\le\; \exp\left( -\frac{t^2/2}{\sum_j \mathbb{E}[Z_j^2] + Mt/3} \right).
\]
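To see how Bernstein's bound follows, one can apply Bennett's inequality to the rescaled variables $Y_i = Z_i/M$ (so that they are bounded by 1) and then invoke the inequality $h(a) \ge a^2/(2+2a/3)$ stated above. A sketch of the calculation, assuming Bennett's inequality applies to the $Y_i$'s:

```latex
% Take Y_i = Z_i/M, so |Y_i| <= 1, and set \sigma^2 = \frac{1}{m}\sum_j \mathbb{E}[Y_j^2],
% hence m\sigma^2 M^2 = \sum_j \mathbb{E}[Z_j^2]. Bennett's inequality gives
\[
P\Big[\sum_{i=1}^{m} Z_i > t\Big]
= P\Big[\sum_{i=1}^{m} Y_i > \tfrac{t}{M}\Big]
\le e^{-m\sigma^2 h(a)}, \qquad a = \frac{t}{M m \sigma^2}.
\]
% Applying h(a) \ge a^2/(2 + 2a/3) to the exponent:
\[
m\sigma^2\, h(a)
\;\ge\; \frac{m\sigma^2 a^2}{2 + 2a/3}
\;=\; \frac{t^2}{2 M^2 m \sigma^2 + 2Mt/3}
\;=\; \frac{t^2/2}{\sum_j \mathbb{E}[Z_j^2] + Mt/3},
\]
% which is exactly the exponent appearing in Bernstein's inequality.
```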
B.5.1 Application
Bernstein's inequality can be used to interpolate between the rate $1/\epsilon$ we derived for PAC learning in the realizable case (in Chapter 2) and the rate $1/\epsilon^2$ we derived for the unrealizable case (in Chapter 4).
Lemma B.10 Let $\ell : H \times Z \to [0,1]$ be a loss function. Let $D$ be an arbitrary distribution over $Z$. Fix some $h$. Then, for any $\delta \in (0,1)$ we have

1. $\;P_{S \sim D^m}\!\left[\, L_S(h) \ge L_D(h) + \sqrt{\dfrac{2\,L_D(h)\log(1/\delta)}{m}} + \dfrac{2\log(1/\delta)}{3m} \,\right] \le \delta$

2. $\;P_{S \sim D^m}\!\left[\, L_D(h) \ge L_S(h) + \sqrt{\dfrac{2\,L_S(h)\log(1/\delta)}{m}} + \dfrac{4\log(1/\delta)}{m} \,\right] \le \delta$
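As a sanity check, the second claim of the lemma can be tested numerically. The sketch below uses a 0/1 loss whose values are i.i.d. Bernoulli($p$), so that $L_D(h) = p$ and $L_S(h)$ is a sample mean; the Bernoulli setup and all constants are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Empirical check of claim 2 for a 0/1 loss: the loss values l(h, z_i)
# are i.i.d. Bernoulli(p), so L_D(h) = p and L_S(h) is the sample mean.
# (The Bernoulli model and the constants below are illustrative choices.)
rng = np.random.default_rng(0)
p, m, delta, trials = 0.2, 500, 0.05, 20_000

losses = rng.binomial(1, p, size=(trials, m))   # trials independent samples S
L_S = losses.mean(axis=1)                       # empirical risk per sample
bound = L_S + np.sqrt(2 * L_S * np.log(1 / delta) / m) + 4 * np.log(1 / delta) / m

# Fraction of samples on which the bound fails; the lemma says it is at most delta.
failure_rate = np.mean(p >= bound)
print(failure_rate)
```

On this scale of $m$ the observed failure rate is far below $\delta$, since the bound is not tight for a fixed Bernoulli distribution.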
Proof Define random variables $\alpha_1,\ldots,\alpha_m$ s.t. $\alpha_i = \ell(h,z_i) - L_D(h)$. Note that $\mathbb{E}[\alpha_i] = 0$ and that
\[
\begin{aligned}
\mathbb{E}[\alpha_i^2] &= \mathbb{E}[\ell(h,z_i)^2] - 2\,L_D(h)\,\mathbb{E}[\ell(h,z_i)] + L_D(h)^2 \\
&= \mathbb{E}[\ell(h,z_i)^2] - L_D(h)^2 \\
&\le \mathbb{E}[\ell(h,z_i)^2] \\
&\le \mathbb{E}[\ell(h,z_i)] = L_D(h),
\end{aligned}
\]
where in the last inequality we used the fact that $\ell(h,z_i) \in [0,1]$ and thus $\ell(h,z_i)^2 \le \ell(h,z_i)$. Applying Bernstein's inequality over the $\alpha_i$'s yields
\[
P\left[\,\sum_{i=1}^{m} \alpha_i > t\right]
\;\le\; \exp\left( -\frac{t^2/2}{\sum_j \mathbb{E}[\alpha_j^2] + t/3} \right)
\;\le\; \exp\left( -\frac{t^2/2}{m\,L_D(h) + t/3} \right)
\;\stackrel{\mathrm{def}}{=}\; \delta.
\]
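From here, the first claim follows by inverting the definition of $\delta$: the equation $\tfrac{t^2}{2} = \log(1/\delta)\big(m L_D(h) + t/3\big)$ is quadratic in $t$, and taking its positive root gives (a sketch, using $\sqrt{x+y} \le \sqrt{x} + \sqrt{y}$):

```latex
\[
t \;=\; \frac{\log(1/\delta)}{3}
   + \sqrt{\frac{\log^2(1/\delta)}{9} + 2 m L_D(h)\log(1/\delta)}
  \;\le\; \frac{2\log(1/\delta)}{3} + \sqrt{2 m L_D(h)\log(1/\delta)}.
\]
% Since \sum_i \alpha_i = m\,(L_S(h) - L_D(h)), dividing by m yields the
% deviation term in claim 1.
```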