Understanding Machine Learning: From Theory to Algorithms

B.5 Bennet’s and Bernstein’s Inequalities

lemma B.8 (Bennet’s inequality) Let \(Z_1,\ldots,Z_m\) be independent random variables with zero mean, and assume that \(Z_i \le 1\) with probability 1. Let \(\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\mathbb{E}[Z_i^2]\). Then for all \(\epsilon > 0\),

\[
\mathbb{P}\!\left[\sum_{i=1}^{m} Z_i > \epsilon\right] \;\le\; e^{-m\sigma^2\, h\left(\frac{\epsilon}{m\sigma^2}\right)},
\]

where

\[
h(a) = (1+a)\log(1+a) - a.
\]
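
As a quick numerical illustration (a sketch of mine, not part of the text; the function names are ad hoc), the bound is straightforward to evaluate:

```python
import math

def h(a):
    # h(a) = (1 + a) * log(1 + a) - a, the function in Bennet's bound.
    return (1 + a) * math.log(1 + a) - a

def bennet_tail_bound(m, sigma2, eps):
    # Upper bound on P[sum_{i=1}^m Z_i > eps], assuming the convention
    # sigma2 = (1/m) * sum_i E[Z_i^2] used in the lemma above.
    return math.exp(-m * sigma2 * h(eps / (m * sigma2)))

# Example: m = 1000 zero-mean variables with average second moment 0.05;
# bound on the probability that their sum exceeds 30.
print(bennet_tail_bound(m=1000, sigma2=0.05, eps=30.0))  # ~5.0e-4
```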
By using the inequality \(h(a) \ge \frac{a^2}{2 + 2a/3}\) it is possible to derive the following:

lemma B.9 (Bernstein’s inequality) Let \(Z_1,\ldots,Z_m\) be i.i.d. random variables with zero mean. If for all \(i\), \(\mathbb{P}(|Z_i| < M) = 1\), then for all \(t > 0\):

\[
\mathbb{P}\!\left[\sum_{i=1}^{m} Z_i > t\right] \;\le\; \exp\!\left(-\,\frac{t^2/2}{\sum_j \mathbb{E}[Z_j^2] + Mt/3}\right).
\]
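
Both the inequality \(h(a) \ge a^2/(2+2a/3)\) and the resulting tail bound are easy to sanity-check; the sketch below (mine, with illustrative names) verifies the former on a grid and compares Bernstein’s bound against a Monte Carlo estimate for uniform variables on \([-1,1]\):

```python
import math
import random

def h(a):
    return (1 + a) * math.log(1 + a) - a

# Check h(a) >= a^2 / (2 + 2a/3) on a grid of positive values.
assert all(h(a) >= a * a / (2 + 2 * a / 3) for a in [x / 100 for x in range(1, 1001)])

def bernstein_tail_bound(m, ez2, M, t):
    # Bernstein upper bound on P[sum_i Z_i > t] for |Z_i| <= M, where
    # ez2 = E[Z_i^2] (i.i.d., so sum_j E[Z_j^2] = m * ez2).
    return math.exp(-(t * t / 2) / (m * ez2 + M * t / 3))

# Monte Carlo: Z_i uniform on [-1, 1], so E[Z_i^2] = 1/3 and M = 1.
random.seed(0)
m, t, trials = 200, 20.0, 20_000
hits = sum(sum(random.uniform(-1, 1) for _ in range(m)) > t for _ in range(trials))
print(hits / trials, "<=", bernstein_tail_bound(m, ez2=1 / 3, M=1.0, t=t))
```

The empirical tail frequency comes out well below the bound, as it must; Bernstein is loose here but captures the right order of magnitude.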

B.5.1 Application


Bernstein’s inequality can be used to interpolate between the rate \(1/\epsilon\) we derived for PAC learning in the realizable case (in Chapter 2) and the rate \(1/\epsilon^2\) we derived for the unrealizable case (in Chapter 4).
lemma B.10 Let \(\ell : \mathcal{H}\times Z \to [0,1]\) be a loss function. Let \(\mathcal{D}\) be an arbitrary distribution over \(Z\). Fix some \(h\). Then, for any \(\delta \in (0,1)\) we have

1. \(\displaystyle \mathbb{P}_{S\sim\mathcal{D}^m}\!\left[ L_S(h) \ge L_D(h) + \sqrt{\frac{2\,L_D(h)\log(1/\delta)}{m}} + \frac{2\log(1/\delta)}{3m} \right] \le \delta\)

2. \(\displaystyle \mathbb{P}_{S\sim\mathcal{D}^m}\!\left[ L_D(h) \ge L_S(h) + \sqrt{\frac{2\,L_S(h)\log(1/\delta)}{m}} + \frac{4\log(1/\delta)}{m} \right] \le \delta\)
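
In code, statement 2 gives a one-line upper confidence bound on the true loss in terms of the empirical loss (a minimal sketch under my own naming, not the book’s):

```python
import math

def true_loss_upper_bound(ls, m, delta):
    # Statement 2 of Lemma B.10: with probability >= 1 - delta over S ~ D^m,
    # L_D(h) <= L_S(h) + sqrt(2 * L_S(h) * log(1/delta) / m) + 4 * log(1/delta) / m.
    c = math.log(1 / delta)
    return ls + math.sqrt(2 * ls * c / m) + 4 * c / m

# With m = 10,000 samples, empirical loss 0.02, and delta = 0.05:
print(true_loss_upper_bound(ls=0.02, m=10_000, delta=0.05))  # ~0.025
```

Note the interpolation at work: when \(L_S(h)\) is close to zero the square-root term vanishes and the bound decays at the fast \(\log(1/\delta)/m\) rate, while for constant \(L_S(h)\) the \(\sqrt{1/m}\) term dominates.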

Proof Define random variables \(\alpha_1,\ldots,\alpha_m\) s.t. \(\alpha_i = \ell(h,z_i) - L_D(h)\). Note that \(\mathbb{E}[\alpha_i] = 0\) and that

\[
\begin{aligned}
\mathbb{E}[\alpha_i^2] &= \mathbb{E}[\ell(h,z_i)^2] - 2L_D(h)\,\mathbb{E}[\ell(h,z_i)] + L_D(h)^2 \\
&= \mathbb{E}[\ell(h,z_i)^2] - L_D(h)^2 \\
&\le \mathbb{E}[\ell(h,z_i)^2] \\
&\le \mathbb{E}[\ell(h,z_i)] = L_D(h),
\end{aligned}
\]

where in the last inequality we used the fact that \(\ell(h,z_i)\in[0,1]\) and thus \(\ell(h,z_i)^2 \le \ell(h,z_i)\). Applying Bernstein’s inequality over the \(\alpha_i\)’s yields

\[
\mathbb{P}\!\left[\sum_{i=1}^{m}\alpha_i > t\right] \;\le\; \exp\!\left(-\,\frac{t^2/2}{\sum_j \mathbb{E}[\alpha_j^2] + t/3}\right) \;\le\; \exp\!\left(-\,\frac{t^2/2}{m\,L_D(h) + t/3}\right) \;\stackrel{\text{def}}{=}\; \delta.
\]

Solving the equation \(\exp\!\left(-\frac{t^2/2}{m L_D(h) + t/3}\right) = \delta\) for \(t\) and dividing by \(m\) yields the first statement of the lemma.
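
The inversion in this last step is just the positive root of a quadratic in \(t\); the small sketch below (mine, with illustrative names) makes the bookkeeping explicit and checks it against the two terms in statement 1:

```python
import math

def solve_t(m, ld, delta):
    # Positive root of t^2/2 = (m * ld + t/3) * log(1/delta), i.e. the exact t
    # at which the Bernstein bound above equals delta.
    c = math.log(1 / delta)
    return c / 3 + math.sqrt(c * c / 9 + 2 * m * ld * c)

m, ld, delta = 1000, 0.1, 0.05
c = math.log(1 / delta)
# sqrt(x + y) <= sqrt(x) + sqrt(y) turns t/m into the two terms of statement 1:
print(solve_t(m, ld, delta) / m)                    # ~0.0255
print(math.sqrt(2 * ld * c / m) + 2 * c / (3 * m))  # ~0.0265, slightly larger
```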