Understanding Machine Learning: From Theory to Algorithms

6.5 Proof of Theorem 6.7

The expectation on the right-hand side is over a choice of two i.i.d. samples
$S = z_1,\ldots,z_m$ and $S' = z'_1,\ldots,z'_m$. Since all of these $2m$ vectors are chosen
i.i.d., nothing will change if we replace the name of the random vector $z_i$ with the
name of the random vector $z'_i$. If we do so, instead of the term $(\ell(h,z'_i) - \ell(h,z_i))$
in Equation (6.5) we will have the term $-(\ell(h,z'_i) - \ell(h,z_i))$. It follows that for
every $\sigma \in \{\pm 1\}^m$, Equation (6.5) equals


\[
\mathop{\mathbb{E}}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i\bigl(\ell(h,z'_i) - \ell(h,z_i)\bigr)\right|\right]
\]

Since this holds for every $\sigma \in \{\pm 1\}^m$, it also holds if we sample each component
of $\sigma$ uniformly at random from the uniform distribution over $\{\pm 1\}$, denoted $U_{\pm}$.
Hence, Equation (6.5) also equals


\[
\mathop{\mathbb{E}}_{\sigma \sim U_{\pm}^m}\ \mathop{\mathbb{E}}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i\bigl(\ell(h,z'_i) - \ell(h,z_i)\bigr)\right|\right],
\]

and by the linearity of expectation it also equals


\[
\mathop{\mathbb{E}}_{S,S' \sim \mathcal{D}^m}\ \mathop{\mathbb{E}}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i\bigl(\ell(h,z'_i) - \ell(h,z_i)\bigr)\right|\right].
\]

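As an informal aside (not part of the proof), the symmetrization identity above can be checked numerically. The sketch below assumes a toy setup not taken from the book: threshold classifiers on $[0,1]$ with the 0-1 loss and labels given by a fixed target threshold at $0.5$. It estimates the expectation once without signs and once with random signs; the two estimates should agree up to Monte Carlo noise, mirroring the swap of $z_i$ and $z'_i$.

```python
# Illustrative Monte Carlo check of the symmetrization step (toy setup, assumed).
import numpy as np

rng = np.random.default_rng(0)
m = 50                                # sample size
thresholds = np.linspace(0, 1, 21)    # finite toy hypothesis class H

def loss(theta, x, y):
    """0-1 loss of the threshold classifier h_theta(x) = 1[x >= theta]."""
    return ((x >= theta).astype(int) != y).astype(float)

def sup_term(x, y, xp, yp, sigma):
    """sup_h (1/m) | sum_i sigma_i (l(h, z'_i) - l(h, z_i)) |."""
    vals = [np.abs(np.mean(sigma * (loss(t, xp, yp) - loss(t, x, y))))
            for t in thresholds]
    return max(vals)

def estimate(random_signs, trials=2000):
    total = 0.0
    for _ in range(trials):
        x, xp = rng.random(m), rng.random(m)               # S and S' i.i.d. from D
        y, yp = (x >= 0.5).astype(int), (xp >= 0.5).astype(int)
        sigma = rng.choice([-1, 1], size=m) if random_signs else np.ones(m)
        total += sup_term(x, y, xp, yp, sigma)
    return total / trials

# The two estimates should agree up to sampling noise.
print("no signs     :", estimate(random_signs=False))
print("random signs :", estimate(random_signs=True))
```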
Next, fix $S$ and $S'$, and let $C$ be the instances appearing in $S$ and $S'$. Then, we
can take the supremum only over $h \in \mathcal{H}_C$. Therefore,


\[
\mathop{\mathbb{E}}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i\bigl(\ell(h,z'_i) - \ell(h,z_i)\bigr)\right|\right]
= \mathop{\mathbb{E}}_{\sigma \sim U_{\pm}^m}\left[\max_{h \in \mathcal{H}_C} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i\bigl(\ell(h,z'_i) - \ell(h,z_i)\bigr)\right|\right].
\]

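To make the restriction to $\mathcal{H}_C$ concrete, the following illustrative sketch (an assumed toy example, not the book's construction) enumerates the distinct behaviors that the infinite class of threshold classifiers induces on a finite set $C$ of $2m$ points. Only these finitely many behaviors, at most $\tau_{\mathcal{H}}(2m)$ of them, matter for the maximum.

```python
# Illustrative enumeration of H_C for thresholds on [0, 1] (assumed toy class).
import numpy as np

rng = np.random.default_rng(1)
m = 5
C = np.sort(rng.random(2 * m))        # the 2m instances appearing in S and S'

# Although the threshold class is infinite, its restriction to C is finite:
# one candidate threshold per gap between consecutive points suffices.
candidate_thresholds = np.concatenate(([0.0], (C[:-1] + C[1:]) / 2, [1.0]))
behaviors = {tuple((C >= t).astype(int)) for t in candidate_thresholds}

print("2m =", 2 * m, " |H_C| =", len(behaviors))   # here |H_C| = 2m + 1
```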
Fix some $h \in \mathcal{H}_C$ and denote $\theta_h = \frac{1}{m}\sum_{i=1}^{m} \sigma_i\bigl(\ell(h,z'_i) - \ell(h,z_i)\bigr)$. Since $\mathbb{E}[\theta_h] = 0$
and $\theta_h$ is an average of independent variables, each of which takes values in
$[-1,1]$, we have by Hoeffding's inequality that for every $\rho > 0$,


\[
\mathbb{P}\bigl[|\theta_h| > \rho\bigr] \le 2\exp\bigl(-2m\rho^2\bigr).
\]

Applying the union bound over $h \in \mathcal{H}_C$, we obtain that for any $\rho > 0$,


\[
\mathbb{P}\left[\max_{h \in \mathcal{H}_C} |\theta_h| > \rho\right] \le 2\,|\mathcal{H}_C|\exp\bigl(-2m\rho^2\bigr).
\]

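Taking the displayed union bound as given, one may ask how large $\rho$ must be so that the right-hand side is at most a target confidence parameter $\delta$; rearranging gives $\rho \ge \sqrt{\log(2|\mathcal{H}_C|/\delta)/(2m)}$. The short sketch below evaluates this threshold for a few illustrative values of $m$ and $|\mathcal{H}_C|$ (arbitrary choices, continuing the toy threshold class where $|\mathcal{H}_C| = 2m+1$).

```python
# Rearranging the stated union bound: 2 |H_C| exp(-2 m rho^2) <= delta
# holds once rho >= sqrt(log(2 |H_C| / delta) / (2 m)).
import math

def rho_for_confidence(m, size_HC, delta):
    """Smallest rho making the stated union bound at most delta."""
    return math.sqrt(math.log(2 * size_HC / delta) / (2 * m))

# Illustrative values (assumed, not from the book):
for m in (100, 1000, 10000):
    print(m, rho_for_confidence(m, size_HC=2 * m + 1, delta=0.05))
```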
Finally, Lemma A.4 in Appendix A tells us that the preceding implies


\[
\mathbb{E}\left[\max_{h \in \mathcal{H}_C} |\theta_h|\right] \le \frac{4 + \sqrt{\log(|\mathcal{H}_C|)}}{\sqrt{2m}}.
\]

Combining all of the above with the definition of $\tau_{\mathcal{H}}$, we have shown that


\[
\mathop{\mathbb{E}}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \bigl|L_{\mathcal{D}}(h) - L_S(h)\bigr|\right] \le \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}}.
\]
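As a numerical illustration of the final bound (not part of the proof), the sketch below evaluates $(4 + \sqrt{\log \tau_{\mathcal{H}}(2m)})/\sqrt{2m}$ when the growth function is controlled via Sauer's lemma, $\tau_{\mathcal{H}}(2m) \le (2em/d)^d$ for a class of VC dimension $d$; the values of $m$ and $d$ are arbitrary choices for illustration.

```python
# Numerical illustration of the final bound, with tau_H(2m) bounded by Sauer's
# lemma for an assumed VC dimension d: tau_H(2m) <= (2*e*m/d)**d.
import math

def uniform_convergence_bound(m, d):
    """Evaluate (4 + sqrt(log tau)) / sqrt(2m) with tau = (2*e*m/d)**d."""
    log_tau = d * math.log(2 * math.e * m / d)
    return (4 + math.sqrt(log_tau)) / math.sqrt(2 * m)

# The bound decreases roughly like sqrt(d * log(m) / m):
for m in (100, 1000, 10000, 100000):
    print(m, round(uniform_convergence_bound(m, d=5), 4))
```

The printed values shrink toward zero as $m$ grows, which is the qualitative content of the theorem: the representative-sample error decays on the order of $\sqrt{d\log(m)/m}$ for classes of bounded VC dimension.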