Understanding Machine Learning: From Theory to Algorithms


Learning via Uniform Convergence


i.i.d. from D we have that for all h ∈ H, |L_S(h) − L_D(h)| ≤ ε. That is,

D^m({S : ∀h ∈ H, |L_S(h) − L_D(h)| ≤ ε}) ≥ 1 − δ.

Equivalently, we need to show that

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) < δ.

Writing

{S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε} = ∪_{h∈H} {S : |L_S(h) − L_D(h)| > ε},

and applying the union bound (Lemma 2.2) we obtain

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) ≤ ∑_{h∈H} D^m({S : |L_S(h) − L_D(h)| > ε}).    (4.1)
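Inequality (4.1) is just the union bound applied to the "bad" event for each hypothesis, so it can be checked numerically. Below is a minimal Monte Carlo sketch (not from the book): the distribution D, the finite class of threshold classifiers, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: x uniform on [0, 1], label y = 1[x > 0.5],
# finite class H of thresholds h_t(x) = 1[x > t].
thresholds = np.linspace(0.1, 0.9, 9)
true_risk = np.abs(thresholds - 0.5)  # L_D(h_t) under this D

m, eps, trials = 50, 0.15, 2000
lhs_events = 0                           # samples S with SOME violating h
rhs_counts = np.zeros(len(thresholds))   # per-h violation counts

for _ in range(trials):
    x = rng.random(m)
    y = (x > 0.5).astype(int)
    preds = x[None, :] > thresholds[:, None]   # (|H|, m) predictions
    emp_risk = (preds != y).mean(axis=1)       # L_S(h) for every h
    viol = np.abs(emp_risk - true_risk) > eps
    lhs_events += viol.any()
    rhs_counts += viol

lhs = lhs_events / trials       # estimate of D^m(exists violating h)
rhs = rhs_counts.sum() / trials  # sum over h of per-h probabilities
print(lhs <= rhs)  # the union bound (4.1): always True
```

The comparison holds on every run, not just in expectation, because the indicator of a union never exceeds the sum of the individual indicators.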
Our second step will be to argue that each summand of the right-hand side of this inequality is small enough (for a sufficiently large m). That is, we will show that for any fixed hypothesis h (which is chosen in advance, prior to the sampling of the training set), the gap between the true and empirical risks, |L_S(h) − L_D(h)|, is likely to be small.
Recall that L_D(h) = E_{z∼D}[ℓ(h, z)] and that L_S(h) = (1/m) ∑_{i=1}^m ℓ(h, z_i). Since each z_i is sampled i.i.d. from D, the expected value of the random variable ℓ(h, z_i) is L_D(h). By the linearity of expectation, it follows that L_D(h) is also the expected value of L_S(h). Hence, the quantity |L_D(h) − L_S(h)| is the deviation of the random variable L_S(h) from its expectation. We therefore need to show that the measure of L_S(h) is concentrated around its expected value.
A basic statistical fact, the law of large numbers, states that when m goes to infinity, empirical averages converge to their true expectation. This is true for L_S(h), since it is the empirical average of m i.i.d. random variables. However, since the law of large numbers is only an asymptotic result, it provides no information about the gap between the empirically estimated error and its true value for any given, finite, sample size.
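The convergence of L_S(h) to L_D(h) is easy to observe in simulation. In the short sketch below the losses ℓ(h, z_i) and their distribution are illustrative assumptions (Bernoulli with mean 0.3), not anything specified in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumption: losses ℓ(h, z_i) ~ Bernoulli(0.3),
# so the true risk is L_D(h) = 0.3 exactly.
mu = 0.3
devs = []
for m in (10, 100, 10_000):
    # 5000 independent samples S of size m; average the gap |L_S(h) - L_D(h)|
    losses = rng.random((5000, m)) < mu
    devs.append(float(np.abs(losses.mean(axis=1) - mu).mean()))
print(devs)  # the gap shrinks roughly like 1/sqrt(m)
```

The law of large numbers only promises the last entry tends to 0 as m → ∞; the 1/√m rate visible here is exactly the kind of finite-sample information that Hoeffding's inequality supplies.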
Instead, we will use a measure concentration inequality due to Hoeffding, which
quantifies the gap between empirical averages and their expected value.

lemma 4.5 (Hoeffding's Inequality)  Let θ_1, ..., θ_m be a sequence of i.i.d. random variables and assume that for all i, E[θ_i] = μ and P[a ≤ θ_i ≤ b] = 1. Then, for any ε > 0,

P[ |(1/m) ∑_{i=1}^m θ_i − μ| > ε ] ≤ 2 exp(−2mε²/(b−a)²).

The proof can be found in Appendix B.
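The bound of Lemma 4.5 can be checked against simulation. In the sketch below the choice θ_i ~ Uniform[a, b] and all parameter values are assumptions made purely for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

# Illustrative assumption: θ_i ~ Uniform[a, b], hence μ = (a + b) / 2.
a, b, m, eps, trials = 0.0, 1.0, 200, 0.1, 20_000
mu = (a + b) / 2

# Empirical probability that the sample mean deviates from μ by more than eps
means = rng.uniform(a, b, size=(trials, m)).mean(axis=1)
p_emp = float((np.abs(means - mu) > eps).mean())

# Hoeffding's bound: 2 exp(-2 m eps^2 / (b - a)^2)
bound = 2 * math.exp(-2 * m * eps**2 / (b - a) ** 2)
print(p_emp, "<=", bound)
```

For these parameters the bound is about 0.037, while the observed frequency is far smaller; Hoeffding's inequality is a worst-case guarantee over all distributions supported on [a, b], so some slack is expected.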
Getting back to our problem, let θ_i be the random variable ℓ(h, z_i). Since h is fixed and z_1, ..., z_m are sampled i.i.d., it follows that θ_1, ..., θ_m are also i.i.d. random variables. Furthermore, L_S(h) = (1/m) ∑_{i=1}^m θ_i and L_D(h) = μ. Let us