Understanding Machine Learning: From Theory to Algorithms


Rademacher Complexities


Furthermore, if $h^\star = \operatorname{argmin}_{h} L_{\mathcal{D}}(h)$, then for each $\delta \in (0,1)$, with probability of at least $1-\delta$ over the choice of $S$, we have
$$
L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star) \;\le\; \frac{2\,\mathbb{E}_{S'\sim\mathcal{D}^m}\, R(\ell \circ \mathcal{H} \circ S')}{\delta}.
$$

Proof The first inequality follows directly from Lemma 26.2. The second inequality follows because for any fixed $h^\star$,
$$
L_{\mathcal{D}}(h^\star) = \mathbb{E}_S[L_S(h^\star)] \;\ge\; \mathbb{E}_S[L_S(\mathrm{ERM}_{\mathcal{H}}(S))].
$$
The third inequality follows from the previous inequality by relying on Markov's inequality (note that the random variable $L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star)$ is nonnegative).
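The last step of the proof can be checked numerically. The sketch below is a minimal Monte Carlo illustration of Markov's inequality, which the proof applies to the nonnegative excess risk; here a generic nonnegative random variable (an exponential) stands in for it, and that choice of distribution is purely an illustrative assumption.

```python
import numpy as np

# Markov's inequality: for a nonnegative random variable X,
# P(X >= E[X] / delta) <= delta for every delta in (0, 1).
# We use an exponential variable as an arbitrary nonnegative stand-in.
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=100_000)  # nonnegative samples
mean = X.mean()

for delta in (0.5, 0.1, 0.01):
    threshold = mean / delta
    empirical = np.mean(X >= threshold)  # empirical tail probability
    print(f"delta={delta}: P(X >= E[X]/delta) ~ {empirical:.5f} <= {delta}")
```

For each value of `delta`, the empirical tail probability stays below `delta`, as Markov's inequality guarantees.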

Next, we derive bounds similar to the bounds in Theorem 26.3 with a better dependence on the confidence parameter $\delta$. To do so, we first introduce the following bounded differences concentration inequality.

Lemma 26.4 (McDiarmid's Inequality) Let $V$ be some set and let $f : V^m \to \mathbb{R}$ be a function of $m$ variables such that for some $c > 0$, for all $i \in [m]$ and for all $x_1,\ldots,x_m, x'_i \in V$ we have
$$
|f(x_1,\ldots,x_m) - f(x_1,\ldots,x_{i-1}, x'_i, x_{i+1},\ldots,x_m)| \;\le\; c.
$$
Let $X_1,\ldots,X_m$ be $m$ independent random variables taking values in $V$. Then, with probability of at least $1-\delta$ we have
$$
|f(X_1,\ldots,X_m) - \mathbb{E}[f(X_1,\ldots,X_m)]| \;\le\; c\,\sqrt{\frac{\ln(2/\delta)\, m}{2}}.
$$
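A minimal sketch of the lemma in action, for the simplest bounded-differences function: the empirical mean of $m$ variables in $[0,1]$. Changing one coordinate changes the mean by at most $c = 1/m$, so the lemma bounds the deviation by $(1/m)\sqrt{\ln(2/\delta)\,m/2} = \sqrt{\ln(2/\delta)/(2m)}$. The Bernoulli distribution and all numeric parameters below are illustrative assumptions.

```python
import numpy as np

# McDiarmid's inequality for f(x_1,...,x_m) = (1/m) * sum(x_i), x_i in [0,1].
# Bounded-differences constant: c = 1/m.
rng = np.random.default_rng(1)
m, trials, delta = 200, 20_000, 0.05
p = 0.3  # true mean of each Bernoulli coordinate

samples = rng.binomial(1, p, size=(trials, m))
deviations = np.abs(samples.mean(axis=1) - p)  # |f(X) - E[f(X)]| per trial

c = 1.0 / m
bound = c * np.sqrt(np.log(2 / delta) * m / 2)  # = sqrt(ln(2/delta)/(2m))

violation_rate = np.mean(deviations > bound)
print(f"bound = {bound:.4f}, violation rate = {violation_rate:.4f} (should be <= {delta})")
```

Across independent trials, the fraction of deviations exceeding the bound stays below $\delta$, consistent with the lemma.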

On the basis of the McDiarmid inequality we can derive generalization bounds
with a better dependence on the confidence parameter.

Theorem 26.5 Assume that for all $z$ and $h \in \mathcal{H}$ we have that $|\ell(h,z)| \le c$. Then,

1. With probability of at least $1-\delta$, for all $h \in \mathcal{H}$,
$$
L_{\mathcal{D}}(h) - L_S(h) \;\le\; 2\,\mathbb{E}_{S'\sim\mathcal{D}^m}\, R(\ell \circ \mathcal{H} \circ S') + c\,\sqrt{\frac{2\ln(2/\delta)}{m}}.
$$
In particular, this holds for $h = \mathrm{ERM}_{\mathcal{H}}(S)$.

2. With probability of at least $1-\delta$, for all $h \in \mathcal{H}$,
$$
L_{\mathcal{D}}(h) - L_S(h) \;\le\; 2\,R(\ell \circ \mathcal{H} \circ S) + 4c\,\sqrt{\frac{2\ln(4/\delta)}{m}}.
$$
In particular, this holds for $h = \mathrm{ERM}_{\mathcal{H}}(S)$.

3. For any $h^\star$, with probability of at least $1-\delta$,
$$
L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star) \;\le\; 2\,R(\ell \circ \mathcal{H} \circ S) + 5c\,\sqrt{\frac{2\ln(8/\delta)}{m}}.
$$
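The data-dependent quantity $R(\ell \circ \mathcal{H} \circ S)$ appearing in these bounds can be estimated by sampling Rademacher vectors. The sketch below does this for a finite hypothesis class represented simply as a matrix of losses on $S$, where entry $(j,i)$ plays the role of $\ell(h_j, z_i)$; the random loss matrix and all sizes are illustrative assumptions, not data from the text.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity
#   R(l o H o S) = E_sigma[ sup_{h in H} (1/m) * sum_i sigma_i * l(h, z_i) ]
# for a finite class: losses[j, i] stands for l(h_j, z_i).
rng = np.random.default_rng(2)
m, n_hyp, n_sigma = 50, 20, 5_000

losses = rng.uniform(0, 1, size=(n_hyp, m))  # losses bounded by c = 1

sigma = rng.choice([-1, 1], size=(n_sigma, m))  # Rademacher vectors
# For each sigma, the sup over hypotheses of the signed empirical average.
correlations = sigma @ losses.T / m             # shape (n_sigma, n_hyp)
rademacher = correlations.max(axis=1).mean()

print(f"estimated R(l o H o S) ~ {rademacher:.4f}")
```

The estimate can then be plugged into bound 2: with $c = 1$, the uniform gap $L_{\mathcal{D}}(h) - L_S(h)$ is at most `2 * rademacher + 4 * sqrt(2 * log(4/delta) / m)` with probability at least $1-\delta$.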