Understanding Machine Learning: From Theory to Algorithms


Rademacher Complexities


Furthermore, if $h^\star = \operatorname{argmin}_{h} L_{\mathcal{D}}(h)$, then for each $\delta \in (0,1)$, with probability of at least $1-\delta$ over the choice of $S$, we have
$$
L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star) \;\le\; \frac{2\,\mathbb{E}_{S'\sim\mathcal{D}^m}\, R(\ell \circ \mathcal{H} \circ S')}{\delta}.
$$

Proof The first inequality follows directly from Lemma 26.2. The second inequality follows because for any fixed $h^\star$,
$$
L_{\mathcal{D}}(h^\star) = \mathbb{E}_S[L_S(h^\star)] \;\ge\; \mathbb{E}_S[L_S(\mathrm{ERM}_{\mathcal{H}}(S))].
$$
The third inequality follows from the previous inequality by relying on Markov's inequality (note that the random variable $L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star)$ is nonnegative).
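The last step of the proof can be checked numerically. The sketch below is a minimal Monte Carlo illustration of Markov's inequality, which the proof applies to the nonnegative excess risk; here a generic nonnegative random variable (an exponential) stands in for it, and that choice of distribution is purely an illustrative assumption.

```python
import numpy as np

# Markov's inequality: for a nonnegative random variable X,
# P(X >= E[X] / delta) <= delta for every delta in (0, 1).
# We use an exponential variable as an arbitrary nonnegative stand-in.
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=100_000)  # nonnegative samples
mean = X.mean()

for delta in (0.5, 0.1, 0.01):
    threshold = mean / delta
    empirical = np.mean(X >= threshold)  # empirical tail probability
    print(f"delta={delta}: P(X >= E[X]/delta) ~ {empirical:.5f} <= {delta}")
```

For each value of `delta`, the empirical tail probability stays below `delta`, as Markov's inequality guarantees.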

Next, we derive bounds similar to the bounds in Theorem 26.3 with a better dependence on the confidence parameter $\delta$. To do so, we first introduce the following bounded differences concentration inequality.

Lemma 26.4 (McDiarmid's Inequality) Let $V$ be some set and let $f : V^m \to \mathbb{R}$ be a function of $m$ variables such that for some $c > 0$, for all $i \in [m]$ and for all $x_1,\ldots,x_m, x'_i \in V$ we have
$$
|f(x_1,\ldots,x_m) - f(x_1,\ldots,x_{i-1}, x'_i, x_{i+1},\ldots,x_m)| \;\le\; c.
$$
Let $X_1,\ldots,X_m$ be $m$ independent random variables taking values in $V$. Then, with probability of at least $1-\delta$ we have
$$
|f(X_1,\ldots,X_m) - \mathbb{E}[f(X_1,\ldots,X_m)]| \;\le\; c\,\sqrt{\frac{\ln(2/\delta)\, m}{2}}.
$$
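A minimal sketch of the lemma in action, for the simplest bounded-differences function: the empirical mean of $m$ variables in $[0,1]$. Changing one coordinate changes the mean by at most $c = 1/m$, so the lemma bounds the deviation by $(1/m)\sqrt{\ln(2/\delta)\,m/2} = \sqrt{\ln(2/\delta)/(2m)}$. The Bernoulli distribution and all numeric parameters below are illustrative assumptions.

```python
import numpy as np

# McDiarmid's inequality for f(x_1,...,x_m) = (1/m) * sum(x_i), x_i in [0,1].
# Bounded-differences constant: c = 1/m.
rng = np.random.default_rng(1)
m, trials, delta = 200, 20_000, 0.05
p = 0.3  # true mean of each Bernoulli coordinate

samples = rng.binomial(1, p, size=(trials, m))
deviations = np.abs(samples.mean(axis=1) - p)  # |f(X) - E[f(X)]| per trial

c = 1.0 / m
bound = c * np.sqrt(np.log(2 / delta) * m / 2)  # = sqrt(ln(2/delta)/(2m))

violation_rate = np.mean(deviations > bound)
print(f"bound = {bound:.4f}, violation rate = {violation_rate:.4f} (should be <= {delta})")
```

Across independent trials, the fraction of deviations exceeding the bound stays below $\delta$, consistent with the lemma.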

On the basis of the McDiarmid inequality we can derive generalization bounds
with a better dependence on the confidence parameter.

Theorem 26.5 Assume that for all $z$ and $h \in \mathcal{H}$ we have that $|\ell(h,z)| \le c$. Then,

1. With probability of at least $1-\delta$, for all $h \in \mathcal{H}$,
$$
L_{\mathcal{D}}(h) - L_S(h) \;\le\; 2\,\mathbb{E}_{S'\sim\mathcal{D}^m}\, R(\ell \circ \mathcal{H} \circ S') + c\,\sqrt{\frac{2\ln(2/\delta)}{m}}.
$$
In particular, this holds for $h = \mathrm{ERM}_{\mathcal{H}}(S)$.

2. With probability of at least $1-\delta$, for all $h \in \mathcal{H}$,
$$
L_{\mathcal{D}}(h) - L_S(h) \;\le\; 2\,R(\ell \circ \mathcal{H} \circ S) + 4c\,\sqrt{\frac{2\ln(4/\delta)}{m}}.
$$
In particular, this holds for $h = \mathrm{ERM}_{\mathcal{H}}(S)$.

3. For any $h^\star$, with probability of at least $1-\delta$,
$$
L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star) \;\le\; 2\,R(\ell \circ \mathcal{H} \circ S) + 5c\,\sqrt{\frac{2\ln(8/\delta)}{m}}.
$$
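The data-dependent quantity $R(\ell \circ \mathcal{H} \circ S)$ appearing in these bounds can be estimated by sampling Rademacher vectors. The sketch below does this for a finite hypothesis class represented simply as a matrix of losses on $S$, where entry $(j,i)$ plays the role of $\ell(h_j, z_i)$; the random loss matrix and all sizes are illustrative assumptions, not data from the text.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity
#   R(l o H o S) = E_sigma[ sup_{h in H} (1/m) * sum_i sigma_i * l(h, z_i) ]
# for a finite class: losses[j, i] stands for l(h_j, z_i).
rng = np.random.default_rng(2)
m, n_hyp, n_sigma = 50, 20, 5_000

losses = rng.uniform(0, 1, size=(n_hyp, m))  # losses bounded by c = 1

sigma = rng.choice([-1, 1], size=(n_sigma, m))  # Rademacher vectors
# For each sigma, the sup over hypotheses of the signed empirical average.
correlations = sigma @ losses.T / m             # shape (n_sigma, n_hyp)
rademacher = correlations.max(axis=1).mean()

print(f"estimated R(l o H o S) ~ {rademacher:.4f}")
```

The estimate can then be plugged into bound 2: with $c = 1$, the uniform gap $L_{\mathcal{D}}(h) - L_S(h)$ is at most `2 * rademacher + 4 * sqrt(2 * log(4/delta) / m)` with probability at least $1-\delta$.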