Understanding Machine Learning: From Theory to Algorithms

27.2 From Covering to Rademacher Complexity via Chaining 389

Next, we derive a contraction principle.

lemma27.3 For eachi ∈[m], letφi :R →Rbe aρ-Lipschitz function; namely, for allα,β∈Rwe have|φi(α)−φi(β)| ≤ρ|α−β|. Fora∈Rmlet φ(a)denote the vector(φ 1 (a 1 ),...,φm(am)). Letφ◦A={φ(a) :a∈A}. Then,

N(ρr,φ◦A)≤N(r,A).

Proof DefineB=φ◦A. LetA′be anr-cover ofAand defineB′=φ◦A′. Then, for alla∈Athere existsa′∈A′with‖a−a′‖≤r. So,

‖φ(a)−φ(a′)‖^2 =

∑

i

(φi(ai)−φi(a′i))^2 ≤ρ^2

∑

i

(ai−a′i)^2 ≤(ρr)^2.

Hence,B′is an (ρr)-cover ofB.

27.2 From Covering to Rademacher Complexity via Chaining

The following lemma bounds the Rademacher complexity ofAbased on the covering numbersN(r,A). This technique is calledChainingand is attributed to Dudley.

lemma27.4 Letc= mina ̄maxa∈A‖a− ̄a‖. Then, for any integerM > 0 ,

R(A)≤

c 2 −M √ m

+

6 c m

∑M

k=1

2 −k

√

log(N(c 2 −k,A)).

Proof Let ̄abe a minimizer of the objective function given in the definition ofc. On the basis of Lemma 26.6, we can analyze the Rademacher complexity assuming that ̄a= 0. Consider the setB 0 ={ 0 }and note that it is ac-cover ofA. LetB 1 ,...,BM be sets such that eachBk corresponds to a minimal (c 2 −k)-cover ofA. Let a∗= argmaxa∈A〈σ,a〉(where if there is more than one maximizer, choose one in an arbitrary way, and if a maximizer does not exist, choosea∗such that 〈σ,a∗〉is close enough to the supremum). Note thata∗is a function ofσ. For eachk, letb(k)be the nearest neighbor ofa∗inBk(henceb(k)is also a function ofσ). Using the triangle inequality,

‖b(k)−b(k−1)‖≤‖b(k)−a∗‖+‖a∗−b(k−1)‖≤c(2−k+ 2−(k−1)) = 3c 2 −k.

For eachkdefine the set

Bˆk={(a−a′) :a∈Bk,a′∈Bk− 1 ,‖a−a′‖≤ 3 c 2 −k}.

Understanding Machine Learning: From Theory to Algorithms

∑

∑

27.2 From Covering to Rademacher Complexity via Chaining

R(A)≤

+

∑M

√

Get our desktop app

Company

Features

Documentation

Resources