Understanding Machine Learning: From Theory to Algorithms

(Jeff_L) #1
27.2 From Covering to Rademacher Complexity via Chaining 389

Next, we derive a contraction principle.

lemma27.3 For eachi ∈[m], letφi :R →Rbe aρ-Lipschitz function;
namely, for allα,β∈Rwe have|φi(α)−φi(β)| ≤ρ|α−β|. Fora∈Rmlet
φ(a)denote the vector(φ 1 (a 1 ),...,φm(am)). Letφ◦A={φ(a) :a∈A}. Then,

N(ρr,φ◦A)≤N(r,A).

Proof DefineB=φ◦A. LetA′be anr-cover ofAand defineB′=φ◦A′.
Then, for alla∈Athere existsa′∈A′with‖a−a′‖≤r. So,

‖φ(a)−φ(a′)‖^2 =


i

(φi(ai)−φi(a′i))^2 ≤ρ^2


i

(ai−a′i)^2 ≤(ρr)^2.

Hence,B′is an (ρr)-cover ofB.

27.2 From Covering to Rademacher Complexity via Chaining


The following lemma bounds the Rademacher complexity ofAbased on the
covering numbersN(r,A). This technique is calledChainingand is attributed
to Dudley.

lemma27.4 Letc= mina ̄maxa∈A‖a− ̄a‖. Then, for any integerM > 0 ,

R(A)≤

c 2 −M

m

+

6 c
m

∑M

k=1

2 −k


log(N(c 2 −k,A)).

Proof Let ̄abe a minimizer of the objective function given in the definition
ofc. On the basis of Lemma 26.6, we can analyze the Rademacher complexity
assuming that ̄a= 0.
Consider the setB 0 ={ 0 }and note that it is ac-cover ofA. LetB 1 ,...,BM
be sets such that eachBk corresponds to a minimal (c 2 −k)-cover ofA. Let
a∗= argmaxa∈A〈σ,a〉(where if there is more than one maximizer, choose one
in an arbitrary way, and if a maximizer does not exist, choosea∗such that
〈σ,a∗〉is close enough to the supremum). Note thata∗is a function ofσ. For
eachk, letb(k)be the nearest neighbor ofa∗inBk(henceb(k)is also a function
ofσ). Using the triangle inequality,

‖b(k)−b(k−1)‖≤‖b(k)−a∗‖+‖a∗−b(k−1)‖≤c(2−k+ 2−(k−1)) = 3c 2 −k.

For eachkdefine the set

Bˆk={(a−a′) :a∈Bk,a′∈Bk− 1 ,‖a−a′‖≤ 3 c 2 −k}.
Free download pdf