Understanding Machine Learning: From Theory to Algorithms

Rademacher Complexities


We shall consider the following general constraint-based formulation. Let $H = \{w : \|w\|_2 \le B\}$ be our hypothesis class, and let $Z = X \times Y$ be the examples domain. Assume that the loss function $\ell : H \times Z \to \mathbb{R}$ is of the form

$$\ell(w, (x, y)) = \phi(\langle w, x \rangle, y), \tag{26.18}$$

where $\phi : \mathbb{R} \times Y \to \mathbb{R}$ is such that for all $y \in Y$, the scalar function $a \mapsto \phi(a, y)$ is $\rho$-Lipschitz. For example, the hinge-loss function, $\ell(w, (x, y)) = \max\{0, 1 - y\langle w, x \rangle\}$, can be written as in Equation (26.18) using $\phi(a, y) = \max\{0, 1 - ya\}$, and note that $\phi$ is 1-Lipschitz for all $y \in \{\pm 1\}$. Another example is the absolute loss function, $\ell(w, (x, y)) = |\langle w, x \rangle - y|$, which can be written as in Equation (26.18) using $\phi(a, y) = |a - y|$, which is also 1-Lipschitz for all $y \in \mathbb{R}$.
The following theorem bounds the generalization error of all predictors in $H$ using their empirical error.

Theorem 26.12  Suppose that $D$ is a distribution over $X \times Y$ such that with probability 1 we have $\|x\|_2 \le R$. Let $H = \{w : \|w\|_2 \le B\}$ and let $\ell : H \times Z \to \mathbb{R}$ be a loss function of the form given in Equation (26.18) such that for all $y \in Y$, $a \mapsto \phi(a, y)$ is a $\rho$-Lipschitz function and such that $\max_{a \in [-BR, BR]} |\phi(a, y)| \le c$. Then, for any $\delta \in (0, 1)$, with probability of at least $1 - \delta$ over the choice of an i.i.d. sample of size $m$,

$$\forall w \in H, \quad L_D(w) \le L_S(w) + \frac{2 \rho B R}{\sqrt{m}} + c \sqrt{\frac{2 \ln(2/\delta)}{m}}.$$
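Before turning to the proof, it may help to see the rate numerically. The following sketch (with illustrative values of our own choosing, not from the text) evaluates the right-hand side for the hinge loss, for which $\rho = 1$ and one may take $c = 1 + BR$, since $|\phi(a, y)| \le 1 + |a| \le 1 + BR$ on $[-BR, BR]$:

    import numpy as np

    def bound_26_12(L_S, rho, B, R, c, m, delta):
        """Right-hand side of the bound in Theorem 26.12."""
        return (L_S + 2 * rho * B * R / np.sqrt(m)
                + c * np.sqrt(2 * np.log(2 / delta) / m))

    # Hinge loss: rho = 1 and |phi(a, y)| <= 1 + |a| <= 1 + B*R, so c = 1 + B*R.
    B, R, delta = 1.0, 1.0, 0.05
    for m in [100, 10_000, 1_000_000]:
        print(m, bound_26_12(L_S=0.1, rho=1.0, B=B, R=R, c=1.0 + B * R,
                             m=m, delta=delta))

Both estimation-error terms decay as $1/\sqrt{m}$, so halving them requires roughly four times as many examples.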

Proof  Let $F = \{(x, y) \mapsto \phi(\langle w, x \rangle, y) : w \in H\}$. We will show that with probability 1, $R(F \circ S) \le \rho B R / \sqrt{m}$, and then the theorem will follow from Theorem 26.5. Indeed, the set $F \circ S$ can be written as

$$F \circ S = \{(\phi(\langle w, x_1 \rangle, y_1), \ldots, \phi(\langle w, x_m \rangle, y_m)) : w \in H\},$$

and the bound on $R(F \circ S)$ follows directly by combining Lemma 26.9 (the contraction lemma), Lemma 26.10, and the assumption that $\|x\|_2 \le R$ with probability 1.
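The quantity at the heart of the proof is the Rademacher complexity of the linear class $\{x \mapsto \langle w, x \rangle : \|w\|_2 \le B\}$, which Lemma 26.10 bounds by $BR/\sqrt{m}$. Since $\sup_{\|w\|_2 \le B} \langle w, v \rangle = B \|v\|_2$ by Cauchy-Schwarz, this complexity equals $\frac{B}{m} \mathbb{E}_\sigma \|\sum_i \sigma_i x_i\|_2$ and can be estimated by Monte Carlo. A sketch with made-up data (not from the text):

    import numpy as np

    rng = np.random.default_rng(0)
    m, d, B = 200, 5, 1.0
    X = rng.normal(size=(m, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize so ||x_i|| = 1
    R = np.linalg.norm(X, axis=1).max()

    # R(H o S) = (B/m) E_sigma || sum_i sigma_i x_i ||_2, since
    # sup_{||w|| <= B} <w, v> = B ||v||_2 (Cauchy-Schwarz).
    sigma = rng.choice([-1.0, 1.0], size=(10_000, m))
    estimate = (B / m) * np.linalg.norm(sigma @ X, axis=1).mean()

    print("Monte Carlo estimate:", estimate)   # should not exceed the bound
    print("Lemma 26.10 bound:   ", B * R / np.sqrt(m))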

We next derive a generalization bound for hard-SVM based on the previous theorem. For simplicity, we do not allow a bias term and consider the hard-SVM problem:

$$\operatorname*{argmin}_{w} \|w\|^2 \quad \text{s.t.} \quad \forall i, \; y_i \langle w, x_i \rangle \ge 1. \tag{26.19}$$
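Equation (26.19) is a quadratic program with linear constraints. As a sketch of how $w_S$ might be computed in practice (an illustrative implementation, assuming linearly separable toy data and the third-party cvxpy solver, neither of which comes from the text):

    import cvxpy as cp
    import numpy as np

    def hard_svm(X, y):
        """Solve Equation (26.19): min ||w||^2 s.t. y_i <w, x_i> >= 1 for all i.
        No bias term; assumes the data is linearly separable."""
        w = cp.Variable(X.shape[1])
        constraints = [cp.multiply(y, X @ w) >= 1]
        cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
        return w.value

    # Toy data, separable by construction: y_i * x_i[0] >= 1 for every i.
    rng = np.random.default_rng(1)
    y = rng.choice([-1.0, 1.0], size=50)
    X = rng.normal(size=(50, 2))
    X[:, 0] = y * (1.0 + np.abs(X[:, 0]))
    w_S = hard_svm(X, y)
    print("w_S =", w_S, "  ||w_S|| =", np.linalg.norm(w_S))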

Theorem 26.13  Consider a distribution $D$ over $X \times \{\pm 1\}$ such that there exists some vector $w^\star$ with $\mathbb{P}_{(x,y) \sim D}[y \langle w^\star, x \rangle \ge 1] = 1$ and such that $\|x\|_2 \le R$ with probability 1. Let $w_S$ be the output of Equation (26.19). Then, with probability of at least $1 - \delta$ over the choice of $S \sim D^m$, we have that

$$\mathbb{P}_{(x,y) \sim D}\left[y \ne \operatorname{sign}(\langle w_S, x \rangle)\right] \le \frac{2 R \|w^\star\|}{\sqrt{m}} + \left(1 + R \|w^\star\|\right) \sqrt{\frac{2 \ln(2/\delta)}{m}}.$$
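Note that if $D$ is separable with margin $\gamma$ by a unit-norm vector $u$ (that is, $y \langle u, x \rangle \ge \gamma$ with probability 1), then $w^\star = u/\gamma$ satisfies the theorem's condition with $\|w^\star\| = 1/\gamma$, so the bound degrades as the margin shrinks. A short numeric sketch (illustrative values only, not from the text):

    import numpy as np

    def bound_26_13(R, w_star_norm, m, delta):
        """Right-hand side of the bound in Theorem 26.13."""
        return (2 * R * w_star_norm / np.sqrt(m)
                + (1 + R * w_star_norm) * np.sqrt(2 * np.log(2 / delta) / m))

    # Margin gamma with a unit-norm separator gives ||w*|| = 1/gamma.
    R, gamma, delta = 1.0, 0.1, 0.05
    for m in [1_000, 10_000, 100_000]:
        print(m, bound_26_13(R, 1.0 / gamma, m, delta))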