Understanding Machine Learning: From Theory to Algorithms

Rademacher Complexities


We shall consider the following general constraint-based formulation. Let $H = \{w : \|w\|_2 \le B\}$ be our hypothesis class, and let $Z = X \times Y$ be the examples domain. Assume that the loss function $\ell : H \times Z \to \mathbb{R}$ is of the form

$$\ell(w, (x, y)) = \phi(\langle w, x \rangle, y), \tag{26.18}$$

where $\phi : \mathbb{R} \times Y \to \mathbb{R}$ is such that for all $y \in Y$, the scalar function $a \mapsto \phi(a, y)$ is $\rho$-Lipschitz. For example, the hinge-loss function, $\ell(w, (x, y)) = \max\{0, 1 - y\langle w, x \rangle\}$, can be written as in Equation (26.18) using $\phi(a, y) = \max\{0, 1 - ya\}$, and note that $\phi$ is 1-Lipschitz for all $y \in \{\pm 1\}$. Another example is the absolute loss function, $\ell(w, (x, y)) = |\langle w, x \rangle - y|$, which can be written as in Equation (26.18) using $\phi(a, y) = |a - y|$, which is also 1-Lipschitz for all $y \in \mathbb{R}$.
The following theorem bounds the generalization error of all predictors in $H$ using their empirical error.

Theorem 26.12  Suppose that $D$ is a distribution over $X \times Y$ such that with probability 1 we have $\|x\|_2 \le R$. Let $H = \{w : \|w\|_2 \le B\}$ and let $\ell : H \times Z \to \mathbb{R}$ be a loss function of the form given in Equation (26.18) such that for all $y \in Y$, $a \mapsto \phi(a, y)$ is a $\rho$-Lipschitz function and such that $\max_{a \in [-BR, BR]} |\phi(a, y)| \le c$. Then, for any $\delta \in (0, 1)$, with probability of at least $1 - \delta$ over the choice of an i.i.d. sample of size $m$,

$$\forall w \in H, \quad L_D(w) \le L_S(w) + \frac{2 \rho B R}{\sqrt{m}} + c \sqrt{\frac{2 \ln(2/\delta)}{m}}.$$
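Before turning to the proof, it may help to see the rate numerically. The following sketch (with illustrative values of our own choosing, not from the text) evaluates the right-hand side for the hinge loss, for which $\rho = 1$ and one may take $c = 1 + BR$, since $|\phi(a, y)| \le 1 + |a| \le 1 + BR$ on $[-BR, BR]$:

    import numpy as np

    def bound_26_12(L_S, rho, B, R, c, m, delta):
        """Right-hand side of the bound in Theorem 26.12."""
        return (L_S + 2 * rho * B * R / np.sqrt(m)
                + c * np.sqrt(2 * np.log(2 / delta) / m))

    # Hinge loss: rho = 1 and |phi(a, y)| <= 1 + |a| <= 1 + B*R, so c = 1 + B*R.
    B, R, delta = 1.0, 1.0, 0.05
    for m in [100, 10_000, 1_000_000]:
        print(m, bound_26_12(L_S=0.1, rho=1.0, B=B, R=R, c=1.0 + B * R,
                             m=m, delta=delta))

Both estimation-error terms decay as $1/\sqrt{m}$, so halving them requires roughly four times as many examples.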

Proof  Let $F = \{(x, y) \mapsto \phi(\langle w, x \rangle, y) : w \in H\}$. We will show that with probability 1, $R(F \circ S) \le \rho B R / \sqrt{m}$, and then the theorem will follow from Theorem 26.5. Indeed, the set $F \circ S$ can be written as

$$F \circ S = \{(\phi(\langle w, x_1 \rangle, y_1), \ldots, \phi(\langle w, x_m \rangle, y_m)) : w \in H\},$$

and the bound on $R(F \circ S)$ follows directly by combining Lemma 26.9 (the contraction lemma), Lemma 26.10, and the assumption that $\|x\|_2 \le R$ with probability 1.
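The quantity at the heart of the proof is the Rademacher complexity of the linear class $\{x \mapsto \langle w, x \rangle : \|w\|_2 \le B\}$, which Lemma 26.10 bounds by $BR/\sqrt{m}$. Since $\sup_{\|w\|_2 \le B} \langle w, v \rangle = B \|v\|_2$ by Cauchy-Schwarz, this complexity equals $\frac{B}{m} \mathbb{E}_\sigma \|\sum_i \sigma_i x_i\|_2$ and can be estimated by Monte Carlo. A sketch with made-up data (not from the text):

    import numpy as np

    rng = np.random.default_rng(0)
    m, d, B = 200, 5, 1.0
    X = rng.normal(size=(m, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize so ||x_i|| = 1
    R = np.linalg.norm(X, axis=1).max()

    # R(H o S) = (B/m) E_sigma || sum_i sigma_i x_i ||_2, since
    # sup_{||w|| <= B} <w, v> = B ||v||_2 (Cauchy-Schwarz).
    sigma = rng.choice([-1.0, 1.0], size=(10_000, m))
    estimate = (B / m) * np.linalg.norm(sigma @ X, axis=1).mean()

    print("Monte Carlo estimate:", estimate)   # should not exceed the bound
    print("Lemma 26.10 bound:   ", B * R / np.sqrt(m))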

We next derive a generalization bound for hard-SVM based on the previous theorem. For simplicity, we do not allow a bias term and consider the hard-SVM problem:

$$\operatorname*{argmin}_{w} \|w\|^2 \quad \text{s.t.} \quad \forall i, \; y_i \langle w, x_i \rangle \ge 1. \tag{26.19}$$
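Equation (26.19) is a quadratic program with linear constraints. As a sketch of how $w_S$ might be computed in practice (an illustrative implementation, assuming linearly separable toy data and the third-party cvxpy solver, neither of which comes from the text):

    import cvxpy as cp
    import numpy as np

    def hard_svm(X, y):
        """Solve Equation (26.19): min ||w||^2 s.t. y_i <w, x_i> >= 1 for all i.
        No bias term; assumes the data is linearly separable."""
        w = cp.Variable(X.shape[1])
        constraints = [cp.multiply(y, X @ w) >= 1]
        cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
        return w.value

    # Toy data, separable by construction: y_i * x_i[0] >= 1 for every i.
    rng = np.random.default_rng(1)
    y = rng.choice([-1.0, 1.0], size=50)
    X = rng.normal(size=(50, 2))
    X[:, 0] = y * (1.0 + np.abs(X[:, 0]))
    w_S = hard_svm(X, y)
    print("w_S =", w_S, "  ||w_S|| =", np.linalg.norm(w_S))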

Theorem 26.13  Consider a distribution $D$ over $X \times \{\pm 1\}$ such that there exists some vector $w^\star$ with $\mathbb{P}_{(x,y) \sim D}[y \langle w^\star, x \rangle \ge 1] = 1$ and such that $\|x\|_2 \le R$ with probability 1. Let $w_S$ be the output of Equation (26.19). Then, with probability of at least $1 - \delta$ over the choice of $S \sim D^m$, we have that

$$\mathbb{P}_{(x,y) \sim D}\left[y \ne \operatorname{sign}(\langle w_S, x \rangle)\right] \le \frac{2 R \|w^\star\|}{\sqrt{m}} + \left(1 + R \|w^\star\|\right) \sqrt{\frac{2 \ln(2/\delta)}{m}}.$$
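Note that if $D$ is separable with margin $\gamma$ by a unit-norm vector $u$ (that is, $y \langle u, x \rangle \ge \gamma$ with probability 1), then $w^\star = u/\gamma$ satisfies the theorem's condition with $\|w^\star\| = 1/\gamma$, so the bound degrades as the margin shrinks. A short numeric sketch (illustrative values only, not from the text):

    import numpy as np

    def bound_26_13(R, w_star_norm, m, delta):
        """Right-hand side of the bound in Theorem 26.13."""
        return (2 * R * w_star_norm / np.sqrt(m)
                + (1 + R * w_star_norm) * np.sqrt(2 * np.log(2 / delta) / m))

    # Margin gamma with a unit-norm separator gives ||w*|| = 1/gamma.
    R, gamma, delta = 1.0, 0.1, 0.05
    for m in [1_000, 10_000, 100_000]:
        print(m, bound_26_13(R, 1.0 / gamma, m, delta))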