384 Rademacher Complexities
We shall consider the following general constraint-based formulation. LetH=
{w:‖w‖ 2 ≤B}be our hypothesis class, and letZ=X ×Ybe the examples
domain. Assume that the loss function`:H×Z→Ris of the form
`(w,(x,y)) =φ(〈w,x〉,y), (26.18)
whereφ:R×Y →Ris such that for ally∈Y, the scalar functiona7→φ(a,y)
isρ-Lipschitz. For example, the hinge-loss function,`(w,(x,y)) = max{ 0 , 1 −
y〈w,x〉}, can be written as in Equation (26.18) usingφ(a,y) = max{ 0 , 1 −
ya}, and note thatφis 1-Lipschitz for ally∈ {± 1 }. Another example is the
absolute loss function,`(w,(x,y)) =|〈w,x〉−y|, which can be written as in
Equation (26.18) usingφ(a,y) =|a−y|, which is also 1-Lipschitz for ally∈R.
The following theorem bounds the generalization error of all predictors inH
using their empirical error.
theorem26.12 Suppose thatDis a distribution overX ×Ysuch that with
probability 1 we have that‖x‖ 2 ≤ R. LetH = {w : ‖w‖ 2 ≤B}and let
` : H ×Z → Rbe a loss function of the form given in Equation (26.18)
such that for ally∈ Y,a7→φ(a,y)is aρ-Lipschitz function and such that
maxa∈[−BR,BR]|φ(a,y)|≤c. Then, for anyδ∈(0,1), with probability of at least
1 −δover the choice of an i.i.d. sample of sizem,
∀w∈H, LD(w)≤LS(w) +^2 ρBR√
m
+c
√
2 ln(2/δ)
m
.
Proof LetF = {(x,y)7→ φ(〈w,x〉,y) :w ∈ H}. We will show that with
probability 1,R(F◦S)≤ρBR/
√
mand then the theorem will follow from
Theorem 26.5. Indeed, the setF◦Scan be written as
F◦S={(φ(〈w,x 1 〉,y 1 ),...,φ(〈w,xm〉,ym)) :w∈H},
and the bound onR(F◦S) follows directly by combining Lemma 26.9, Lemma 26.10,
and the assumption that‖x‖ 2 ≤Rwith probability 1.
We next derive a generalization bound for hard-SVM based on the previous
theorem. For simplicity, we do not allow a bias term and consider the hard-SVM
problem:
argmin
w
‖w‖^2 s.t. ∀i, yi〈w,xi〉≥ 1 (26.19)
theorem26.13 Consider a distributionDoverX×{± 1 }such that there exists
some vectorw?withP(x,y)∼D[y〈w?,x〉 ≥1] = 1and such that‖x‖ 2 ≤Rwith
probability 1. LetwSbe the output of Equation (26.19). Then, with probability
of at least 1 −δover the choice ofS∼Dm, we have that
P
(x,y)∼D
[y 6 =sign(〈wS,x〉)]≤
2 R‖w?‖
√
m
+ (1 +R‖w?‖)
√
2 ln(2/δ)
m