Remark 26.2  Note that all the bounds we have derived do not depend on the dimension of $\mathbf{w}$. This property is utilized when learning SVM with kernels, where the dimension of $\mathbf{w}$ can be extremely large.
26.4 Generalization Bounds for Predictors with Low $\ell_1$ Norm
In the previous section we derived generalization bounds for linear predictors with an $\ell_2$-norm constraint. In this section we consider the following general $\ell_1$-norm constraint formulation. Let $\mathcal{H} = \{\mathbf{w} : \|\mathbf{w}\|_1 \le B\}$ be our hypothesis class, and let $Z = \mathcal{X} \times \mathcal{Y}$ be the examples domain. Assume that the loss function, $\ell : \mathcal{H} \times Z \to \mathbb{R}$, is of the same form as in Equation (26.18), with $\phi : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ being $\rho$-Lipschitz w.r.t. its first argument. The following theorem bounds the generalization error of all predictors in $\mathcal{H}$ using their empirical error.
Theorem 26.15  Suppose that $\mathcal{D}$ is a distribution over $\mathcal{X} \times \mathcal{Y}$ such that with probability 1 we have that $\|\mathbf{x}\|_\infty \le R$. Let $\mathcal{H} = \{\mathbf{w} \in \mathbb{R}^d : \|\mathbf{w}\|_1 \le B\}$ and let $\ell : \mathcal{H} \times Z \to \mathbb{R}$ be a loss function of the form given in Equation (26.18) such that for all $y \in \mathcal{Y}$, $a \mapsto \phi(a, y)$ is a $\rho$-Lipschitz function and such that $\max_{a \in [-BR, BR]} |\phi(a, y)| \le c$. Then, for any $\delta \in (0, 1)$, with probability of at least $1 - \delta$ over the choice of an i.i.d. sample of size $m$,
$$\forall \mathbf{w} \in \mathcal{H}, \quad L_{\mathcal{D}}(\mathbf{w}) \;\le\; L_S(\mathbf{w}) + 2\rho B R \sqrt{\frac{2\log(2d)}{m}} + c \sqrt{\frac{2\ln(2/\delta)}{m}}.$$
Proof  The proof is identical to the proof of Theorem 26.12, except that it relies on Lemma 26.11 instead of Lemma 26.10.
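To make the bound concrete, the following is a minimal numerical sketch (not from the text; the function name and parameter values are illustrative) that evaluates the right-hand side of the bound in Theorem 26.15. For the hinge loss, $\phi(a, y) = \max\{0, 1 - ya\}$, we have $\rho = 1$ and $\max_{a \in [-BR, BR]} |\phi(a, y)| \le 1 + BR$, which gives a valid choice of $c$.

```python
import math

def l1_generalization_bound(L_S, rho, B, R, c, d, m, delta):
    """Right-hand side of the bound in Theorem 26.15:
    L_S(w) + 2*rho*B*R*sqrt(2*log(2d)/m) + c*sqrt(2*ln(2/delta)/m).
    All arguments are assumed to satisfy the theorem's conditions."""
    complexity_term = 2 * rho * B * R * math.sqrt(2 * math.log(2 * d) / m)
    confidence_term = c * math.sqrt(2 * math.log(2 / delta) / m)
    return L_S + complexity_term + confidence_term

# Illustrative values: hinge loss (rho = 1), so c = 1 + B*R = 6 suffices.
print(l1_generalization_bound(L_S=0.1, rho=1.0, B=5.0, R=1.0,
                              c=6.0, d=10**6, m=10**4, delta=0.05))
```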
It is interesting to compare the two bounds given in Theorem 26.12 and Theorem 26.15. Apart from the extra $\log(d)$ factor that appears in Theorem 26.15, both bounds look similar. However, the parameters $B, R$ have different meanings in the two bounds. In Theorem 26.12, the parameter $B$ imposes an $\ell_2$ constraint on $\mathbf{w}$ and the parameter $R$ captures a low $\ell_2$-norm assumption on the instances. In contrast, in Theorem 26.15 the parameter $B$ imposes an $\ell_1$ constraint on $\mathbf{w}$ (which is stronger than an $\ell_2$ constraint) while the parameter $R$ captures a low $\ell_\infty$-norm assumption on the instances (which is weaker than a low $\ell_2$-norm assumption). Therefore, the choice of the constraint should depend on our prior knowledge of the set of instances and on prior assumptions on good predictors.
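As a rough illustration of this trade-off (a sketch under assumed parameters, not an example from the text): suppose the instances lie in $[-1,1]^d$, so that $\|\mathbf{x}\|_\infty \le 1$ while $\|\mathbf{x}\|_2$ can be as large as $\sqrt{d}$, and suppose a good predictor is $k$-sparse with entries in $[-1,1]$, so that $\|\mathbf{w}\|_1 \le k$ while $\|\mathbf{w}\|_2 \le \sqrt{k}$. Comparing the complexity terms of the two bounds:

```python
import math

# Hypothetical comparison of the complexity terms of Theorems 26.12 and 26.15
# for instances in [-1,1]^d and a k-sparse predictor with entries in [-1,1].
d, k, m, rho = 10**6, 20, 10**4, 1.0

# Theorem 26.12 (l2/l2): B = ||w||_2 <= sqrt(k), R = ||x||_2 <= sqrt(d),
# complexity term 2*rho*B*R/sqrt(m).
l2_term = 2 * rho * math.sqrt(k) * math.sqrt(d) / math.sqrt(m)

# Theorem 26.15 (l1/l_inf): B = ||w||_1 <= k, R = ||x||_inf <= 1.
l1_term = 2 * rho * k * 1.0 * math.sqrt(2 * math.log(2 * d) / m)

print(f"l2 bound term: {l2_term:.3f}")  # grows like sqrt(d)
print(f"l1 bound term: {l1_term:.3f}")  # only a log(d) dependence
```

In this sparse regime the $\ell_1$ bound is dramatically smaller, which is exactly the situation the closing sentence above describes: the right constraint depends on what we assume about the instances and about good predictors.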
26.5 Bibliographic Remarks
The use of Rademacher complexity for bounding uniform convergence is due to (Koltchinskii & Panchenko 2000, Bartlett & Mendelson 2001, Bartlett & Mendelson 2002). For additional reading see, for example, (Bousquet 2002, Boucheron, Bousquet & Lugosi 2005, Bartlett, Bousquet & Mendelson 2005).