Understanding Machine Learning: From Theory to Algorithms

26.3 Generalization Bounds for SVM

Proof  Throughout the proof, let the loss function be the ramp loss (see Section 15.2.3). Note that the range of the ramp loss is $[0,1]$ and that it is a 1-Lipschitz function. Since the ramp loss upper bounds the zero-one loss, we have that
$$\mathbb{P}_{(x,y)\sim\mathcal{D}}\big[y \neq \operatorname{sign}(\langle \mathbf{w}_S,\mathbf{x}\rangle)\big] \;\le\; L_{\mathcal{D}}(\mathbf{w}_S).$$

Let $B=\|\mathbf{w}^\star\|_2$ and consider the set $\mathcal{H}=\{\mathbf{w} : \|\mathbf{w}\|_2 \le B\}$. By the definition of hard-SVM and our assumption on the distribution, we have that $\mathbf{w}_S \in \mathcal{H}$ with probability 1 and that $L_S(\mathbf{w}_S) = 0$. Therefore, using Theorem 26.12 we have that
$$L_{\mathcal{D}}(\mathbf{w}_S) \;\le\; L_S(\mathbf{w}_S) + \frac{2BR}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta)}{m}}.$$
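To make the rate concrete, here is a minimal Python sketch (not from the text) that evaluates the right-hand side of this bound; the function name and the numbers plugged in are illustrative assumptions only.

    import math

    def hard_svm_bound(R, norm_w_star, m, delta):
        # Right-hand side of the Theorem 26.13 bound, using L_S(w_S) = 0 for hard-SVM.
        return 2 * norm_w_star * R / math.sqrt(m) + math.sqrt(2 * math.log(2 / delta) / m)

    # Hypothetical values, chosen only to show the O(1/sqrt(m)) decay of the bound.
    for m in (100, 1000, 10000, 100000):
        print(m, hard_svm_bound(R=1.0, norm_w_star=5.0, m=m, delta=0.05))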

Remark 26.1  Theorem 26.13 implies that the sample complexity of hard-SVM grows like $R^2\|\mathbf{w}^\star\|_2^2/\epsilon^2$. Using a more delicate analysis and the separability assumption, it is possible to improve the bound to an order of $R^2\|\mathbf{w}^\star\|_2^2/\epsilon$.
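For concreteness, a short derivation of the $1/\epsilon^2$ rate in the remark, using only the bound of Theorem 26.13 (a sketch under the same assumptions): requiring each term on the right-hand side to be at most $\epsilon/2$ gives
$$\frac{2R\|\mathbf{w}^\star\|_2}{\sqrt{m}} \le \frac{\epsilon}{2} \iff m \ge \frac{16R^2\|\mathbf{w}^\star\|_2^2}{\epsilon^2},
\qquad
\sqrt{\frac{2\ln(2/\delta)}{m}} \le \frac{\epsilon}{2} \iff m \ge \frac{8\ln(2/\delta)}{\epsilon^2},$$
so $m = \Omega\big((R^2\|\mathbf{w}^\star\|_2^2 + \ln(1/\delta))/\epsilon^2\big)$ examples suffice for error at most $\epsilon$.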
The bound in the preceding theorem depends on $\|\mathbf{w}^\star\|$, which is unknown. In the following we derive a bound that depends on the norm of the output of SVM; hence it can be calculated from the training set itself. The proof is similar to the derivation of bounds for structural risk minimization (SRM).
Theorem 26.14  Assume that the conditions of Theorem 26.13 hold. Then, with probability of at least $1-\delta$ over the choice of $S \sim \mathcal{D}^m$, we have that
$$\mathbb{P}_{(x,y)\sim\mathcal{D}}\big[y \neq \operatorname{sign}(\langle \mathbf{w}_S,\mathbf{x}\rangle)\big] \;\le\; \frac{4R\|\mathbf{w}_S\|}{\sqrt{m}} + \sqrt{\frac{4\big(\ln(4\log_2(\|\mathbf{w}_S\|)) + \ln(1/\delta)\big)}{m}}.$$
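The norm $\|\mathbf{w}_S\|$ is available after training, so the right-hand side can be evaluated once $R$ and $\delta$ are fixed. Below is a minimal Python sketch (not from the text); the helper name is hypothetical, and $R$ is approximated by the largest training-point norm, an extra assumption beyond the theorem.

    import math
    import numpy as np

    def data_dependent_svm_bound(w_S, X, delta):
        # Assemble the right-hand side of the Theorem 26.14 bound from the sample.
        # R is estimated by the largest training-point norm, which only approximates
        # the distributional bound on ||x|| assumed by the theorem.
        m = X.shape[0]
        R = np.linalg.norm(X, axis=1).max()
        norm_w = np.linalg.norm(w_S)            # assumed > 1 so that log2(norm_w) > 0
        margin_term = 4 * R * norm_w / math.sqrt(m)
        conf_term = math.sqrt(4 * (math.log(4 * math.log2(norm_w)) + math.log(1 / delta)) / m)
        return margin_term + conf_term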

Proof  For any integer $i$, let $B_i = 2^i$, $\mathcal{H}_i = \{\mathbf{w} : \|\mathbf{w}\| \le B_i\}$, and let $\delta_i = \frac{\delta}{2i^2}$. Fix $i$, then using Theorem 26.12 we have that with probability of at least $1-\delta_i$,
$$\forall \mathbf{w} \in \mathcal{H}_i,\qquad L_{\mathcal{D}}(\mathbf{w}) \;\le\; L_S(\mathbf{w}) + \frac{2B_i R}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta_i)}{m}}.$$

Applying the union bound and using $\sum_{i=1}^{\infty}\delta_i \le \delta$, we obtain that with probability of at least $1-\delta$ this holds for all $i$. Therefore, for all $\mathbf{w}$, if we let $i = \lceil \log_2(\|\mathbf{w}\|) \rceil$ then $\mathbf{w} \in \mathcal{H}_i$, $B_i \le 2\|\mathbf{w}\|$, and $\frac{2}{\delta_i} = \frac{(2i)^2}{\delta} \le \frac{(4\log_2(\|\mathbf{w}\|))^2}{\delta}$. Therefore,

$$\begin{aligned}
L_{\mathcal{D}}(\mathbf{w}) &\le L_S(\mathbf{w}) + \frac{2B_i R}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta_i)}{m}} \\
&\le L_S(\mathbf{w}) + \frac{4\|\mathbf{w}\|R}{\sqrt{m}} + \sqrt{\frac{4\big(\ln(4\log_2(\|\mathbf{w}\|)) + \ln(1/\delta)\big)}{m}}.
\end{aligned}$$

In particular, it holds for $\mathbf{w}_S$; since $L_S(\mathbf{w}_S) = 0$ and the ramp loss upper bounds the zero-one loss, this yields the stated bound and concludes our proof.
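As a quick numerical sanity check (not part of the text), the following sketch verifies the two bookkeeping facts used above: the confidence weights $\delta_i = \delta/(2i^2)$ sum to at most $\delta$, and the choice $i = \lceil \log_2(\|\mathbf{w}\|) \rceil$ gives $B_i = 2^i \le 2\|\mathbf{w}\|$.

    import math

    delta = 0.05
    # sum_{i >= 1} 1/(2 i^2) = pi^2 / 12 < 1, so the union bound spends at most delta.
    print(sum(delta / (2 * i**2) for i in range(1, 100000)) <= delta)   # True

    # For a few norms >= 1, the bucket i = ceil(log2(norm)) satisfies 2^i <= 2 * norm.
    for norm_w in (1.0, 1.5, 7.3, 1024.0):
        i = math.ceil(math.log2(norm_w))
        print(norm_w, 2 ** i <= 2 * norm_w)                             # all True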
