Regularization and Stability


Denote $f_S(w) = L_S(w) + \lambda\|w\|^2$, and based on Lemma 13.5 we know that $f_S$ is $(2\lambda)$-strongly convex. Relying on part 3 of the lemma, it follows that for any $v$,

$$
f_S(v) - f_S(A(S)) \ge \lambda\,\|v - A(S)\|^2. \tag{13.7}
$$
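For completeness, here is a short reconstruction of the step behind Equation (13.7); the precise statement of Lemma 13.5 appears earlier in the chapter, and the sketch below assumes only the standard definition of strong convexity. $(2\lambda)$-strong convexity of $f_S$ means that for all $v$, $u$ and $\alpha \in (0,1)$,

$$
f_S(\alpha v + (1-\alpha)u) \le \alpha f_S(v) + (1-\alpha) f_S(u) - \lambda\,\alpha(1-\alpha)\,\|v-u\|^2 .
$$

Taking $u = A(S)$ and using that $A(S)$ minimizes $f_S$, the left-hand side is at least $f_S(A(S))$; rearranging gives $\alpha\,(f_S(v) - f_S(A(S))) \ge \lambda\,\alpha(1-\alpha)\,\|v - A(S)\|^2$, and dividing by $\alpha$ and letting $\alpha \to 0$ yields Equation (13.7).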

On the other hand, for any $v$ and $u$, and for all $i$, we have

$$
\begin{aligned}
f_S(v) - f_S(u) ={}& L_S(v) + \lambda\|v\|^2 - \left(L_S(u) + \lambda\|u\|^2\right) \\
={}& L_{S^{(i)}}(v) + \lambda\|v\|^2 - \left(L_{S^{(i)}}(u) + \lambda\|u\|^2\right) \\
& + \frac{\ell(v,z_i) - \ell(u,z_i)}{m} + \frac{\ell(u,z') - \ell(v,z')}{m}.
\end{aligned}
\tag{13.8}
$$
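The second equality is just bookkeeping: $S^{(i)}$ is $S$ with the $i$th example $z_i$ replaced by $z'$, so the two empirical risks differ in a single summand. Explicitly,

$$
L_S(w) = \frac{1}{m}\sum_{j=1}^{m} \ell(w, z_j) = L_{S^{(i)}}(w) + \frac{\ell(w, z_i) - \ell(w, z')}{m},
$$

and applying this identity to both $L_S(v)$ and $L_S(u)$ (the $\lambda\|\cdot\|^2$ terms are unaffected) gives the decomposition above.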
In particular, choosing $v = A(S^{(i)})$, $u = A(S)$, and using the fact that $v$ minimizes $L_{S^{(i)}}(w) + \lambda\|w\|^2$ (so the first difference on the right-hand side of Equation (13.8) is nonpositive), we obtain that

$$
f_S(A(S^{(i)})) - f_S(A(S)) \le \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m}. \tag{13.9}
$$
Combining this with Equation (13.7) we obtain that

$$
\lambda\,\|A(S^{(i)}) - A(S)\|^2 \le \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m}. \tag{13.10}
$$
The two subsections that follow continue the stability analysis, one for Lipschitz loss functions and one for smooth loss functions. For both families we show that RLM is stable and therefore does not overfit.

13.3.1 Lipschitz Loss


If the loss function, $\ell(\cdot, z_i)$, is $\rho$-Lipschitz, then by the definition of Lipschitzness,

$$
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \le \rho\,\|A(S^{(i)}) - A(S)\|. \tag{13.11}
$$

Similarly,

$$
\ell(A(S), z') - \ell(A(S^{(i)}), z') \le \rho\,\|A(S^{(i)}) - A(S)\|.
$$
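Both inequalities are instances of the same definition, recalled here for convenience (Lipschitzness is with respect to the first argument of the loss):

$$
|\ell(w_1, z) - \ell(w_2, z)| \le \rho\,\|w_1 - w_2\| \qquad \text{for all } w_1, w_2,
$$

applied once with $z = z_i$ and once with $z = z'$.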

Plugging these inequalities into Equation (13.10) we obtain

$$
\lambda\,\|A(S^{(i)}) - A(S)\|^2 \le \frac{2\rho\,\|A(S^{(i)}) - A(S)\|}{m},
$$

which yields

$$
\|A(S^{(i)}) - A(S)\| \le \frac{2\rho}{\lambda m}.
$$

Plugging the preceding back into Equation (13.11) we conclude that

$$
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \le \frac{2\rho^2}{\lambda m}.
$$

Since this holds for any $S$, $z'$, and $i$, we immediately obtain that the RLM rule with the regularizer $\lambda\|w\|^2$ is on-average-replace-one-stable with rate $\frac{2\rho^2}{\lambda m}$, and hence, by Theorem 13.2, its expected generalization gap is at most $\frac{2\rho^2}{\lambda m}$.
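As a sanity check, the bound can also be tested numerically. The following is a minimal sketch, not code from the book: it assumes the hinge loss (convex and $\rho$-Lipschitz with $\rho = 1$ once every $\|x\| \le 1$) and uses scipy's derivative-free Powell method as a crude stand-in for an exact RLM solver; the names `sample`, `f_S`, and `rlm` are chosen here for illustration.

```python
# Numerical sketch of the stability bound ||A(S^(i)) - A(S)|| <= 2*rho/(lam*m),
# under assumptions NOT taken from the book: hinge loss and an approximate solver.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, d, lam = 50, 5, 0.1   # sample size, dimension, regularization parameter
rho = 1.0                # Lipschitz constant of the hinge loss when ||x|| <= 1

def sample(n):
    """Draw n labeled examples with ||x|| = 1, so the hinge loss is 1-Lipschitz in w."""
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.sign(rng.normal(size=n))
    return X, y

def f_S(w, X, y):
    """RLM objective f_S(w) = L_S(w) + lam * ||w||^2 with hinge loss."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return hinge.mean() + lam * np.dot(w, w)

def rlm(X, y):
    """A(S): approximate minimizer of the RLM objective (derivative-free Powell)."""
    return minimize(f_S, np.zeros(d), args=(X, y), method="Powell").x

X, y = sample(m)
w_S = rlm(X, y)                      # A(S)

# Build S^(i): replace example i with a freshly drawn example z'.
i = 0
Xp, yp = X.copy(), y.copy()
Xp[i:i+1], yp[i:i+1] = sample(1)
w_Si = rlm(Xp, yp)                   # A(S^(i))

# The derivation predicts ||A(S^(i)) - A(S)|| <= 2*rho/(lam*m) = 0.4 here.
print(np.linalg.norm(w_Si - w_S), "<=", 2 * rho / (lam * m))
```

With $\lambda = 0.1$ and $m = 50$ the bound evaluates to $2\rho/(\lambda m) = 0.4$; the measured distance between the two minimizers should come out well below it, since the bound holds for every pair $(S, z')$ rather than only in expectation.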