Understanding Machine Learning: From Theory to Algorithms

Example 14.2 (A Subgradient of the Hinge Loss) Recall the hinge loss function from Section 12.3, $f(w) = \max\{0,\, 1 - y\langle w, x\rangle\}$ for some vector $x$ and scalar $y$. To calculate a subgradient of the hinge loss at some $w$ we rely on the preceding claim and obtain that the vector $v$ defined in the following is a subgradient of the hinge loss at $w$:

$$
v =
\begin{cases}
0 & \text{if } 1 - y\langle w, x\rangle \le 0 \\
-y\,x & \text{if } 1 - y\langle w, x\rangle > 0
\end{cases}
$$
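As a quick illustration, the following is a minimal NumPy sketch of this case split; the function name hinge_subgradient and the sample values of w, x, y are hypothetical, not taken from the text.

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of f(w) = max{0, 1 - y<w,x>} at w (Example 14.2)."""
    if 1 - y * np.dot(w, x) <= 0:
        return np.zeros_like(w)  # the loss is flat (zero) here, so 0 is a subgradient
    return -y * x                # otherwise f is differentiable with gradient -y*x

# Illustrative values: 1 - y<w,x> = 2.5 > 0, so the subgradient is -y*x = [-1, -2]
w, x, y = np.array([0.5, -1.0]), np.array([1.0, 2.0]), 1.0
print(hinge_subgradient(w, x, y))
```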

14.2.2 Subgradients of Lipschitz Functions


Recall that a function $f : A \to \mathbb{R}$ is $\rho$-Lipschitz if for all $u, v \in A$,

$$|f(u) - f(v)| \le \rho\, \|u - v\|.$$

The following lemma gives an equivalent definition using norms of subgradients.

Lemma 14.7 Let $A$ be a convex open set and let $f : A \to \mathbb{R}$ be a convex function. Then, $f$ is $\rho$-Lipschitz over $A$ iff for all $w \in A$ and $v \in \partial f(w)$ we have that $\|v\| \le \rho$.
Proof Assume that for all $v \in \partial f(w)$ we have that $\|v\| \le \rho$. Since $v \in \partial f(w)$ we have

$$f(w) - f(u) \le \langle v, w - u \rangle.$$

Bounding the right-hand side using the Cauchy-Schwarz inequality we obtain

$$f(w) - f(u) \le \langle v, w - u \rangle \le \|v\|\, \|w - u\| \le \rho\, \|w - u\|.$$

An analogous argument shows that $f(u) - f(w) \le \rho\, \|w - u\|$. Hence $f$ is $\rho$-Lipschitz.

Now assume that $f$ is $\rho$-Lipschitz. Choose some $w \in A$ and $v \in \partial f(w)$. Since $A$ is open, there exists $\epsilon > 0$ such that $u = w + \epsilon\, v / \|v\|$ belongs to $A$. Therefore, $\langle u - w, v \rangle = \epsilon \|v\|$ and $\|u - w\| = \epsilon$. From the definition of the subgradient,

$$f(u) - f(w) \ge \langle v, u - w \rangle = \epsilon \|v\|.$$

On the other hand, from the Lipschitzness of $f$ we have

$$\rho \epsilon = \rho\, \|u - w\| \ge f(u) - f(w).$$

Combining the two inequalities we conclude that $\|v\| \le \rho$.
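To see the lemma concretely, here is a small numerical sketch (assuming NumPy; the sampled points and the choice of x, y are illustrative): for the hinge loss $f(w) = \max\{0, 1 - y\langle w, x\rangle\}$, every subgradient has norm $0$ or $\|x\|$, so the lemma predicts that $f$ is $\|x\|$-Lipschitz, and an empirical estimate of the Lipschitz constant stays below that bound.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = np.array([3.0, -4.0]), 1.0                    # ||x|| = 5
f = lambda w: max(0.0, 1.0 - y * np.dot(w, x))

# Largest subgradient norm: ||-y*x|| = ||x|| = 5 (the other case gives norm 0).
sup_subgrad_norm = np.linalg.norm(-y * x)

# Empirical Lipschitz estimate over random pairs (u, w).
ws, us = rng.normal(size=(1000, 2)), rng.normal(size=(1000, 2))
ratios = [abs(f(u) - f(w)) / np.linalg.norm(u - w) for u, w in zip(us, ws)]
print(sup_subgrad_norm, max(ratios))                 # the ratio never exceeds 5
```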

14.2.3 Subgradient Descent


The gradient descent algorithm can be generalized to nondifferentiable functions by using a subgradient of $f(w)$ at $w^{(t)}$, instead of the gradient. The analysis of the convergence rate remains unchanged: simply note that Equation (14.3) is true for subgradients as well.
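A minimal sketch of the resulting update on the hinge loss (assuming NumPy; the step size eta, the horizon T, and returning the averaged iterate follow the generic gradient descent template of this chapter and are not quoted from any pseudocode here):

```python
import numpy as np

def subgradient_descent(x, y, eta=0.1, T=100):
    """Minimize f(w) = max{0, 1 - y<w,x>} using subgradient steps (a sketch)."""
    w = np.zeros_like(x)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())
        # Subgradient at the current iterate w^(t) (Example 14.2).
        v = -y * x if 1 - y * np.dot(w, x) > 0 else np.zeros_like(w)
        w = w - eta * v                # same update rule as gradient descent
    return np.mean(iterates, axis=0)   # averaged iterate, as in the chapter's analysis

x, y = np.array([1.0, 2.0]), 1.0
w_bar = subgradient_descent(x, y)
print(w_bar, max(0.0, 1.0 - y * np.dot(w_bar, x)))   # the loss should be small
```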