Understanding Machine Learning: From Theory to Algorithms

Example 14.2 (A Subgradient of the Hinge Loss) Recall the hinge loss function from Section 12.3, $f(w) = \max\{0,\, 1 - y\langle w, x\rangle\}$ for some vector $x$ and scalar $y$. To calculate a subgradient of the hinge loss at some $w$ we rely on the preceding claim and obtain that the vector $v$ defined in the following is a subgradient of the hinge loss at $w$:

$$
v =
\begin{cases}
0 & \text{if } 1 - y\langle w, x\rangle \le 0 \\
-y\,x & \text{if } 1 - y\langle w, x\rangle > 0
\end{cases}
$$
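As a quick illustration, the following is a minimal NumPy sketch of this case split; the function name hinge_subgradient and the sample values of w, x, y are hypothetical, not taken from the text.

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of f(w) = max{0, 1 - y<w,x>} at w (Example 14.2)."""
    if 1 - y * np.dot(w, x) <= 0:
        return np.zeros_like(w)  # the loss is flat (zero) here, so 0 is a subgradient
    return -y * x                # otherwise f is differentiable with gradient -y*x

# Illustrative values: 1 - y<w,x> = 2.5 > 0, so the subgradient is -y*x = [-1, -2]
w, x, y = np.array([0.5, -1.0]), np.array([1.0, 2.0]), 1.0
print(hinge_subgradient(w, x, y))
```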

14.2.2 Subgradients of Lipschitz Functions


Recall that a function $f : A \to \mathbb{R}$ is $\rho$-Lipschitz if for all $u, v \in A$,

$$|f(u) - f(v)| \le \rho\, \|u - v\|.$$

The following lemma gives an equivalent definition using norms of subgradients.

Lemma 14.7 Let $A$ be a convex open set and let $f : A \to \mathbb{R}$ be a convex function. Then, $f$ is $\rho$-Lipschitz over $A$ iff for all $w \in A$ and $v \in \partial f(w)$ we have that $\|v\| \le \rho$.
Proof Assume that for all $v \in \partial f(w)$ we have that $\|v\| \le \rho$. Since $v \in \partial f(w)$ we have

$$f(w) - f(u) \le \langle v, w - u \rangle.$$

Bounding the right-hand side using the Cauchy-Schwarz inequality we obtain

$$f(w) - f(u) \le \langle v, w - u \rangle \le \|v\|\, \|w - u\| \le \rho\, \|w - u\|.$$

An analogous argument shows that $f(u) - f(w) \le \rho\, \|w - u\|$. Hence $f$ is $\rho$-Lipschitz.

Now assume that $f$ is $\rho$-Lipschitz. Choose some $w \in A$ and $v \in \partial f(w)$. Since $A$ is open, there exists $\epsilon > 0$ such that $u = w + \epsilon\, v / \|v\|$ belongs to $A$. Therefore, $\langle u - w, v \rangle = \epsilon \|v\|$ and $\|u - w\| = \epsilon$. From the definition of the subgradient,

$$f(u) - f(w) \ge \langle v, u - w \rangle = \epsilon \|v\|.$$

On the other hand, from the Lipschitzness of $f$ we have

$$\rho \epsilon = \rho\, \|u - w\| \ge f(u) - f(w).$$

Combining the two inequalities we conclude that $\|v\| \le \rho$.
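To see the lemma concretely, here is a small numerical sketch (assuming NumPy; the sampled points and the choice of x, y are illustrative): for the hinge loss $f(w) = \max\{0, 1 - y\langle w, x\rangle\}$, every subgradient has norm $0$ or $\|x\|$, so the lemma predicts that $f$ is $\|x\|$-Lipschitz, and an empirical estimate of the Lipschitz constant stays below that bound.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = np.array([3.0, -4.0]), 1.0                    # ||x|| = 5
f = lambda w: max(0.0, 1.0 - y * np.dot(w, x))

# Largest subgradient norm: ||-y*x|| = ||x|| = 5 (the other case gives norm 0).
sup_subgrad_norm = np.linalg.norm(-y * x)

# Empirical Lipschitz estimate over random pairs (u, w).
ws, us = rng.normal(size=(1000, 2)), rng.normal(size=(1000, 2))
ratios = [abs(f(u) - f(w)) / np.linalg.norm(u - w) for u, w in zip(us, ws)]
print(sup_subgrad_norm, max(ratios))                 # the ratio never exceeds 5
```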

14.2.3 Subgradient Descent


The gradient descent algorithm can be generalized to nondifferentiable functions by using a subgradient of $f(w)$ at $w^{(t)}$, instead of the gradient. The analysis of the convergence rate remains unchanged: simply note that Equation (14.3) is true for subgradients as well.
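A minimal sketch of the resulting update on the hinge loss (assuming NumPy; the step size eta, the horizon T, and returning the averaged iterate follow the generic gradient descent template of this chapter and are not quoted from any pseudocode here):

```python
import numpy as np

def subgradient_descent(x, y, eta=0.1, T=100):
    """Minimize f(w) = max{0, 1 - y<w,x>} using subgradient steps (a sketch)."""
    w = np.zeros_like(x)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())
        # Subgradient at the current iterate w^(t) (Example 14.2).
        v = -y * x if 1 - y * np.dot(w, x) > 0 else np.zeros_like(w)
        w = w - eta * v                # same update rule as gradient descent
    return np.mean(iterates, axis=0)   # averaged iterate, as in the chapter's analysis

x, y = np.array([1.0, 2.0]), 1.0
w_bar = subgradient_descent(x, y)
print(w_bar, max(0.0, 1.0 - y * np.dot(w_bar, x)))   # the loss should be small
```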