Lemma 14.1 applies to the GD algorithm with v_t = ∇f(w^(t)). As we will show later in Lemma 14.7, if f is ρ-Lipschitz, then ‖∇f(w^(t))‖ ≤ ρ. We therefore satisfy the lemma's conditions and obtain the following corollary:

Corollary 14.2 Let f be a convex, ρ-Lipschitz function, and let w⋆ ∈ argmin_{w:‖w‖≤B} f(w).
If we run the GD algorithm on f for T steps with η = √(B²/(ρ²T)), then the output vector w̄ satisfies

f(w̄) − f(w⋆) ≤ Bρ/√T.
Furthermore, for every ε > 0, to achieve f(w̄) − f(w⋆) ≤ ε, it suffices to run the GD algorithm for a number of iterations that satisfies

T ≥ B²ρ²/ε².
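As a concrete illustration of the corollary, the sketch below runs GD with the fixed step size η = √(B²/(ρ²T)) and outputs the averaged iterate w̄. The objective f(w) = ‖w − a‖ (with a a hypothetical target vector chosen only for this example) is convex and 1-Lipschitz, so the bound applies with ρ = 1. This is a minimal sketch under those assumptions, not the book's pseudocode.

```python
import numpy as np

def gd_averaged(grad, dim, B, rho, T):
    """Run T GD steps with eta = sqrt(B^2 / (rho^2 * T)) and return the
    average of the iterates w^(1), ..., w^(T), as in Corollary 14.2 (sketch)."""
    eta = np.sqrt(B**2 / (rho**2 * T))
    w = np.zeros(dim)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())   # record w^(t) before the update
        w = w - eta * grad(w)
    return np.mean(iterates, axis=0)  # w_bar

# Example: f(w) = ||w - a||, convex and 1-Lipschitz, minimized at w* = a.
a = np.array([0.5, -0.3])  # hypothetical target with ||a|| <= B
grad = lambda w: (w - a) / max(np.linalg.norm(w - a), 1e-12)
w_bar = gd_averaged(grad, dim=2, B=1.0, rho=1.0, T=1000)
print(np.linalg.norm(w_bar - a))  # f(w_bar) - f(w*); should be at most B*rho/sqrt(T)
```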
14.2 Subgradients
The GD algorithm requires that the function f be differentiable. We now generalize the discussion beyond differentiable functions. We will show that the GD algorithm can be applied to nondifferentiable functions by using a so-called subgradient of f(w) at w^(t), instead of the gradient.
To motivate the definition of subgradients, recall that for a convex function f, the gradient at w defines the slope of a tangent that lies below f, that is,

∀u, f(u) ≥ f(w) + 〈u − w, ∇f(w)〉. (14.7)

An illustration is given on the left-hand side of Figure 14.2.
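For intuition, here is a quick numerical check of (14.7) for the convex, differentiable function f(w) = ‖w‖² with ∇f(w) = 2w (a choice of example made here, not taken from the text): the tangent value f(w) + 〈u − w, ∇f(w)〉 never exceeds f(u).

```python
import numpy as np

# Check the first-order condition (14.7) for f(w) = ||w||^2, grad f(w) = 2w.
f = lambda w: np.dot(w, w)
grad_f = lambda w: 2 * w

rng = np.random.default_rng(0)
w = rng.normal(size=3)
for _ in range(1000):
    u = rng.normal(size=3)
    tangent = f(w) + np.dot(u - w, grad_f(w))
    assert f(u) >= tangent - 1e-9  # the tangent plane lies below f everywhere
```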
The existence of a tangent that lies below f is an important property of convex functions, which is in fact an alternative characterization of convexity.

Lemma 14.3 Let S be an open convex set. A function f : S → ℝ is convex iff for every w ∈ S there exists v such that

∀u ∈ S, f(u) ≥ f(w) + 〈u − w, v〉. (14.8)

The proof of this lemma can be found in many convex analysis textbooks (e.g., Borwein & Lewis 2006). The preceding inequality leads us to the definition of subgradients.
Definition 14.4 (Subgradients) A vector v that satisfies Equation (14.8) is called a subgradient of f at w. The set of subgradients of f at w is called the differential set and denoted ∂f(w).

An illustration of subgradients is given on the right-hand side of Figure 14.2.
For scalar functions, a subgradient of a convex function f at w is the slope of a line that touches f at w and is not above f elsewhere.
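To make the scalar case concrete, consider f(x) = |x| (an example chosen here, not from the text): for x ≠ 0 the only subgradient is sign(x), while at x = 0 every slope in [−1, 1] defines a line that touches f at 0 and stays below it, so ∂f(0) = [−1, 1]. A minimal sketch:

```python
import numpy as np

def subgradient_abs(x):
    """Return one subgradient of f(x) = |x| at x.
    Any value in [-1, 1] is valid at x = 0; np.sign(0) == 0 is one such choice."""
    return np.sign(x)

# Verify the subgradient inequality (14.8): |u| >= |x| + v * (u - x) for all u.
x = 0.0
v = subgradient_abs(x)
for u in np.linspace(-2, 2, 401):
    assert abs(u) >= abs(x) + v * (u - x) - 1e-12
```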