
Stochastic Gradient Descent


Figure 14.1 An illustration of the gradient descent algorithm. The function to be minimized is $1.25(x_1 + 6)^2 + (x_2 - 8)^2$.

14.1.1 Analysis of GD for Convex-Lipschitz Functions


To analyze the convergence rate of the GD algorithm, we limit ourselves to the case of convex-Lipschitz functions (as we have seen, many problems lend themselves easily to this setting). Let $w^\star$ be any vector and let $B$ be an upper bound on $\|w^\star\|$. It is convenient to think of $w^\star$ as the minimizer of $f(w)$, but the analysis that follows holds for every $w^\star$.
We would like to obtain an upper bound on the suboptimality of our solution with respect to $w^\star$, namely, $f(\bar{w}) - f(w^\star)$, where $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$. From the definition of $\bar{w}$, and using Jensen's inequality, we have that

\[
f(\bar{w}) - f(w^\star)
= f\!\left(\frac{1}{T}\sum_{t=1}^{T} w^{(t)}\right) - f(w^\star)
\le \frac{1}{T}\sum_{t=1}^{T} f(w^{(t)}) - f(w^\star)
= \frac{1}{T}\sum_{t=1}^{T} \left( f(w^{(t)}) - f(w^\star) \right). \tag{14.2}
\]
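The inequality step in (14.2) is Jensen's inequality applied to the convex function $f$ and the uniform average of the iterates; spelled out, for any vectors $v_1, \dots, v_T$,
\[
f\!\left(\frac{1}{T}\sum_{t=1}^{T} v_t\right) \le \frac{1}{T}\sum_{t=1}^{T} f(v_t),
\]
which, applied with $v_t = w^{(t)}$ and with $f(w^\star)$ subtracted from both sides, gives the chain above.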

For every $t$, because of the convexity of $f$, we have that
\[
f(w^{(t)}) - f(w^\star) \le \langle w^{(t)} - w^\star, \nabla f(w^{(t)}) \rangle. \tag{14.3}
\]
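As a concrete check of (14.3), not taken from the text, consider the quadratic from Figure 14.1, $f(w) = 1.25(w_1+6)^2 + (w_2-8)^2$, whose minimizer is $w^\star = (-6, 8)$ with $f(w^\star) = 0$ and $\nabla f(w) = \big(2.5(w_1+6),\; 2(w_2-8)\big)$. Then
\[
\langle w - w^\star, \nabla f(w) \rangle = 2.5(w_1+6)^2 + 2(w_2-8)^2 = 2 f(w) \ge f(w) - f(w^\star),
\]
so the linearization bound holds with room to spare for this example.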

Combining the preceding we obtain

\[
f(\bar{w}) - f(w^\star) \le \frac{1}{T}\sum_{t=1}^{T} \langle w^{(t)} - w^\star, \nabla f(w^{(t)}) \rangle.
\]
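As a numerical illustration (a minimal sketch, not part of the text), the following Python snippet runs plain GD with a fixed step size on the quadratic from Figure 14.1, forms the averaged output $\bar{w}$, and checks the bound above empirically. The step size, starting point, and iteration count are arbitrary choices made for this sketch; the update $w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)})$ and the averaged output follow the form of the GD algorithm described earlier in Section 14.1.

import numpy as np

# Objective from Figure 14.1: f(w) = 1.25*(w1 + 6)^2 + (w2 - 8)^2,
# minimized at w_star = (-6, 8), where f(w_star) = 0.
def f(w):
    return 1.25 * (w[0] + 6.0) ** 2 + (w[1] - 8.0) ** 2

def grad_f(w):
    return np.array([2.5 * (w[0] + 6.0), 2.0 * (w[1] - 8.0)])

w_star = np.array([-6.0, 8.0])
T, eta = 100, 0.1                    # iteration budget and step size (chosen arbitrarily)
w = np.zeros(2)                      # starting point for this sketch
iterates = []

for t in range(T):
    iterates.append(w.copy())
    w = w - eta * grad_f(w)          # GD update: w^(t+1) = w^(t) - eta * grad f(w^(t))

w_bar = np.mean(iterates, axis=0)    # averaged output: w_bar = (1/T) * sum_t w^(t)

lhs = f(w_bar) - f(w_star)           # suboptimality of the averaged iterate
rhs = np.mean([(wt - w_star) @ grad_f(wt) for wt in iterates])  # average <w^(t)-w*, grad f(w^(t))>
print(f"f(w_bar) - f(w*) = {lhs:.4f} <= {rhs:.4f}")             # the bound above holds

Returning the average of the iterates, rather than the last one, is what makes the Jensen step in (14.2) applicable.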

To bound the right-hand side we rely on the following lemma: