
Stochastic Gradient Descent


Figure 14.1 An illustration of the gradient descent algorithm. The function to be minimized is $1.25(x_1 + 6)^2 + (x_2 - 8)^2$.

14.1.1 Analysis of GD for Convex-Lipschitz Functions


To analyze the convergence rate of the GD algorithm, we limit ourselves to the case of convex-Lipschitz functions (as we have seen, many problems lend themselves easily to this setting). Let $w^\star$ be any vector and let $B$ be an upper bound on $\|w^\star\|$. It is convenient to think of $w^\star$ as the minimizer of $f(w)$, but the analysis that follows holds for every $w^\star$.
We would like to obtain an upper bound on the suboptimality of our solution with respect to $w^\star$, namely, $f(\bar{w}) - f(w^\star)$, where $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$. From the definition of $\bar{w}$, and using Jensen's inequality, we have that

\[
f(\bar{w}) - f(w^\star)
= f\!\left(\frac{1}{T}\sum_{t=1}^{T} w^{(t)}\right) - f(w^\star)
\le \frac{1}{T}\sum_{t=1}^{T} f(w^{(t)}) - f(w^\star)
= \frac{1}{T}\sum_{t=1}^{T} \left( f(w^{(t)}) - f(w^\star) \right). \tag{14.2}
\]
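The inequality step in (14.2) is Jensen's inequality applied to the convex function $f$ and the uniform average of the iterates; spelled out, for any vectors $v_1, \dots, v_T$,
\[
f\!\left(\frac{1}{T}\sum_{t=1}^{T} v_t\right) \le \frac{1}{T}\sum_{t=1}^{T} f(v_t),
\]
which, applied with $v_t = w^{(t)}$ and with $f(w^\star)$ subtracted from both sides, gives the chain above.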

For every $t$, because of the convexity of $f$, we have that
\[
f(w^{(t)}) - f(w^\star) \le \langle w^{(t)} - w^\star, \nabla f(w^{(t)}) \rangle. \tag{14.3}
\]
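As a concrete check of (14.3), not taken from the text, consider the quadratic from Figure 14.1, $f(w) = 1.25(w_1+6)^2 + (w_2-8)^2$, whose minimizer is $w^\star = (-6, 8)$ with $f(w^\star) = 0$ and $\nabla f(w) = \big(2.5(w_1+6),\; 2(w_2-8)\big)$. Then
\[
\langle w - w^\star, \nabla f(w) \rangle = 2.5(w_1+6)^2 + 2(w_2-8)^2 = 2 f(w) \ge f(w) - f(w^\star),
\]
so the linearization bound holds with room to spare for this example.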

Combining the preceding we obtain

\[
f(\bar{w}) - f(w^\star) \le \frac{1}{T}\sum_{t=1}^{T} \langle w^{(t)} - w^\star, \nabla f(w^{(t)}) \rangle.
\]
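As a numerical illustration (a minimal sketch, not part of the text), the following Python snippet runs plain GD with a fixed step size on the quadratic from Figure 14.1, forms the averaged output $\bar{w}$, and checks the bound above empirically. The step size, starting point, and iteration count are arbitrary choices made for this sketch; the update $w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)})$ and the averaged output follow the form of the GD algorithm described earlier in Section 14.1.

import numpy as np

# Objective from Figure 14.1: f(w) = 1.25*(w1 + 6)^2 + (w2 - 8)^2,
# minimized at w_star = (-6, 8), where f(w_star) = 0.
def f(w):
    return 1.25 * (w[0] + 6.0) ** 2 + (w[1] - 8.0) ** 2

def grad_f(w):
    return np.array([2.5 * (w[0] + 6.0), 2.0 * (w[1] - 8.0)])

w_star = np.array([-6.0, 8.0])
T, eta = 100, 0.1                    # iteration budget and step size (chosen arbitrarily)
w = np.zeros(2)                      # starting point for this sketch
iterates = []

for t in range(T):
    iterates.append(w.copy())
    w = w - eta * grad_f(w)          # GD update: w^(t+1) = w^(t) - eta * grad f(w^(t))

w_bar = np.mean(iterates, axis=0)    # averaged output: w_bar = (1/T) * sum_t w^(t)

lhs = f(w_bar) - f(w_star)           # suboptimality of the averaged iterate
rhs = np.mean([(wt - w_star) @ grad_f(wt) for wt in iterates])  # average <w^(t)-w*, grad f(w^(t))>
print(f"f(w_bar) - f(w*) = {lhs:.4f} <= {rhs:.4f}")             # the bound above holds

Returning the average of the iterates, rather than the last one, is what makes the Jensen step in (14.2) applicable.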

To bound the right-hand side we rely on the following lemma: