Understanding Machine Learning: From Theory to Algorithms

Stochastic Gradient Descent


$L_{\mathcal{D}}(\mathbf{w})$ with a number of iterations (i.e., number of examples)
$$T \;\ge\; \frac{B^2 \rho^2}{\epsilon^2}$$
and with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then the output of SGD satisfies
$$\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \;\le\; \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}(\mathbf{w}) + \epsilon.$$

It is interesting to note that the required sample complexity is of the same order
of magnitude as the sample complexity guarantee we derived for regularized loss
minimization. In fact, the sample complexity of SGD is even better than what
we have derived for regularized loss minimization by a factor of 8.
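
To make the role of these constants concrete, here is a minimal Python sketch (not from the text) of the averaged-output SGD variant discussed above, run with the iteration budget $T = \lceil B^2\rho^2/\epsilon^2 \rceil$ and step size $\eta = \sqrt{B^2/(\rho^2 T)}$; the sampling routine, the hinge-loss subgradient, and all numeric constants in the usage example are assumptions made purely for illustration.

import numpy as np

def sgd_averaged(sample_example, loss_subgradient, dim, B, rho, eps, seed=0):
    # Averaged-output SGD with the constants suggested above:
    # T >= B^2 * rho^2 / eps^2 iterations, fixed step size eta = sqrt(B^2 / (rho^2 * T)).
    rng = np.random.default_rng(seed)
    T = int(np.ceil(B**2 * rho**2 / eps**2))   # number of iterations = number of examples
    eta = np.sqrt(B**2 / (rho**2 * T))
    w = np.zeros(dim)                          # w^(1) = 0
    w_sum = np.zeros(dim)
    for _ in range(T):
        z = sample_example(rng)                # draw z_t from the underlying distribution
        v = loss_subgradient(w, z)             # v_t: a subgradient of ell(., z_t) at w^(t)
        w_sum += w
        w = w - eta * v                        # w^(t+1) = w^(t) - eta * v_t
    return w_sum / T                           # w_bar: average of the iterates

# Hypothetical usage: hinge loss on normalized features, so the subgradients satisfy ||v_t|| <= 1.
def sample_example(rng):
    x = rng.normal(size=3)
    x /= np.linalg.norm(x)
    y = 1.0 if x @ np.array([1.0, -2.0, 0.5]) > 0 else -1.0
    return (x, y)

def hinge_subgradient(w, z):
    x, y = z
    return -y * x if 1.0 - y * (w @ x) > 0 else np.zeros_like(x)

w_bar = sgd_averaged(sample_example, hinge_subgradient, dim=3, B=3.0, rho=1.0, eps=0.1)

Here $B$ plays the role of a bound on the norm of the competing vector and $\rho$ bounds the norms of the subgradients; the values chosen for this toy problem are nominal.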

14.5.2 Analyzing SGD for Convex-Smooth Learning Problems


In the previous chapter we saw that the regularized loss minimization rule also learns the class of convex-smooth-bounded learning problems. We now show that the SGD algorithm can also be used for such problems.

Theorem 14.13  Assume that for all $z$, the loss function $\ell(\cdot, z)$ is convex, $\beta$-smooth, and nonnegative. Then, if we run the SGD algorithm for minimizing $L_{\mathcal{D}}(\mathbf{w})$ we have that for every $\mathbf{w}^\star$,
$$\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \;\le\; \frac{1}{1 - \eta\beta} \left( L_{\mathcal{D}}(\mathbf{w}^\star) + \frac{\|\mathbf{w}^\star\|^2}{2\eta T} \right).$$
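
To see what this bound gives for a concrete step size (an illustration, not part of the theorem statement), take $\eta = 1/(2\beta)$, so that the leading factor is $1/(1-\eta\beta) = 2$ and the bound reads
$$\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \;\le\; 2\left( L_{\mathcal{D}}(\mathbf{w}^\star) + \frac{\beta \|\mathbf{w}^\star\|^2}{T} \right);$$
that is, the expected risk of $\bar{\mathbf{w}}$ is within a factor of 2 of $L_{\mathcal{D}}(\mathbf{w}^\star)$, up to an additive term that decays as $\beta\|\mathbf{w}^\star\|^2/T$.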

Proof  Recall that if a function is $\beta$-smooth and nonnegative then it is self-bounded:
$$\|\nabla f(\mathbf{w})\|^2 \;\le\; 2\beta f(\mathbf{w}).$$
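
As a quick numerical illustration of this inequality (a sketch under assumed choices, not part of the proof), the following Python snippet checks it for the squared loss $f(\mathbf{w}) = (\langle \mathbf{w}, \mathbf{x} \rangle - y)^2$, which is convex, nonnegative, and $\beta$-smooth with $\beta = 2\|\mathbf{x}\|^2$; the random test points are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=5)
    y = rng.normal()
    w = rng.normal(size=5)
    beta = 2.0 * np.dot(x, x)            # smoothness parameter of this squared loss
    f = (w @ x - y) ** 2                 # f(w) >= 0
    grad = 2.0 * (w @ x - y) * x         # gradient of f at w
    # self-boundedness: ||grad||^2 <= 2 * beta * f  (for the squared loss it holds with equality)
    assert np.dot(grad, grad) <= 2.0 * beta * f + 1e-9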

To analyze SGD for convex-smooth problems, let us define $z_1, \ldots, z_T$ to be the random samples of the SGD algorithm, let $f_t(\cdot) = \ell(\cdot, z_t)$, and note that $\mathbf{v}_t = \nabla f_t(\mathbf{w}^{(t)})$. For all $t$, $f_t$ is a convex function and therefore $f_t(\mathbf{w}^{(t)}) - f_t(\mathbf{w}^\star) \le \langle \mathbf{v}_t, \mathbf{w}^{(t)} - \mathbf{w}^\star \rangle$. Summing over $t$ and using Lemma 14.1 we obtain
$$\sum_{t=1}^{T} \left( f_t(\mathbf{w}^{(t)}) - f_t(\mathbf{w}^\star) \right) \;\le\; \sum_{t=1}^{T} \langle \mathbf{v}_t, \mathbf{w}^{(t)} - \mathbf{w}^\star \rangle \;\le\; \frac{\|\mathbf{w}^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2.$$

Combining the preceding with the self-boundedness of $f_t$, namely $\|\mathbf{v}_t\|^2 = \|\nabla f_t(\mathbf{w}^{(t)})\|^2 \le 2\beta f_t(\mathbf{w}^{(t)})$, yields
$$\sum_{t=1}^{T} \left( f_t(\mathbf{w}^{(t)}) - f_t(\mathbf{w}^\star) \right) \;\le\; \frac{\|\mathbf{w}^\star\|^2}{2\eta} + \eta\beta \sum_{t=1}^{T} f_t(\mathbf{w}^{(t)}).$$

Dividing by $T$ and rearranging, we obtain
$$\frac{1}{T} \sum_{t=1}^{T} f_t(\mathbf{w}^{(t)}) \;\le\; \frac{1}{1 - \eta\beta} \left( \frac{1}{T} \sum_{t=1}^{T} f_t(\mathbf{w}^\star) + \frac{\|\mathbf{w}^\star\|^2}{2\eta T} \right).$$

Next, we take expectation of the two sides of the preceding equation with respect to $z_1, \ldots, z_T$.