Understanding Machine Learning: From Theory to Algorithms

14 Stochastic Gradient Descent


that $\mathcal{H} = \mathbb{R}^d$ and therefore the projection step does not matter) as follows:

\[
\begin{aligned}
w^{(t+1)} &= w^{(t)} - \frac{1}{\lambda t}\left(\lambda\, w^{(t)} + v_t\right)
          = \left(1 - \frac{1}{t}\right) w^{(t)} - \frac{1}{\lambda t}\, v_t \\
          &= \frac{t-1}{t}\, w^{(t)} - \frac{1}{\lambda t}\, v_t
          = \frac{t-1}{t}\left(\frac{t-2}{t-1}\, w^{(t-1)} - \frac{1}{\lambda (t-1)}\, v_{t-1}\right) - \frac{1}{\lambda t}\, v_t \\
          &= \cdots = -\frac{1}{\lambda t}\sum_{i=1}^{t} v_i. \qquad (14.15)
\end{aligned}
\]
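As a quick numerical sanity check of Equation (14.15), here is a minimal sketch; the dimension, value of $\lambda$, and the random vectors standing in for the subgradients $v_t$ are arbitrary choices for the test, not from the text:

```python
import numpy as np

# Sanity check of Equation (14.15): running the update
# w^(t+1) = w^(t) - (1/(lam*t)) * (lam*w^(t) + v_t)
# should agree with the closed form -(1/(lam*T)) * sum_i v_i.
rng = np.random.default_rng(0)
d, T, lam = 5, 50, 0.1
vs = rng.normal(size=(T, d))   # stand-ins for the subgradients v_1..v_T

# The first step multiplies w^(1) by (1 - 1/1) = 0, so the starting
# point drops out of the recursion entirely; start anywhere.
w = rng.normal(size=d)
for t in range(1, T + 1):
    w = w - (1.0 / (lam * t)) * (lam * w + vs[t - 1])

closed = -vs.sum(axis=0) / (lam * T)   # right-hand side of (14.15)
assert np.allclose(w, closed)          # agrees up to floating-point error
```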

If we assume that the loss function is $\rho$-Lipschitz, it follows that for all $t$ we have $\|v_t\| \le \rho$. By Equation (14.15), $\lambda\, w^{(t)}$ is (up to sign) an average of the subgradients $v_1,\dots,v_{t-1}$, so the triangle inequality gives $\|\lambda\, w^{(t)}\| \le \rho$ as well, which yields

\[
\|\lambda\, w^{(t)} + v_t\| \le 2\rho.
\]

Theorem 14.11 therefore tells us that after performing $T$ iterations we have that

\[
\mathbb{E}[f(\bar{w})] - f(w^\star) \le \frac{4\rho^2}{\lambda T}\,\bigl(1 + \log(T)\bigr).
\]
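To make the rate concrete, the bound can be inverted (an illustrative back-of-the-envelope step, not taken from the text): the guarantee $\mathbb{E}[f(\bar{w})] - f(w^\star) \le \epsilon$ holds as soon as

\[
\frac{4\rho^2}{\lambda T}\bigl(1 + \log(T)\bigr) \le \epsilon
\quad\Longleftrightarrow\quad
\frac{T}{1 + \log(T)} \ge \frac{4\rho^2}{\lambda\,\epsilon},
\]

so on the order of $\tilde{O}\bigl(\rho^2/(\lambda\epsilon)\bigr)$ iterations suffice, where $\tilde{O}$ hides the logarithmic factor.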

14.6 Summary


We have introduced the Gradient Descent and Stochastic Gradient Descent algo-
rithms, along with several of their variants. We have analyzed their convergence
rate and calculated the number of iterations that would guarantee an expected
objective of at most $\epsilon$ plus the optimal objective. Most importantly, we have
shown that by using SGD we can directly minimize the risk function. We do
so by sampling a point i.i.d. from $\mathcal{D}$ and using a subgradient of the loss of the
current hypothesis $w^{(t)}$ at this point as an unbiased estimate of the gradient (or
a subgradient) of the risk function. This implies that a bound on the number of
iterations also yields a sample complexity bound. Finally, we have also shown
how to apply the SGD method to the problem of regularized risk minimization.
In future chapters we show how this yields extremely simple solvers to some
optimization problems associated with regularized risk minimization.
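The following sketch ties these pieces together: SGD run directly on a regularized risk, drawing one fresh i.i.d. example per step and averaging the iterates. It assumes a synthetic data distribution and the hinge loss; none of the names or parameter values below come from the text.

```python
import numpy as np

# Illustrative SGD for regularized risk minimization:
#   f(w) = (lam/2)||w||^2 + E_{(x,y)~D}[ max(0, 1 - y<w, x>) ].
rng = np.random.default_rng(1)
d, T, lam = 10, 10_000, 0.1
w_hidden = rng.normal(size=d)   # hypothetical direction defining the labels

def sample():
    """Draw one i.i.d. example (x, y) from a synthetic distribution D."""
    x = rng.normal(size=d)
    return x, np.sign(x @ w_hidden)

w = np.zeros(d)
w_sum = np.zeros(d)
for t in range(1, T + 1):
    x, y = sample()
    # Subgradient of the hinge loss at the current w: an unbiased
    # estimate of a subgradient of the (unregularized) risk.
    v = -y * x if y * (w @ x) < 1 else np.zeros(d)
    # Update with step size 1/(lam * t) on the regularized objective.
    w = w - (1.0 / (lam * t)) * (lam * w + v)
    w_sum += w

w_bar = w_sum / T   # averaged iterate, as in the analysis above
```

Note that, by Equation (14.15), the iterate maintained here always equals $-\frac{1}{\lambda t}\sum_{i=1}^{t} v_i$, a scaled sum of the subgradients seen so far; this structure is part of what makes the solvers mentioned above so simple.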

14.7 Bibliographic Remarks


SGD dates back to Robbins & Monro (1951). It is especially effective in large
scale machine learning problems. See, for example, (Murata 1998, Le Cun 2004,
Zhang 2004, Bottou & Bousquet 2008, Shalev-Shwartz, Singer & Srebro 2007,
Shalev-Shwartz & Srebro 2008). In the optimization community it was studied