Understanding Machine Learning: From Theory to Algorithms




Then, for every u ∈ H,

‖w − u‖^2 − ‖v − u‖^2 ≥ 0.

Proof By the convexity of H, for every α ∈ (0,1) we have that v + α(u − v) ∈ H.
Therefore, from the optimality of v we obtain

‖v − w‖^2 ≤ ‖v + α(u − v) − w‖^2
= ‖v − w‖^2 + 2α〈v − w, u − v〉 + α^2‖u − v‖^2.

Rearranging, we obtain

2〈v − w, u − v〉 ≥ −α‖u − v‖^2.

Taking the limit α → 0 we get that

〈v − w, u − v〉 ≥ 0.

Therefore,

‖w − u‖^2 = ‖w − v + v − u‖^2
= ‖w − v‖^2 + ‖v − u‖^2 + 2〈v − w, u − v〉
≥ ‖v − u‖^2.
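To see the lemma concretely, here is a small numerical sketch (not from the text; the set, the dimension, and the function names are illustrative assumptions). It projects a point onto the closed unit Euclidean ball, for which the projection has the closed form v = w/max(1, ‖w‖), and checks the claimed inequality for a few points u in the ball.

import numpy as np

def project_to_unit_ball(w):
    # Euclidean projection onto the closed unit ball H = {x : ||x|| <= 1}.
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

rng = np.random.default_rng(0)
w = rng.normal(size=5) * 3.0              # a point, typically outside H
v = project_to_unit_ball(w)               # v = argmin_{x in H} ||x - w||

for _ in range(5):
    u = rng.normal(size=5)
    u = u / max(1.0, np.linalg.norm(u))   # an arbitrary point u in H
    # The lemma asserts ||w - u||^2 - ||v - u||^2 >= 0 for every u in H.
    assert np.linalg.norm(w - u) ** 2 - np.linalg.norm(v - u) ** 2 >= -1e-12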

Equipped with the preceding lemma, we can easily adapt the analysis of SGD
to the case in which we add projection steps on a closed and convex set. Simply
note that for every t,

‖w(t+1) − w⋆‖^2 − ‖w(t) − w⋆‖^2
= ‖w(t+1) − w⋆‖^2 − ‖w(t+1/2) − w⋆‖^2 + ‖w(t+1/2) − w⋆‖^2 − ‖w(t) − w⋆‖^2
≤ ‖w(t+1/2) − w⋆‖^2 − ‖w(t) − w⋆‖^2,

where the inequality follows from the preceding lemma, since w(t+1) is the projection of w(t+1/2) onto H and w⋆ ∈ H.
Therefore, Lemma 14.1 holds when we add projection steps and hence the rest
of the analysis follows directly.
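As a minimal sketch of the resulting procedure (the choice of H as a Euclidean ball of radius B, the function names, and the averaging of the iterates are assumptions made for illustration, not prescribed by the text), each iteration takes the usual stochastic gradient step to produce w(t+1/2) and then projects it back onto H:

import numpy as np

def projected_sgd(grad_oracle, dim, eta, T, B):
    # SGD with projection onto H = {w : ||w|| <= B}.
    # grad_oracle(w) returns a stochastic (sub)gradient at w.
    w = np.zeros(dim)
    iterates = []
    for t in range(T):
        v = grad_oracle(w)
        w_half = w - eta * v                              # gradient step: w(t+1/2)
        norm = np.linalg.norm(w_half)
        w = w_half if norm <= B else (B / norm) * w_half  # projection step: w(t+1)
        iterates.append(w)
    return np.mean(iterates, axis=0)                      # output choice (averaging) assumed here

By the lemma, the projection step can only decrease the distance to any point of H, which is exactly what the displayed inequality uses.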


14.4.2 Variable Step Size


Another variant of SGD is to decrease the step size as a function of t. That is,
rather than updating with a constant η, we use a step size η_t that depends on t. For instance, we can set
η_t = B/(ρ√t) and achieve a bound similar to Theorem 14.8. The idea is that when
we are closer to the minimum of the function, we take our steps more carefully,
so as not to “overshoot” the minimum.
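For illustration only (the gradient oracle, the dimension, and the parameter names below are assumptions, not taken from the text), the same SGD loop with the decreasing step size η_t = B/(ρ√t) looks as follows:

import numpy as np

def sgd_decreasing_step(grad_oracle, dim, B, rho, T):
    # SGD with step size eta_t = B / (rho * sqrt(t)) at iteration t = 1, ..., T.
    w = np.zeros(dim)
    iterates = []
    for t in range(1, T + 1):
        eta_t = B / (rho * np.sqrt(t))   # step size shrinks as t grows
        w = w - eta_t * grad_oracle(w)
        iterates.append(w)
    return np.mean(iterates, axis=0)     # return the averaged iterate (output choice assumed here)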