Understanding Machine Learning: From Theory to Algorithms




Then, for every u ∈ H,

‖w − u‖^2 − ‖v − u‖^2 ≥ 0.

Proof By the convexity of H, for every α ∈ (0,1) we have that v + α(u − v) ∈ H.
Therefore, from the optimality of v we obtain

‖v − w‖^2 ≤ ‖v + α(u − v) − w‖^2
= ‖v − w‖^2 + 2α〈v − w, u − v〉 + α^2‖u − v‖^2.

Rearranging, we obtain

2〈v − w, u − v〉 ≥ −α‖u − v‖^2.

Taking the limit α → 0 we get that

〈v − w, u − v〉 ≥ 0.

Therefore,

‖w − u‖^2 = ‖w − v + v − u‖^2
= ‖w − v‖^2 + ‖v − u‖^2 + 2〈v − w, u − v〉
≥ ‖v − u‖^2.
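To see the lemma concretely, here is a small numerical sketch (not from the text; the set, the dimension, and the function names are illustrative assumptions). It projects a point onto the closed unit Euclidean ball, for which the projection has the closed form v = w/max(1, ‖w‖), and checks the claimed inequality for a few points u in the ball.

import numpy as np

def project_to_unit_ball(w):
    # Euclidean projection onto the closed unit ball H = {x : ||x|| <= 1}.
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

rng = np.random.default_rng(0)
w = rng.normal(size=5) * 3.0              # a point, typically outside H
v = project_to_unit_ball(w)               # v = argmin_{x in H} ||x - w||

for _ in range(5):
    u = rng.normal(size=5)
    u = u / max(1.0, np.linalg.norm(u))   # an arbitrary point u in H
    # The lemma asserts ||w - u||^2 - ||v - u||^2 >= 0 for every u in H.
    assert np.linalg.norm(w - u) ** 2 - np.linalg.norm(v - u) ** 2 >= -1e-12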

Equipped with the preceding lemma, we can easily adapt the analysis of SGD
to the case in which we add projection steps on a closed and convex set. Simply
note that for every t,

‖w(t+1) − w⋆‖^2 − ‖w(t) − w⋆‖^2
= ‖w(t+1) − w⋆‖^2 − ‖w(t+1/2) − w⋆‖^2 + ‖w(t+1/2) − w⋆‖^2 − ‖w(t) − w⋆‖^2
≤ ‖w(t+1/2) − w⋆‖^2 − ‖w(t) − w⋆‖^2,

where the inequality follows from the preceding lemma, since w(t+1) is the projection of w(t+1/2) onto H and w⋆ ∈ H.
Therefore, Lemma 14.1 holds when we add projection steps and hence the rest
of the analysis follows directly.
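As a minimal sketch of the resulting procedure (the choice of H as a Euclidean ball of radius B, the function names, and the averaging of the iterates are assumptions made for illustration, not prescribed by the text), each iteration takes the usual stochastic gradient step to produce w(t+1/2) and then projects it back onto H:

import numpy as np

def projected_sgd(grad_oracle, dim, eta, T, B):
    # SGD with projection onto H = {w : ||w|| <= B}.
    # grad_oracle(w) returns a stochastic (sub)gradient at w.
    w = np.zeros(dim)
    iterates = []
    for t in range(T):
        v = grad_oracle(w)
        w_half = w - eta * v                              # gradient step: w(t+1/2)
        norm = np.linalg.norm(w_half)
        w = w_half if norm <= B else (B / norm) * w_half  # projection step: w(t+1)
        iterates.append(w)
    return np.mean(iterates, axis=0)                      # output choice (averaging) assumed here

By the lemma, the projection step can only decrease the distance to any point of H, which is exactly what the displayed inequality uses.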


14.4.2 Variable Step Size


Another variant of SGD is to decrease the step size as a function of t. That is,
rather than updating with a constant η, we use a step size η_t that depends on t. For instance, we can set
η_t = B/(ρ√t) and achieve a bound similar to Theorem 14.8. The idea is that when
we are closer to the minimum of the function, we take our steps more carefully,
so as not to “overshoot” the minimum.
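For illustration only (the gradient oracle, the dimension, and the parameter names below are assumptions, not taken from the text), the same SGD loop with the decreasing step size η_t = B/(ρ√t) looks as follows:

import numpy as np

def sgd_decreasing_step(grad_oracle, dim, B, rho, T):
    # SGD with step size eta_t = B / (rho * sqrt(t)) at iteration t = 1, ..., T.
    w = np.zeros(dim)
    iterates = []
    for t in range(1, T + 1):
        eta_t = B / (rho * np.sqrt(t))   # step size shrinks as t grows
        w = w - eta_t * grad_oracle(w)
        iterates.append(w)
    return np.mean(iterates, axis=0)     # return the averaged iterate (output choice assumed here)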