Understanding Machine Learning: From Theory to Algorithms

Since w^(t) only depends on v_{1:t−1} and SGD requires that E_{v_t}[v_t | w^(t)] ∈ ∂f(w^(t)), we obtain that E_{v_t}[v_t | v_{1:t−1}] ∈ ∂f(w^(t)). Thus,

\[
\mathbb{E}_{v_{1:t-1}}\big[\big\langle w^{(t)} - w^\star,\; \mathbb{E}_{v_t}[v_t \mid v_{1:t-1}]\big\rangle\big]
\;\ge\; \mathbb{E}_{v_{1:t-1}}\big[f(w^{(t)}) - f(w^\star)\big].
\]

Overall, we have shown that

\[
\mathbb{E}_{v_{1:T}}\big[\big\langle w^{(t)} - w^\star,\, v_t\big\rangle\big]
\;\ge\; \mathbb{E}_{v_{1:t-1}}\big[f(w^{(t)}) - f(w^\star)\big]
\;=\; \mathbb{E}_{v_{1:T}}\big[f(w^{(t)}) - f(w^\star)\big],
\]

where the equality holds because f(w^(t)) − f(w⋆) depends only on v_{1:t−1}, so taking the expectation over the remaining v_{t:T} leaves it unchanged. Summing over t, dividing by T, and using the linearity of expectation, we get that Equation (14.10) holds, which concludes our proof.
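For orientation, the argument just completed can be summarized by the following chain of inequalities; this is a sketch in the chapter's notation, assuming, as in the theorem being proved, that ‖w⋆‖ ≤ B, that ‖v_t‖ ≤ ρ with probability 1, and that the step size is η = B/(ρ√T). The first inequality is Jensen's inequality applied to the averaged output w̄, the second is the bound just derived, and the third follows from the gradient-descent analysis with this choice of η:

\[
\mathbb{E}[f(\bar{w})] - f(w^\star)
\;\le\; \mathbb{E}\Big[\tfrac{1}{T}\textstyle\sum_{t=1}^{T}\big(f(w^{(t)}) - f(w^\star)\big)\Big]
\;\le\; \mathbb{E}\Big[\tfrac{1}{T}\textstyle\sum_{t=1}^{T}\big\langle w^{(t)} - w^\star,\, v_t\big\rangle\Big]
\;\le\; \frac{B\rho}{\sqrt{T}}.
\]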

14.4 Variants


In this section we describe several variants of Stochastic Gradient Descent.

14.4.1 Adding a Projection Step


In the previous analyses of the GD and SGD algorithms, we required that the norm of w⋆ be at most B, which is equivalent to requiring that w⋆ is in the set H = {w : ‖w‖ ≤ B}. In terms of learning, this means restricting ourselves to a B-bounded hypothesis class. Yet any step we take in the opposite direction of the gradient (or its expected direction) might result in stepping out of this bound, and there is even no guarantee that w̄ satisfies it. We show in the following how to overcome this problem while maintaining the same convergence rate.
The basic idea is to add a projection step; namely, we will now have a two-step update rule, where we first subtract a subgradient from the current value of w and then project the resulting vector onto H. Formally,

1. $w^{(t+\frac{1}{2})} = w^{(t)} - \eta v_t$
2. $w^{(t+1)} = \operatorname{argmin}_{w \in H} \|w - w^{(t+\frac{1}{2})}\|$
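To make the two-step rule concrete, here is a minimal sketch of SGD with a projection step for the particular choice H = {w : ‖w‖ ≤ B}, where the projection has a closed form (rescale any vector that leaves the ball). The subgradient oracle, the step-size choice, and the function names are illustrative assumptions, not part of the text:

```python
import numpy as np

def project_onto_ball(w, B):
    """Euclidean projection onto H = {w : ||w|| <= B}: rescale w if it lies outside the ball."""
    norm = np.linalg.norm(w)
    return w if norm <= B else (B / norm) * w

def projected_sgd(subgradient_oracle, dim, B, rho, T, seed=0):
    """Sketch of SGD with a projection step.

    `subgradient_oracle(w, rng)` is an assumed callback that returns a random v_t
    with E[v_t | w] in the subdifferential of f at w and ||v_t|| <= rho.
    Returns the averaged iterate w_bar.
    """
    rng = np.random.default_rng(seed)
    eta = B / (rho * np.sqrt(T))            # step size used in the chapter's analysis
    w = np.zeros(dim)                        # start at w^(1) = 0
    iterates = []
    for _ in range(T):
        iterates.append(w)
        v = subgradient_oracle(w, rng)       # unbiased subgradient estimate v_t
        w_half = w - eta * v                 # step 1: w^(t+1/2) = w^(t) - eta * v_t
        w = project_onto_ball(w_half, B)     # step 2: project w^(t+1/2) back onto H
    return np.mean(iterates, axis=0)         # output the average w_bar of the iterates
```

Because H here is a ball, the argmin in the projection step reduces to a simple rescaling; for a general closed convex H one would substitute the appropriate projection routine.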

The projection step replaces the current value of w by the vector in H closest to it.
Clearly, the projection step guarantees that w^(t) ∈ H for all t. Since H is convex this also implies that w̄ ∈ H as required. We next show that the analysis of SGD with projections remains the same. This is based on the following lemma.

Lemma 14.9 (Projection Lemma) Let H be a closed convex set and let v be the projection of w onto H, namely,

\[
v = \operatorname{argmin}_{x \in H} \|x - w\|^2.
\]