subgradient of $f$ at $\mathbf{w}^{(t)}$, we can still derive a similar bound on the expected output of stochastic gradient descent. This is formalized in the following theorem.
Theorem 14.8  Let $B,\rho > 0$. Let $f$ be a convex function and let $\mathbf{w}^\star \in \operatorname{argmin}_{\mathbf{w}:\|\mathbf{w}\|\le B} f(\mathbf{w})$. Assume that SGD is run for $T$ iterations with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$. Assume also that for all $t$, $\|\mathbf{v}_t\| \le \rho$ with probability 1. Then,
\[
\mathbb{E}[f(\bar{\mathbf{w}})] - f(\mathbf{w}^\star) \;\le\; \frac{B\rho}{\sqrt{T}} .
\]
Therefore, for any $\epsilon > 0$, to achieve $\mathbb{E}[f(\bar{\mathbf{w}})] - f(\mathbf{w}^\star) \le \epsilon$, it suffices to run the SGD algorithm for a number of iterations that satisfies
\[
T \;\ge\; \frac{B^2\rho^2}{\epsilon^2} .
\]
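To make the theorem's parameter choices concrete, here is a minimal Python sketch of SGD with averaging (an illustration under the theorem's assumptions, not the book's pseudocode). The oracle `subgradient_oracle`, the dimension `d`, and the constants `B`, `rho`, `epsilon` are hypothetical placeholders; the sketch only shows the step size $\eta = \sqrt{B^2/(\rho^2 T)}$, an iteration count with $T \ge B^2\rho^2/\epsilon^2$, and the averaged output $\bar{\mathbf{w}}$.

```python
import numpy as np

def sgd_averaged(subgradient_oracle, d, B, rho, epsilon):
    """Run SGD with the step size of Theorem 14.8 and return the averaged iterate.

    subgradient_oracle(w) should return a random vector v_t whose conditional
    expectation is a subgradient of f at w, with ||v_t|| <= rho.
    """
    # Iteration count suggested by the theorem: T >= B^2 * rho^2 / epsilon^2.
    T = int(np.ceil(B ** 2 * rho ** 2 / epsilon ** 2))
    eta = np.sqrt(B ** 2 / (rho ** 2 * T))   # eta = sqrt(B^2 / (rho^2 * T))

    w = np.zeros(d)        # w^(1) = 0
    w_sum = np.zeros(d)    # accumulates w^(1) + ... + w^(T)
    for _ in range(T):
        w_sum += w
        v = subgradient_oracle(w)   # stochastic subgradient v_t at the current iterate
        w = w - eta * v             # update w^(t+1) = w^(t) - eta * v_t
    return w_sum / T               # averaged output w_bar
```

For example, with $B=\rho=1$ and $\epsilon=0.1$, this sketch runs $T=100$ iterations, matching the bound above.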
Proof  Let us introduce the notation $\mathbf{v}_{1:t}$ to denote the sequence $\mathbf{v}_1,\ldots,\mathbf{v}_t$. Taking expectation of Equation (14.2), we obtain
\[
\mathbb{E}_{\mathbf{v}_{1:T}}\bigl[f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)\bigr]
\;\le\;
\mathbb{E}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\bigl(f(\mathbf{w}^{(t)}) - f(\mathbf{w}^\star)\bigr)\right].
\]
Since Lemma 14.1 holds for any sequence $\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_T$, it applies to SGD as well. By taking expectation of the bound in the lemma we have
\[
\mathbb{E}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t\rangle\right]
\;\le\; \frac{B\rho}{\sqrt{T}} . \tag{14.9}
\]
It is left to show that
\[
\mathbb{E}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\bigl(f(\mathbf{w}^{(t)}) - f(\mathbf{w}^\star)\bigr)\right]
\;\le\;
\mathbb{E}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t\rangle\right], \tag{14.10}
\]
which we will hereby prove.
Using the linearity of the expectation we have
\[
\mathbb{E}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t\rangle\right]
\;=\;
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\mathbf{v}_{1:T}}\bigl[\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t\rangle\bigr].
\]
Next, we recall the law of total expectation: for every two random variables $\alpha,\beta$ and a function $g$, $\mathbb{E}_{\alpha}[g(\alpha)] = \mathbb{E}_{\beta}\,\mathbb{E}_{\alpha}[g(\alpha)\,|\,\beta]$. Setting $\alpha = \mathbf{v}_{1:t}$ and $\beta = \mathbf{v}_{1:t-1}$ we get that
\[
\mathbb{E}_{\mathbf{v}_{1:T}}\bigl[\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t\rangle\bigr]
= \mathbb{E}_{\mathbf{v}_{1:t}}\bigl[\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t\rangle\bigr]
= \mathbb{E}_{\mathbf{v}_{1:t-1}}\,\mathbb{E}_{\mathbf{v}_{1:t}}\bigl[\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t\rangle \,\big|\, \mathbf{v}_{1:t-1}\bigr].
\]
Once we know $\mathbf{v}_{1:t-1}$, the value of $\mathbf{w}^{(t)}$ is not random any more and therefore
\[
\mathbb{E}_{\mathbf{v}_{1:t-1}}\,\mathbb{E}_{\mathbf{v}_{1:t}}\bigl[\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t\rangle \,\big|\, \mathbf{v}_{1:t-1}\bigr]
= \mathbb{E}_{\mathbf{v}_{1:t-1}}\Bigl[\bigl\langle \mathbf{w}^{(t)} - \mathbf{w}^\star,\; \mathbb{E}_{\mathbf{v}_t}[\mathbf{v}_t\,|\,\mathbf{v}_{1:t-1}]\bigr\rangle\Bigr].
\]
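The conditioning step above can also be illustrated numerically. The following toy Monte Carlo check (an illustrative sketch, not part of the proof) uses the assumed example $f(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w} - \mathbf{w}^\star\|^2$ with a noisy gradient oracle: it compares the empirical average of $\langle \mathbf{w}^{(2)} - \mathbf{w}^\star, \mathbf{v}_2\rangle$ with the average obtained after replacing $\mathbf{v}_2$ by its conditional expectation given $\mathbf{v}_1$; the two agree, exactly as the identities above predict.

```python
import numpy as np

# Toy check of the conditioning step for f(w) = 0.5 * ||w - w_star||^2,
# whose gradient at w is (w - w_star); the stochastic gradient adds zero-mean noise.
rng = np.random.default_rng(0)
d, eta = 3, 0.1
w_star = np.ones(d)
w1 = np.zeros(d)                          # w^(1) is deterministic

def stochastic_grad(w):
    """Return (conditional mean, noisy sample) of the stochastic gradient at w."""
    mean = w - w_star
    return mean, mean + rng.normal(scale=0.5, size=d)

n, lhs, rhs = 100_000, 0.0, 0.0
for _ in range(n):
    _, v1 = stochastic_grad(w1)           # draw v_1
    w2 = w1 - eta * v1                    # w^(2) is a deterministic function of v_1
    mean2, v2 = stochastic_grad(w2)       # draw v_2 after v_1 is known
    lhs += np.dot(w2 - w_star, v2)        # sample of <w^(2) - w_star, v_2>
    rhs += np.dot(w2 - w_star, mean2)     # v_2 replaced by E[v_2 | v_1]
print(lhs / n, rhs / n)                   # the two averages nearly coincide
```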