Understanding Machine Learning: From Theory to Algorithms


Stochastic Gradient Descent


subgradient of $f$ at $\mathbf{w}^{(t)}$, we can still derive a similar bound on the expected output of stochastic gradient descent. This is formalized in the following theorem.

Theorem 14.8  Let $B, \rho > 0$. Let $f$ be a convex function and let $\mathbf{w}^\star \in \operatorname{argmin}_{\mathbf{w} : \|\mathbf{w}\| \le B} f(\mathbf{w})$. Assume that SGD is run for $T$ iterations with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$. Assume also that for all $t$, $\|\mathbf{v}_t\| \le \rho$ with probability 1. Then,
\[
\mathbb{E}[f(\bar{\mathbf{w}})] - f(\mathbf{w}^\star) \;\le\; \frac{B\rho}{\sqrt{T}} .
\]
Therefore, for any $\epsilon > 0$, to achieve $\mathbb{E}[f(\bar{\mathbf{w}})] - f(\mathbf{w}^\star) \le \epsilon$, it suffices to run the SGD algorithm for a number of iterations that satisfies
\[
T \;\ge\; \frac{B^2 \rho^2}{\epsilon^2} .
\]
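
Before turning to the proof, here is a minimal sketch (not from the book) of SGD with averaging in the setting of the theorem: $\mathbf{w}^{(1)} = 0$, $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}_t$ with $\eta = B/(\rho\sqrt{T})$, and output $\bar{\mathbf{w}} = \frac{1}{T}\sum_t \mathbf{w}^{(t)}$. The names `averaged_sgd` and `example_oracle`, the least-squares toy objective, and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def averaged_sgd(stochastic_subgradient, dim, B, rho, T, seed=0):
    """Sketch of SGD with averaging: w^(1) = 0, w^(t+1) = w^(t) - eta * v_t,
    step size eta = sqrt(B^2 / (rho^2 T)), output w_bar = (1/T) * sum_t w^(t)."""
    rng = np.random.default_rng(seed)
    eta = B / (rho * np.sqrt(T))
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(T):
        v = stochastic_subgradient(w, rng)  # E[v_t | w^(t)] should be a subgradient at w^(t)
        w_sum += w
        w = w - eta * v
    return w_sum / T

# Hypothetical toy objective (an assumption, not from the text):
# f(w) = 0.5 * E[(x^T w - y)^2] with data generated by the all-ones vector;
# one random sample per step serves as the stochastic gradient. Note that a
# Gaussian x does not satisfy ||v_t|| <= rho with probability 1, so rho below
# is only a nominal scale used to set the step size.
def example_oracle(w, rng):
    x = rng.normal(size=w.shape)
    y = x @ np.ones_like(w)
    return (x @ w - y) * x

w_bar = averaged_sgd(example_oracle, dim=5, B=5.0, rho=50.0, T=20_000)
```

As a quick use of the iteration bound: with $B = \rho = 1$ and $\epsilon = 0.01$, the theorem asks for $T \ge B^2\rho^2/\epsilon^2 = 10{,}000$ iterations.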
Proof  Let us introduce the notation $\mathbf{v}_{1:t}$ to denote the sequence $\mathbf{v}_1, \ldots, \mathbf{v}_t$. Taking expectation of Equation (14.2), we obtain

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\big[f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)\big]
\;\le\;
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big(f(\mathbf{w}^{(t)}) - f(\mathbf{w}^\star)\big)\right].
\]

Since Lemma 14.1 holds for any sequence $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_T$, it applies to SGD as well. By taking expectation of the bound in the lemma we have

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\right]
\;\le\;
\frac{B\rho}{\sqrt{T}} .
\tag{14.9}
\]
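
For reference, a paraphrase (not a quotation) of what Lemma 14.1 provides here: for any sequence $\mathbf{v}_1, \ldots, \mathbf{v}_T$, the updates $\mathbf{w}^{(1)} = 0$, $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}_t$ satisfy
\[
\frac{1}{T}\sum_{t=1}^{T}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle
\;\le\;
\frac{\|\mathbf{w}^\star\|^2}{2\eta T} + \frac{\eta}{2T}\sum_{t=1}^{T}\|\mathbf{v}_t\|^2 ,
\]
and with $\|\mathbf{w}^\star\| \le B$, $\|\mathbf{v}_t\| \le \rho$, and $\eta = B/(\rho\sqrt{T})$ the right-hand side is at most $\frac{B^2}{2\eta T} + \frac{\eta\rho^2}{2} = \frac{B\rho}{\sqrt{T}}$; taking expectations then gives Equation (14.9).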

It is left to show that

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big(f(\mathbf{w}^{(t)}) - f(\mathbf{w}^\star)\big)\right]
\;\le\;
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\right],
\tag{14.10}
\]

which we will hereby prove.
Using the linearity of the expectation we have

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\right]
=
\frac{1}{T}\sum_{t=1}^{T}\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\big].
\]

Next, we recall the law of total expectation: for every two random variables $\alpha, \beta$ and a function $g$, $\mathbb{E}_{\alpha}[g(\alpha)] = \mathbb{E}_{\beta}\,\mathbb{E}_{\alpha}[g(\alpha) \mid \beta]$. Setting $\alpha = \mathbf{v}_{1:t}$ and $\beta = \mathbf{v}_{1:t-1}$ we get that

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\big]
=
\mathop{\mathbb{E}}_{\mathbf{v}_{1:t}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\big]
=
\mathop{\mathbb{E}}_{\mathbf{v}_{1:t-1}}\,\mathop{\mathbb{E}}_{\mathbf{v}_{1:t}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle \,\big|\, \mathbf{v}_{1:t-1}\big].
\]
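
As an aside (not part of the proof), the law of total expectation used above can be sanity-checked numerically on a toy distribution; the specific choice of $\alpha$, $\beta$, and $g$ below is an assumption made for illustration.

```python
import numpy as np

# Check E[g(alpha)] = E_beta[ E[g(alpha) | beta] ] on a toy example:
# beta is a fair coin in {0, 1}, alpha | beta ~ N(beta, 1), and g(x) = x^2.
rng = np.random.default_rng(0)
n = 1_000_000
beta = rng.integers(0, 2, size=n)
alpha = rng.normal(loc=beta, scale=1.0)
g = lambda x: x ** 2

lhs = g(alpha).mean()                                # E[g(alpha)]
cond = [g(alpha[beta == k]).mean() for k in (0, 1)]  # E[g(alpha) | beta = k]
rhs = 0.5 * cond[0] + 0.5 * cond[1]                  # average over beta (uniform)
print(lhs, rhs)                                      # both approximately 1.5
```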

Once we know $\mathbf{v}_{1:t-1}$, the value of $\mathbf{w}^{(t)}$ is not random anymore, and therefore

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:t-1}}\,\mathop{\mathbb{E}}_{\mathbf{v}_{1:t}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle \,\big|\, \mathbf{v}_{1:t-1}\big]
=
\mathop{\mathbb{E}}_{\mathbf{v}_{1:t-1}}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star,\; \mathop{\mathbb{E}}_{\mathbf{v}_t}[\mathbf{v}_t \mid \mathbf{v}_{1:t-1}]\big\rangle.
\]
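
The remainder of the proof is not included in this excerpt; presumably it concludes along the following lines (a sketch, not the book's text). Since SGD only requires that $\mathbb{E}_{\mathbf{v}_t}[\mathbf{v}_t \mid \mathbf{v}_{1:t-1}]$ be a subgradient of $f$ at $\mathbf{w}^{(t)}$, the definition of the subgradient gives
\[
\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star,\; \mathbb{E}_{\mathbf{v}_t}[\mathbf{v}_t \mid \mathbf{v}_{1:t-1}] \big\rangle
\;\ge\;
f(\mathbf{w}^{(t)}) - f(\mathbf{w}^\star) .
\]
Taking expectation over $\mathbf{v}_{1:t-1}$, averaging over $t$, and combining with the chain of equalities above yields Equation (14.10), which together with Equation (14.9) completes the proof.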