Understanding Machine Learning: From Theory to Algorithms


Stochastic Gradient Descent


subgradient of $f$ at $\mathbf{w}^{(t)}$, we can still derive a similar bound on the expected output of stochastic gradient descent. This is formalized in the following theorem.

Theorem 14.8  Let $B, \rho > 0$. Let $f$ be a convex function and let $\mathbf{w}^\star \in \operatorname{argmin}_{\mathbf{w} : \|\mathbf{w}\| \le B} f(\mathbf{w})$. Assume that SGD is run for $T$ iterations with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$. Assume also that for all $t$, $\|\mathbf{v}_t\| \le \rho$ with probability 1. Then,
\[
\mathbb{E}[f(\bar{\mathbf{w}})] - f(\mathbf{w}^\star) \;\le\; \frac{B\rho}{\sqrt{T}} .
\]
Therefore, for any $\epsilon > 0$, to achieve $\mathbb{E}[f(\bar{\mathbf{w}})] - f(\mathbf{w}^\star) \le \epsilon$, it suffices to run the SGD algorithm for a number of iterations that satisfies
\[
T \;\ge\; \frac{B^2 \rho^2}{\epsilon^2} .
\]
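
Before turning to the proof, here is a minimal sketch (not from the book) of SGD with averaging in the setting of the theorem: $\mathbf{w}^{(1)} = 0$, $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}_t$ with $\eta = B/(\rho\sqrt{T})$, and output $\bar{\mathbf{w}} = \frac{1}{T}\sum_t \mathbf{w}^{(t)}$. The names `averaged_sgd` and `example_oracle`, the least-squares toy objective, and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def averaged_sgd(stochastic_subgradient, dim, B, rho, T, seed=0):
    """Sketch of SGD with averaging: w^(1) = 0, w^(t+1) = w^(t) - eta * v_t,
    step size eta = sqrt(B^2 / (rho^2 T)), output w_bar = (1/T) * sum_t w^(t)."""
    rng = np.random.default_rng(seed)
    eta = B / (rho * np.sqrt(T))
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(T):
        v = stochastic_subgradient(w, rng)  # E[v_t | w^(t)] should be a subgradient at w^(t)
        w_sum += w
        w = w - eta * v
    return w_sum / T

# Hypothetical toy objective (an assumption, not from the text):
# f(w) = 0.5 * E[(x^T w - y)^2] with data generated by the all-ones vector;
# one random sample per step serves as the stochastic gradient. Note that a
# Gaussian x does not satisfy ||v_t|| <= rho with probability 1, so rho below
# is only a nominal scale used to set the step size.
def example_oracle(w, rng):
    x = rng.normal(size=w.shape)
    y = x @ np.ones_like(w)
    return (x @ w - y) * x

w_bar = averaged_sgd(example_oracle, dim=5, B=5.0, rho=50.0, T=20_000)
```

As a quick use of the iteration bound: with $B = \rho = 1$ and $\epsilon = 0.01$, the theorem asks for $T \ge B^2\rho^2/\epsilon^2 = 10{,}000$ iterations.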
Proof  Let us introduce the notation $\mathbf{v}_{1:t}$ to denote the sequence $\mathbf{v}_1, \ldots, \mathbf{v}_t$. Taking expectation of Equation (14.2), we obtain

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\big[f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)\big]
\;\le\;
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big(f(\mathbf{w}^{(t)}) - f(\mathbf{w}^\star)\big)\right].
\]

Since Lemma 14.1 holds for any sequence $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_T$, it applies to SGD as well. By taking expectation of the bound in the lemma we have

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\right]
\;\le\;
\frac{B\rho}{\sqrt{T}} .
\tag{14.9}
\]
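
For reference, a paraphrase (not a quotation) of what Lemma 14.1 provides here: for any sequence $\mathbf{v}_1, \ldots, \mathbf{v}_T$, the updates $\mathbf{w}^{(1)} = 0$, $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}_t$ satisfy
\[
\frac{1}{T}\sum_{t=1}^{T}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle
\;\le\;
\frac{\|\mathbf{w}^\star\|^2}{2\eta T} + \frac{\eta}{2T}\sum_{t=1}^{T}\|\mathbf{v}_t\|^2 ,
\]
and with $\|\mathbf{w}^\star\| \le B$, $\|\mathbf{v}_t\| \le \rho$, and $\eta = B/(\rho\sqrt{T})$ the right-hand side is at most $\frac{B^2}{2\eta T} + \frac{\eta\rho^2}{2} = \frac{B\rho}{\sqrt{T}}$; taking expectations then gives Equation (14.9).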

It is left to show that

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big(f(\mathbf{w}^{(t)}) - f(\mathbf{w}^\star)\big)\right]
\;\le\;
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\right],
\tag{14.10}
\]

which we will hereby prove.
Using the linearity of the expectation we have

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\right]
=
\frac{1}{T}\sum_{t=1}^{T}\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\big].
\]

Next, we recall the law of total expectation: for every two random variables $\alpha, \beta$ and a function $g$, $\mathbb{E}_{\alpha}[g(\alpha)] = \mathbb{E}_{\beta}\,\mathbb{E}_{\alpha}[g(\alpha) \mid \beta]$. Setting $\alpha = \mathbf{v}_{1:t}$ and $\beta = \mathbf{v}_{1:t-1}$ we get that

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:T}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\big]
=
\mathop{\mathbb{E}}_{\mathbf{v}_{1:t}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle\big]
=
\mathop{\mathbb{E}}_{\mathbf{v}_{1:t-1}}\,\mathop{\mathbb{E}}_{\mathbf{v}_{1:t}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle \,\big|\, \mathbf{v}_{1:t-1}\big].
\]
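
As an aside (not part of the proof), the law of total expectation used above can be sanity-checked numerically on a toy distribution; the specific choice of $\alpha$, $\beta$, and $g$ below is an assumption made for illustration.

```python
import numpy as np

# Check E[g(alpha)] = E_beta[ E[g(alpha) | beta] ] on a toy example:
# beta is a fair coin in {0, 1}, alpha | beta ~ N(beta, 1), and g(x) = x^2.
rng = np.random.default_rng(0)
n = 1_000_000
beta = rng.integers(0, 2, size=n)
alpha = rng.normal(loc=beta, scale=1.0)
g = lambda x: x ** 2

lhs = g(alpha).mean()                                # E[g(alpha)]
cond = [g(alpha[beta == k]).mean() for k in (0, 1)]  # E[g(alpha) | beta = k]
rhs = 0.5 * cond[0] + 0.5 * cond[1]                  # average over beta (uniform)
print(lhs, rhs)                                      # both approximately 1.5
```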

Once we know $\mathbf{v}_{1:t-1}$, the value of $\mathbf{w}^{(t)}$ is not random anymore, and therefore

\[
\mathop{\mathbb{E}}_{\mathbf{v}_{1:t-1}}\,\mathop{\mathbb{E}}_{\mathbf{v}_{1:t}}\!\big[\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star, \mathbf{v}_t \big\rangle \,\big|\, \mathbf{v}_{1:t-1}\big]
=
\mathop{\mathbb{E}}_{\mathbf{v}_{1:t-1}}\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star,\; \mathop{\mathbb{E}}_{\mathbf{v}_t}[\mathbf{v}_t \mid \mathbf{v}_{1:t-1}]\big\rangle.
\]
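
The remainder of the proof is not included in this excerpt; presumably it concludes along the following lines (a sketch, not the book's text). Since SGD only requires that $\mathbb{E}_{\mathbf{v}_t}[\mathbf{v}_t \mid \mathbf{v}_{1:t-1}]$ be a subgradient of $f$ at $\mathbf{w}^{(t)}$, the definition of the subgradient gives
\[
\big\langle \mathbf{w}^{(t)} - \mathbf{w}^\star,\; \mathbb{E}_{\mathbf{v}_t}[\mathbf{v}_t \mid \mathbf{v}_{1:t-1}] \big\rangle
\;\ge\;
f(\mathbf{w}^{(t)}) - f(\mathbf{w}^\star) .
\]
Taking expectation over $\mathbf{v}_{1:t-1}$, averaging over $t$, and combining with the chain of equalities above yields Equation (14.10), which together with Equation (14.9) completes the proof.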