Understanding Machine Learning: From Theory to Algorithms

14.1 Gradient Descent 187

lemma14.1 Letv 1 ,...,vTbe an arbitrary sequence of vectors. Any algorithm
with an initializationw(1)= 0and an update rule of the form

w(t+1)=w(t)−ηvt (14.4)

satisfies

∑T

t=1

〈w(t)−w?,vt〉≤

‖w?‖^2 2 η

+

η 2

∑T

t=1

‖vt‖^2. (14.5)

In particular, for everyB,ρ > 0 , if for alltwe have that‖vt‖≤ρand if we set

η=

√

B^2 ρ^2 T, then for everyw

?with‖w?‖≤Bwe have

1

T

∑T

t=1

〈w(t)−w?,vt〉≤B ρ√ T

.

Proof Using algebraic manipulations (completing the square), we obtain:

〈w(t)−w?,vt〉=

1

η

〈w(t)−w?,ηvt〉

=

1

2 η

(−‖w(t)−w?−ηvt‖^2 +‖w(t)−w?‖^2 +η^2 ‖vt‖^2 )

=

1

2 η

(−‖w(t+1)−w?‖^2 +‖w(t)−w?‖^2 ) +

η 2

‖vt‖^2 ,

where the last equality follows from the definition of the update rule. Summing
the equality overt, we have

∑T

t=1

〈w(t)−w?,vt〉 =

1

2 η

∑T

t=1

(

−‖w(t+1)−w?‖^2 +‖w(t)−w?‖^2

)

+

η 2

∑T

t=1

‖vt‖^2.

(14.6)
The first sum on the right-hand side is a telescopic sum that collapses to

‖w(1)−w?‖^2 −‖w(T+1)−w?‖^2.

Plugging this in Equation (14.6), we have

∑T

t=1

〈w(t)−w?,vt〉 =

1

2 η

(‖w(1)−w?‖^2 −‖w(T+1)−w?‖^2 ) +

η 2

∑T

t=1

‖vt‖^2

≤

1

2 η

‖w(1)−w?‖^2 +

η 2

∑T

t=1

‖vt‖^2

=

1

2 η

‖w?‖^2 +

η 2

∑T

t=1

‖vt‖^2 ,

where the last equality is due to the definitionw(1)= 0. This proves the first
part of the lemma (Equation (14.5)). The second part follows by upper bounding
‖w?‖byB,‖vt‖byρ, dividing byT, and plugging in the value ofη.

Understanding Machine Learning: From Theory to Algorithms

+

∑T

√

1

T

∑T

.

1

1

1

1

∑T

(

)

+

∑T

1

∑T

≤

1

∑T

=

1

∑T

Get our desktop app

Company

Features

Documentation

Resources