Understanding Machine Learning: From Theory to Algorithms

14.1 Gradient Descent

Lemma 14.1 Let $v_1, \ldots, v_T$ be an arbitrary sequence of vectors. Any algorithm with an initialization $w^{(1)} = 0$ and an update rule of the form
\[
w^{(t+1)} = w^{(t)} - \eta v_t \tag{14.4}
\]
satisfies
\[
\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \le \frac{\|w^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 . \tag{14.5}
\]

In particular, for every $B, \rho > 0$, if for all $t$ we have that $\|v_t\| \le \rho$ and if we set $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then for every $w^\star$ with $\|w^\star\| \le B$ we have

\[
\frac{1}{T} \sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \le \frac{B\rho}{\sqrt{T}} .
\]
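The guarantee is easy to check numerically. The following is a minimal Python sketch (illustrative, not from the text): it runs the update rule (14.4) from $w^{(1)} = 0$ on a random sequence of vectors, sets $\eta$ as prescribed by the lemma, and confirms the averaged bound $B\rho/\sqrt{T}$.

    import numpy as np

    # Minimal sketch of the generic update w^(t+1) = w^(t) - eta * v_t,
    # starting from w^(1) = 0, on an arbitrary (here random) sequence v_1..v_T.
    rng = np.random.default_rng(0)
    T, d, B = 1000, 5, 1.0

    w_star = rng.normal(size=d)
    w_star *= B / np.linalg.norm(w_star)      # ensure ||w*|| <= B
    vs = rng.normal(size=(T, d))
    rho = np.linalg.norm(vs, axis=1).max()    # rho bounds every ||v_t||
    eta = np.sqrt(B**2 / (rho**2 * T))        # the eta prescribed by the lemma

    w = np.zeros(d)                           # w^(1) = 0
    total = 0.0
    for t in range(T):
        total += np.dot(w - w_star, vs[t])    # accumulate <w^(t) - w*, v_t>
        w = w - eta * vs[t]                   # update (14.4)

    print(total / T <= B * rho / np.sqrt(T))  # the lemma guarantees True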

Proof Using algebraic manipulations (completing the square), we obtain:
\[
\begin{aligned}
\langle w^{(t)} - w^\star, v_t \rangle
&= \frac{1}{\eta} \langle w^{(t)} - w^\star, \eta v_t \rangle \\
&= \frac{1}{2\eta} \left( -\|w^{(t)} - w^\star - \eta v_t\|^2 + \|w^{(t)} - w^\star\|^2 + \eta^2 \|v_t\|^2 \right) \\
&= \frac{1}{2\eta} \left( -\|w^{(t+1)} - w^\star\|^2 + \|w^{(t)} - w^\star\|^2 \right) + \frac{\eta}{2} \|v_t\|^2 ,
\end{aligned}
\]
where the last equality follows from the definition of the update rule. Summing the equality over $t$, we have


\[
\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle = \frac{1}{2\eta} \sum_{t=1}^{T} \left( -\|w^{(t+1)} - w^\star\|^2 + \|w^{(t)} - w^\star\|^2 \right) + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 . \tag{14.6}
\]
The first sum on the right-hand side is a telescopic sum that collapses to
\[
\|w^{(1)} - w^\star\|^2 - \|w^{(T+1)} - w^\star\|^2 .
\]

Plugging this into Equation (14.6), we have
\[
\begin{aligned}
\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle
&= \frac{1}{2\eta} \left( \|w^{(1)} - w^\star\|^2 - \|w^{(T+1)} - w^\star\|^2 \right) + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 \\
&\le \frac{1}{2\eta} \|w^{(1)} - w^\star\|^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 \\
&= \frac{1}{2\eta} \|w^\star\|^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 ,
\end{aligned}
\]

where the last equality is due to the definition $w^{(1)} = 0$. This proves the first part of the lemma (Equation (14.5)). The second part follows by upper bounding $\|w^\star\|$ by $B$ and $\|v_t\|$ by $\rho$, dividing by $T$, and plugging in the value of $\eta$: with $\eta = \sqrt{B^2/(\rho^2 T)}$, the right-hand side of (14.5) is at most $\frac{B^2}{2\eta} + \frac{\eta}{2}\rho^2 T = \frac{B\rho\sqrt{T}}{2} + \frac{B\rho\sqrt{T}}{2} = B\rho\sqrt{T}$, and dividing by $T$ yields $B\rho/\sqrt{T}$.
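As a sanity check on the proof, note that the identity obtained after telescoping (with $w^{(1)} = 0$) holds exactly, not merely as a bound. A short self-contained sketch, again with illustrative random data:

    import numpy as np

    # Verify the exact identity from the proof:
    # sum_t <w^(t) - w*, v_t>
    #   = (||w*||^2 - ||w^(T+1) - w*||^2) / (2*eta) + (eta/2) * sum_t ||v_t||^2
    rng = np.random.default_rng(1)
    T, d, eta = 50, 3, 0.1
    vs = rng.normal(size=(T, d))
    w_star = rng.normal(size=d)

    w = np.zeros(d)                           # w^(1) = 0
    lhs = 0.0
    for t in range(T):
        lhs += np.dot(w - w_star, vs[t])
        w = w - eta * vs[t]                   # after the loop, w is w^(T+1)

    rhs = (np.dot(w_star, w_star) - np.dot(w - w_star, w - w_star)) / (2 * eta) \
          + (eta / 2) * np.sum(vs ** 2)
    print(np.isclose(lhs, rhs))               # prints True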
