14.1 Gradient Descent 187
lemma14.1 Letv 1 ,...,vTbe an arbitrary sequence of vectors. Any algorithm
with an initializationw(1)= 0and an update rule of the form
w(t+1)=w(t)−ηvt (14.4)
satisfies
∑T
t=1
〈w(t)−w?,vt〉≤
‖w?‖^2
2 η
+
η
2
∑T
t=1
‖vt‖^2. (14.5)
In particular, for everyB,ρ > 0 , if for alltwe have that‖vt‖≤ρand if we set
η=
√
B^2
ρ^2 T, then for everyw
?with‖w?‖≤Bwe have
1
T
∑T
t=1
〈w(t)−w?,vt〉≤B ρ√
T
.
Proof Using algebraic manipulations (completing the square), we obtain:
〈w(t)−w?,vt〉=
1
η
〈w(t)−w?,ηvt〉
=
1
2 η
(−‖w(t)−w?−ηvt‖^2 +‖w(t)−w?‖^2 +η^2 ‖vt‖^2 )
=
1
2 η
(−‖w(t+1)−w?‖^2 +‖w(t)−w?‖^2 ) +
η
2
‖vt‖^2 ,
where the last equality follows from the definition of the update rule. Summing
the equality overt, we have
∑T
t=1
〈w(t)−w?,vt〉 =
1
2 η
∑T
t=1
(
−‖w(t+1)−w?‖^2 +‖w(t)−w?‖^2
)
+
η
2
∑T
t=1
‖vt‖^2.
(14.6)
The first sum on the right-hand side is a telescopic sum that collapses to
‖w(1)−w?‖^2 −‖w(T+1)−w?‖^2.
Plugging this in Equation (14.6), we have
∑T
t=1
〈w(t)−w?,vt〉 =
1
2 η
(‖w(1)−w?‖^2 −‖w(T+1)−w?‖^2 ) +
η
2
∑T
t=1
‖vt‖^2
≤
1
2 η
‖w(1)−w?‖^2 +
η
2
∑T
t=1
‖vt‖^2
=
1
2 η
‖w?‖^2 +
η
2
∑T
t=1
‖vt‖^2 ,
where the last equality is due to the definitionw(1)= 0. This proves the first
part of the lemma (Equation (14.5)). The second part follows by upper bounding
‖w?‖byB,‖vt‖byρ, dividing byT, and plugging in the value ofη.