14.1 Gradient Descent 187lemma14.1 Letv 1 ,...,vTbe an arbitrary sequence of vectors. Any algorithm
with an initializationw(1)= 0and an update rule of the form
w(t+1)=w(t)−ηvt (14.4)satisfies
∑Tt=1〈w(t)−w?,vt〉≤‖w?‖^2
2 η+
η
2∑T
t=1‖vt‖^2. (14.5)In particular, for everyB,ρ > 0 , if for alltwe have that‖vt‖≤ρand if we set
η=
√
B^2
ρ^2 T, then for everyw?with‖w?‖≤Bwe have1
T
∑T
t=1〈w(t)−w?,vt〉≤B ρ√
T.
Proof Using algebraic manipulations (completing the square), we obtain:
〈w(t)−w?,vt〉=1
η〈w(t)−w?,ηvt〉=1
2 η(−‖w(t)−w?−ηvt‖^2 +‖w(t)−w?‖^2 +η^2 ‖vt‖^2 )=1
2 η(−‖w(t+1)−w?‖^2 +‖w(t)−w?‖^2 ) +η
2‖vt‖^2 ,where the last equality follows from the definition of the update rule. Summing
the equality overt, we have
∑Tt=1〈w(t)−w?,vt〉 =1
2 η∑T
t=1(
−‖w(t+1)−w?‖^2 +‖w(t)−w?‖^2)
+
η
2∑T
t=1‖vt‖^2.(14.6)
The first sum on the right-hand side is a telescopic sum that collapses to
‖w(1)−w?‖^2 −‖w(T+1)−w?‖^2.Plugging this in Equation (14.6), we have
∑Tt=1〈w(t)−w?,vt〉 =1
2 η(‖w(1)−w?‖^2 −‖w(T+1)−w?‖^2 ) +η
2∑T
t=1‖vt‖^2≤
1
2 η‖w(1)−w?‖^2 +η
2∑T
t=1‖vt‖^2=
1
2 η‖w?‖^2 +η
2∑T
t=1‖vt‖^2 ,where the last equality is due to the definitionw(1)= 0. This proves the first
part of the lemma (Equation (14.5)). The second part follows by upper bounding
‖w?‖byB,‖vt‖byρ, dividing byT, and plugging in the value ofη.