Understanding Machine Learning: From Theory to Algorithms


14.4.3 Other Averaging Techniques


We have set the output vector to be $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$. There are alternative
approaches, such as outputting $w^{(t)}$ for some random $t \in [T]$, or outputting the
average of $w^{(t)}$ over the last $\alpha T$ iterations, for some $\alpha \in (0,1)$. One can also take
a weighted average of the last few iterates. These more sophisticated averaging
schemes can improve the convergence speed in some situations, such as in the
case of strongly convex functions defined in the following.
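
For concreteness, the three output rules just described can be sketched in Python as follows; the function names and the use of NumPy arrays are our own illustrative choices, not part of the text.

```python
import random
import numpy as np

def uniform_average(iterates):
    # Output w_bar = (1/T) * sum_{t=1}^T w^(t).
    return np.mean(iterates, axis=0)

def random_iterate(iterates, rng=random):
    # Output w^(t) for a uniformly random t in [T].
    return iterates[rng.randrange(len(iterates))]

def suffix_average(iterates, alpha=0.5):
    # Output the average of w^(t) over the last ceil(alpha*T) iterations, alpha in (0,1).
    T = len(iterates)
    k = max(1, int(np.ceil(alpha * T)))
    return np.mean(iterates[T - k:], axis=0)
```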

14.4.4 Strongly Convex Functions*


In this section we show a variant of SGD that enjoys a faster convergence rate for
problems in which the objective function is strongly convex (see Definition 13.4
of strong convexity in the previous chapter). We rely on the following claim,
which generalizes Lemma 13.5.

Claim 14.10  If $f$ is $\lambda$-strongly convex, then for every $w$, $u$, and $v \in \partial f(w)$ we
have
$$\langle w - u, v \rangle \;\ge\; f(w) - f(u) + \frac{\lambda}{2}\,\|w - u\|^2.$$

The proof is similar to the proof of Lemma 13.5 and is left as an exercise.

SGD for minimizing a $\lambda$-strongly convex function

Goal: Solve $\min_{w \in \mathcal{H}} f(w)$
Parameter: $T$
Initialize: $w^{(1)} = 0$
For $t = 1, \ldots, T$:
  Choose a random vector $v_t$ such that $\mathbb{E}[v_t \mid w^{(t)}] \in \partial f(w^{(t)})$
  Set $\eta_t = 1/(\lambda t)$
  Set $w^{(t+\frac{1}{2})} = w^{(t)} - \eta_t v_t$
  Set $w^{(t+1)} = \arg\min_{w \in \mathcal{H}} \|w - w^{(t+\frac{1}{2})}\|^2$
Output: $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$
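
As a concrete illustration, here is a minimal Python sketch of this procedure. It assumes the hypothesis class $\mathcal{H}$ is a Euclidean ball of radius `B`, so the projection step has a closed form, and that a user-supplied `subgradient_oracle(w)` returns a random vector whose conditional expectation lies in $\partial f(w)$; both choices are illustrative rather than prescribed by the text.

```python
import numpy as np

def project_to_ball(w, B):
    # Euclidean projection onto H = {w : ||w|| <= B}.
    norm = np.linalg.norm(w)
    return w if norm <= B else (B / norm) * w

def sgd_strongly_convex(subgradient_oracle, dim, lam, T, B):
    """SGD with step size eta_t = 1/(lam*t), projection onto a ball, and iterate averaging."""
    w = np.zeros(dim)                        # w^(1) = 0
    running_sum = np.zeros(dim)
    for t in range(1, T + 1):
        running_sum += w                     # accumulate w^(t) for the average
        v = subgradient_oracle(w)            # random v_t with E[v_t | w^(t)] in the subgradient set of f at w^(t)
        eta = 1.0 / (lam * t)                # eta_t = 1 / (lambda * t)
        w = project_to_ball(w - eta * v, B)  # gradient step followed by projection onto H
    return running_sum / T                   # w_bar = (1/T) * sum_{t=1}^T w^(t)
```

For instance, for a regularized objective $f(w) = \frac{\lambda}{2}\|w\|^2 + \mathbb{E}_z[\ell(w,z)]$ with a convex loss $\ell$, the oracle can return $\lambda w$ plus a subgradient of $\ell(\cdot, z_t)$ at $w$ for a freshly sampled example $z_t$.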

Theorem 14.11  Assume that $f$ is $\lambda$-strongly convex and that $\mathbb{E}[\|v_t\|^2] \le \rho^2$.
Let $w^\star \in \arg\min_{w \in \mathcal{H}} f(w)$ be an optimal solution. Then,
$$\mathbb{E}[f(\bar{w})] - f(w^\star) \;\le\; \frac{\rho^2}{2\lambda T}\,\big(1 + \log(T)\big).$$

Proof  Let $\nabla^{(t)} = \mathbb{E}[v_t \mid w^{(t)}]$. Since $f$ is strongly convex and $\nabla^{(t)}$ is in the
subgradient set of $f$ at $w^{(t)}$, we have that
$$\langle w^{(t)} - w^\star, \nabla^{(t)} \rangle \;\ge\; f(w^{(t)}) - f(w^\star) + \frac{\lambda}{2}\,\|w^{(t)} - w^\star\|^2. \qquad (14.11)$$

Next, we show that
$$\mathbb{E}\big[\langle w^{(t)} - w^\star, \nabla^{(t)} \rangle\big] \;\le\; \frac{\mathbb{E}\big[\|w^{(t)} - w^\star\|^2 - \|w^{(t+1)} - w^\star\|^2\big]}{2\eta_t} + \frac{\eta_t}{2}\,\rho^2. \qquad (14.12)$$
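
For orientation, here is a sketch of how (14.11) and (14.12) combine to give the theorem's bound; it fills in the summation step under the algorithm's choice $\eta_t = 1/(\lambda t)$. Taking expectations of (14.11), chaining it with (14.12), and summing over $t = 1, \ldots, T$ gives
$$\sum_{t=1}^{T}\big(\mathbb{E}[f(w^{(t)})] - f(w^\star)\big) \;\le\; \sum_{t=1}^{T}\left(\frac{\mathbb{E}\big[\|w^{(t)}-w^\star\|^2 - \|w^{(t+1)}-w^\star\|^2\big]}{2\eta_t} - \frac{\lambda}{2}\,\mathbb{E}\big[\|w^{(t)}-w^\star\|^2\big] + \frac{\eta_t}{2}\,\rho^2\right) \;\le\; \frac{\rho^2}{2\lambda}\sum_{t=1}^{T}\frac{1}{t} \;\le\; \frac{\rho^2}{2\lambda}\big(1 + \log(T)\big),$$
where the middle inequality holds because, with $1/\eta_t = \lambda t$, the distance terms telescope to a nonpositive quantity, and the last inequality uses $\sum_{t=1}^{T} 1/t \le 1 + \log(T)$. Dividing by $T$ and applying Jensen's inequality to the convex function $f$ then yields the bound of Theorem 14.11.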