Understanding Machine Learning: From Theory to Algorithms


14.4.3 Other Averaging Techniques


We have set the output vector to be $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$. There are alternative
approaches, such as outputting $w^{(t)}$ for some random $t \in [T]$, or outputting the
average of $w^{(t)}$ over the last $\alpha T$ iterations, for some $\alpha \in (0,1)$. One can also take
a weighted average of the last few iterates. These more sophisticated averaging
schemes can improve the convergence speed in some situations, such as in the
case of strongly convex functions defined in the following.
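
For concreteness, the three output rules just described can be sketched in Python as follows; the function names and the use of NumPy arrays are our own illustrative choices, not part of the text.

```python
import random
import numpy as np

def uniform_average(iterates):
    # Output w_bar = (1/T) * sum_{t=1}^T w^(t).
    return np.mean(iterates, axis=0)

def random_iterate(iterates, rng=random):
    # Output w^(t) for a uniformly random t in [T].
    return iterates[rng.randrange(len(iterates))]

def suffix_average(iterates, alpha=0.5):
    # Output the average of w^(t) over the last ceil(alpha*T) iterations, alpha in (0,1).
    T = len(iterates)
    k = max(1, int(np.ceil(alpha * T)))
    return np.mean(iterates[T - k:], axis=0)
```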

14.4.4 Strongly Convex Functions*


In this section we show a variant of SGD that enjoys a faster convergence rate for
problems in which the objective function is strongly convex (see Definition 13.4
of strong convexity in the previous chapter). We rely on the following claim,
which generalizes Lemma 13.5.

Claim 14.10  If $f$ is $\lambda$-strongly convex, then for every $w$, $u$, and $v \in \partial f(w)$ we
have
$$\langle w - u, v \rangle \;\ge\; f(w) - f(u) + \frac{\lambda}{2}\,\|w - u\|^2.$$

The proof is similar to the proof of Lemma 13.5 and is left as an exercise.

SGD for minimizing a $\lambda$-strongly convex function

Goal: Solve $\min_{w \in \mathcal{H}} f(w)$
Parameter: $T$
Initialize: $w^{(1)} = 0$
For $t = 1, \ldots, T$:
  Choose a random vector $v_t$ such that $\mathbb{E}[v_t \mid w^{(t)}] \in \partial f(w^{(t)})$
  Set $\eta_t = 1/(\lambda t)$
  Set $w^{(t+\frac{1}{2})} = w^{(t)} - \eta_t v_t$
  Set $w^{(t+1)} = \arg\min_{w \in \mathcal{H}} \|w - w^{(t+\frac{1}{2})}\|^2$
Output: $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$
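
As a concrete illustration, here is a minimal Python sketch of this procedure. It assumes the hypothesis class $\mathcal{H}$ is a Euclidean ball of radius `B`, so the projection step has a closed form, and that a user-supplied `subgradient_oracle(w)` returns a random vector whose conditional expectation lies in $\partial f(w)$; both choices are illustrative rather than prescribed by the text.

```python
import numpy as np

def project_to_ball(w, B):
    # Euclidean projection onto H = {w : ||w|| <= B}.
    norm = np.linalg.norm(w)
    return w if norm <= B else (B / norm) * w

def sgd_strongly_convex(subgradient_oracle, dim, lam, T, B):
    """SGD with step size eta_t = 1/(lam*t), projection onto a ball, and iterate averaging."""
    w = np.zeros(dim)                        # w^(1) = 0
    running_sum = np.zeros(dim)
    for t in range(1, T + 1):
        running_sum += w                     # accumulate w^(t) for the average
        v = subgradient_oracle(w)            # random v_t with E[v_t | w^(t)] in the subgradient set of f at w^(t)
        eta = 1.0 / (lam * t)                # eta_t = 1 / (lambda * t)
        w = project_to_ball(w - eta * v, B)  # gradient step followed by projection onto H
    return running_sum / T                   # w_bar = (1/T) * sum_{t=1}^T w^(t)
```

For instance, for a regularized objective $f(w) = \frac{\lambda}{2}\|w\|^2 + \mathbb{E}_z[\ell(w,z)]$ with a convex loss $\ell$, the oracle can return $\lambda w$ plus a subgradient of $\ell(\cdot, z_t)$ at $w$ for a freshly sampled example $z_t$.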

Theorem 14.11  Assume that $f$ is $\lambda$-strongly convex and that $\mathbb{E}[\|v_t\|^2] \le \rho^2$.
Let $w^\star \in \arg\min_{w \in \mathcal{H}} f(w)$ be an optimal solution. Then,
$$\mathbb{E}[f(\bar{w})] - f(w^\star) \;\le\; \frac{\rho^2}{2\lambda T}\,\big(1 + \log(T)\big).$$

Proof  Let $\nabla^{(t)} = \mathbb{E}[v_t \mid w^{(t)}]$. Since $f$ is strongly convex and $\nabla^{(t)}$ is in the
subgradient set of $f$ at $w^{(t)}$, we have that
$$\langle w^{(t)} - w^\star, \nabla^{(t)} \rangle \;\ge\; f(w^{(t)}) - f(w^\star) + \frac{\lambda}{2}\,\|w^{(t)} - w^\star\|^2. \qquad (14.11)$$

Next, we show that
$$\mathbb{E}\big[\langle w^{(t)} - w^\star, \nabla^{(t)} \rangle\big] \;\le\; \frac{\mathbb{E}\big[\|w^{(t)} - w^\star\|^2 - \|w^{(t+1)} - w^\star\|^2\big]}{2\eta_t} + \frac{\eta_t}{2}\,\rho^2. \qquad (14.12)$$
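
For orientation, here is a sketch of how (14.11) and (14.12) combine to give the theorem's bound; it fills in the summation step under the algorithm's choice $\eta_t = 1/(\lambda t)$. Taking expectations of (14.11), chaining it with (14.12), and summing over $t = 1, \ldots, T$ gives
$$\sum_{t=1}^{T}\big(\mathbb{E}[f(w^{(t)})] - f(w^\star)\big) \;\le\; \sum_{t=1}^{T}\left(\frac{\mathbb{E}\big[\|w^{(t)}-w^\star\|^2 - \|w^{(t+1)}-w^\star\|^2\big]}{2\eta_t} - \frac{\lambda}{2}\,\mathbb{E}\big[\|w^{(t)}-w^\star\|^2\big] + \frac{\eta_t}{2}\,\rho^2\right) \;\le\; \frac{\rho^2}{2\lambda}\sum_{t=1}^{T}\frac{1}{t} \;\le\; \frac{\rho^2}{2\lambda}\big(1 + \log(T)\big),$$
where the middle inequality holds because, with $1/\eta_t = \lambda t$, the distance terms telescope to a nonpositive quantity, and the last inequality uses $\sum_{t=1}^{T} 1/t \le 1 + \log(T)$. Dividing by $T$ and applying Jensen's inequality to the convex function $f$ then yields the bound of Theorem 14.11.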