Understanding Machine Learning: From Theory to Algorithms

14.5 Learning with SGD

to $z_1, \ldots, z_T$. Clearly, $\mathbb{E}[f_t(w^\star)] = L_{\mathcal{D}}(w^\star)$. In addition, using the same argument
as in the proof of Theorem 14.8 we have that
\[
\mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} f_t(w^{(t)}) \right]
= \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} L_{\mathcal{D}}(w^{(t)}) \right]
\ge \mathbb{E}\left[ L_{\mathcal{D}}(\bar{w}) \right].
\]

Combining all we conclude our proof.

As a direct corollary we obtain:

Corollary 14.14  Consider a convex-smooth-bounded learning problem with
parameters $\beta, B$. Assume in addition that $\ell(0, z) \le 1$ for all $z \in Z$. For every
$\epsilon > 0$, set $\eta = \frac{1}{\beta(1 + 3/\epsilon)}$. Then, running SGD with $T \ge 12B^2\beta/\epsilon^2$ yields
\[
\mathbb{E}[L_{\mathcal{D}}(\bar{w})] \le \min_{w \in \mathcal{H}} L_{\mathcal{D}}(w) + \epsilon.
\]
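As a quick sanity check of these constants, here is a small sketch that computes the step size and iteration bound of the corollary; the helper name and the sample values of $\beta$, $B$, and $\epsilon$ are my own, not from the text:

```python
import math

def sgd_smooth_params(beta, B, eps):
    """Hypothetical helper: step size and iteration bound from Corollary 14.14.

    beta: smoothness parameter, B: bound on the norm of the competitor,
    eps: target excess risk.
    """
    eta = 1.0 / (beta * (1.0 + 3.0 / eps))      # eta = 1 / (beta (1 + 3/eps))
    T = math.ceil(12 * B ** 2 * beta / eps ** 2)  # T >= 12 B^2 beta / eps^2
    return eta, T

# Example: beta = 1, B = 1, target excess risk eps = 0.1
eta, T = sgd_smooth_params(beta=1.0, B=1.0, eps=0.1)
```

Note the $1/\epsilon^2$ dependence in $T$: halving the target excess risk quadruples the required number of SGD iterations.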

14.5.3 SGD for Regularized Loss Minimization


We have shown that SGD enjoys the same worst-case sample complexity bound
as regularized loss minimization. However, on some distributions, regularized loss
minimization may yield a better solution. Therefore, in some cases we may want
to solve the optimization problem associated with regularized loss minimization,
namely,^1

\[
\min_{w} \left( \frac{\lambda}{2} \|w\|^2 + L_S(w) \right). \tag{14.14}
\]

Since we are dealing with convex learning problems in which the loss function is
convex, the preceding problem is also a convex optimization problem that can
be solved using SGD as well, as we shall see in this section.
Define $f(w) = \frac{\lambda}{2}\|w\|^2 + L_S(w)$. Note that $f$ is a $\lambda$-strongly convex function;
therefore, we can apply the SGD variant given in Section 14.4.4 (with $\mathcal{H} = \mathbb{R}^d$).
To apply this algorithm, we only need to find a way to construct an unbiased
estimate of a subgradient of $f$ at $w^{(t)}$. This is easily done by noting that if
we pick $z$ uniformly at random from $S$ and choose $v_t \in \partial\ell(w^{(t)}, z)$, then the
expected value of $\lambda w^{(t)} + v_t$ is a subgradient of $f$ at $w^{(t)}$.
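The construction above can be sketched in code. This is a minimal illustration rather than the book's pseudocode: it uses the strongly convex step size $\eta_t = 1/(\lambda t)$ from Section 14.4.4, takes the hinge loss as one illustrative choice of convex loss, and the names `sgd_regularized` and `hinge_subgrad` are hypothetical:

```python
import numpy as np

def sgd_regularized(S, subgrad_loss, lam, T, seed=None):
    """SGD sketch for min_w (lam/2)||w||^2 + L_S(w).

    Each round picks z uniformly from S; lam*w + v_t, with v_t a
    subgradient of the loss at (w, z), is then an unbiased estimate
    of a subgradient of the regularized objective at w.
    """
    rng = np.random.default_rng(seed)
    d = len(S[0][0])
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        x, y = S[rng.integers(len(S))]   # z = (x, y) drawn uniformly from S
        v = subgrad_loss(w, x, y)        # v_t in the subdifferential of the loss
        g = lam * w + v                  # unbiased subgradient estimate of f at w
        w = w - (1.0 / (lam * t)) * g    # step size eta_t = 1/(lam t)
        w_sum += w
    return w_sum / T                     # averaged iterate w-bar

def hinge_subgrad(w, x, y):
    """Subgradient of the hinge loss max(0, 1 - y <w, x>) at w."""
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(x)
```

A design note: because the regularizer's gradient $\lambda w^{(t)}$ is known exactly, only the loss term needs to be sampled, which is what makes the estimate unbiased.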
To analyze the resulting algorithm, we first rewrite the update rule (assuming

¹ We divided $\lambda$ by 2 for convenience.
