Understanding Machine Learning: From Theory to Algorithms

14.5 Learning with SGD

to $z_1, \ldots, z_T$. Clearly, $\mathbb{E}[f_t(w^\star)] = L_{\mathcal{D}}(w^\star)$. In addition, using the same argument
as in the proof of Theorem 14.8 we have that
\[
\mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} f_t(w^{(t)}) \right]
= \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} L_{\mathcal{D}}(w^{(t)}) \right]
\ge \mathbb{E}\left[ L_{\mathcal{D}}(\bar{w}) \right].
\]

Combining all we conclude our proof.

As a direct corollary we obtain:

Corollary 14.14  Consider a convex-smooth-bounded learning problem with
parameters $\beta, B$. Assume in addition that $\ell(0, z) \le 1$ for all $z \in Z$. For every
$\epsilon > 0$, set $\eta = \frac{1}{\beta(1 + 3/\epsilon)}$. Then, running SGD with $T \ge 12B^2\beta/\epsilon^2$ yields
\[
\mathbb{E}[L_{\mathcal{D}}(\bar{w})] \le \min_{w \in \mathcal{H}} L_{\mathcal{D}}(w) + \epsilon.
\]
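As a quick sanity check of these constants, here is a small sketch that computes the step size and iteration bound of the corollary; the helper name and the sample values of $\beta$, $B$, and $\epsilon$ are my own, not from the text:

```python
import math

def sgd_smooth_params(beta, B, eps):
    """Hypothetical helper: step size and iteration bound from Corollary 14.14.

    beta: smoothness parameter, B: bound on the norm of the competitor,
    eps: target excess risk.
    """
    eta = 1.0 / (beta * (1.0 + 3.0 / eps))      # eta = 1 / (beta (1 + 3/eps))
    T = math.ceil(12 * B ** 2 * beta / eps ** 2)  # T >= 12 B^2 beta / eps^2
    return eta, T

# Example: beta = 1, B = 1, target excess risk eps = 0.1
eta, T = sgd_smooth_params(beta=1.0, B=1.0, eps=0.1)
```

Note the $1/\epsilon^2$ dependence in $T$: halving the target excess risk quadruples the required number of SGD iterations.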

14.5.3 SGD for Regularized Loss Minimization


We have shown that SGD enjoys the same worst-case sample complexity bound
as regularized loss minimization. However, on some distributions, regularized loss
minimization may yield a better solution. Therefore, in some cases we may want
to solve the optimization problem associated with regularized loss minimization,
namely,^1

\[
\min_{w} \left( \frac{\lambda}{2} \|w\|^2 + L_S(w) \right). \tag{14.14}
\]

Since we are dealing with convex learning problems in which the loss function is
convex, the preceding problem is also a convex optimization problem that can
be solved using SGD as well, as we shall see in this section.
Define $f(w) = \frac{\lambda}{2}\|w\|^2 + L_S(w)$. Note that $f$ is a $\lambda$-strongly convex function;
therefore, we can apply the SGD variant given in Section 14.4.4 (with $\mathcal{H} = \mathbb{R}^d$).
To apply this algorithm, we only need to find a way to construct an unbiased
estimate of a subgradient of $f$ at $w^{(t)}$. This is easily done by noting that if
we pick $z$ uniformly at random from $S$ and choose $v_t \in \partial\ell(w^{(t)}, z)$, then the
expected value of $\lambda w^{(t)} + v_t$ is a subgradient of $f$ at $w^{(t)}$.
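The construction above can be sketched in code. This is a minimal illustration rather than the book's pseudocode: it uses the strongly convex step size $\eta_t = 1/(\lambda t)$ from Section 14.4.4, takes the hinge loss as one illustrative choice of convex loss, and the names `sgd_regularized` and `hinge_subgrad` are hypothetical:

```python
import numpy as np

def sgd_regularized(S, subgrad_loss, lam, T, seed=None):
    """SGD sketch for min_w (lam/2)||w||^2 + L_S(w).

    Each round picks z uniformly from S; lam*w + v_t, with v_t a
    subgradient of the loss at (w, z), is then an unbiased estimate
    of a subgradient of the regularized objective at w.
    """
    rng = np.random.default_rng(seed)
    d = len(S[0][0])
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        x, y = S[rng.integers(len(S))]   # z = (x, y) drawn uniformly from S
        v = subgrad_loss(w, x, y)        # v_t in the subdifferential of the loss
        g = lam * w + v                  # unbiased subgradient estimate of f at w
        w = w - (1.0 / (lam * t)) * g    # step size eta_t = 1/(lam t)
        w_sum += w
    return w_sum / T                     # averaged iterate w-bar

def hinge_subgrad(w, x, y):
    """Subgradient of the hinge loss max(0, 1 - y <w, x>) at w."""
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(x)
```

A design note: because the regularizer's gradient $\lambda w^{(t)}$ is known exactly, only the loss term needs to be sampled, which is what makes the estimate unbiased.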
To analyze the resulting algorithm, we first rewrite the update rule (assuming

¹ We divided $\lambda$ by 2 for convenience.
