Understanding Machine Learning: From Theory to Algorithms

14.5 Learning with SGD

$L_D(\mathbf{w})$, that is, a random vector whose conditional expected value is $\nabla L_D(\mathbf{w}^{(t)})$. We shall now see how such an estimate can be easily constructed.

For simplicity, let us first consider the case of differentiable loss functions. Hence the risk function $L_D$ is also differentiable. The construction of the random vector $v_t$ will be as follows: First, sample $z \sim \mathcal{D}$. Then, define $v_t$ to be the gradient of the function $\ell(\mathbf{w}, z)$ with respect to $\mathbf{w}$, at the point $\mathbf{w}^{(t)}$. Then, by the linearity of the gradient we have


\[
\mathbb{E}[v_t \mid \mathbf{w}^{(t)}] = \mathbb{E}_{z\sim\mathcal{D}}[\nabla \ell(\mathbf{w}^{(t)}, z)] = \nabla\, \mathbb{E}_{z\sim\mathcal{D}}[\ell(\mathbf{w}^{(t)}, z)] = \nabla L_D(\mathbf{w}^{(t)}). \tag{14.13}
\]

The gradient of the loss function $\ell(\mathbf{w}, z)$ at $\mathbf{w}^{(t)}$ is therefore an unbiased estimate of the gradient of the risk function $L_D(\mathbf{w}^{(t)})$, and it is easily constructed by sampling a single fresh example $z \sim \mathcal{D}$ at each iteration $t$.

The same argument holds for nondifferentiable loss functions. We simply let $v_t$ be a subgradient of $\ell(\mathbf{w}, z)$ at $\mathbf{w}^{(t)}$. Then, for every $\mathbf{u}$ we have


\[
\ell(\mathbf{u}, z) - \ell(\mathbf{w}^{(t)}, z) \ge \langle \mathbf{u} - \mathbf{w}^{(t)}, v_t \rangle.
\]

Taking expectation on both sides with respect to $z \sim \mathcal{D}$, conditioned on the value of $\mathbf{w}^{(t)}$, we obtain


\[
\begin{aligned}
L_D(\mathbf{u}) - L_D(\mathbf{w}^{(t)}) &= \mathbb{E}\bigl[\ell(\mathbf{u}, z) - \ell(\mathbf{w}^{(t)}, z) \,\big|\, \mathbf{w}^{(t)}\bigr] \\
&\ge \mathbb{E}\bigl[\langle \mathbf{u} - \mathbf{w}^{(t)}, v_t \rangle \,\big|\, \mathbf{w}^{(t)}\bigr] \\
&= \langle \mathbf{u} - \mathbf{w}^{(t)}, \mathbb{E}[v_t \mid \mathbf{w}^{(t)}] \rangle.
\end{aligned}
\]

It follows that $\mathbb{E}[v_t \mid \mathbf{w}^{(t)}]$ is a subgradient of $L_D(\mathbf{w})$ at $\mathbf{w}^{(t)}$.
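The unbiasedness in Equation (14.13) can be checked empirically. The following sketch (not from the book) uses an assumed synthetic setup: the squared loss $\ell(\mathbf{w}, z) = (\langle \mathbf{w}, \mathbf{x}\rangle - y)^2$ with $z = (\mathbf{x}, y)$, where $\mathbf{x} \sim N(0, I)$ and $y = \langle \mathbf{w}^*, \mathbf{x}\rangle$, so that the true risk gradient has the closed form $\nabla L_D(\mathbf{w}) = 2(\mathbf{w} - \mathbf{w}^*)$. Averaging many independent single-sample gradients should then approach this closed-form gradient.

```python
import numpy as np

# Assumed illustrative setup (not the book's): squared loss on a noiseless
# linear model with x ~ N(0, I), y = <w_star, x>.  Since E[x x^T] = I,
# the risk gradient is exactly grad L_D(w) = 2 (w - w_star).
rng = np.random.default_rng(0)
d = 5
w_star = rng.standard_normal(d)
w = rng.standard_normal(d)  # the point w^(t) at which we estimate the gradient

def sample_gradient(w):
    """v_t: gradient of the loss on one fresh example z ~ D."""
    x = rng.standard_normal(d)
    y = x @ w_star
    return 2.0 * (x @ w - y) * x  # grad_w (<w,x> - y)^2

# Average many independent single-sample gradients (Monte Carlo estimate
# of E[v_t | w]) and compare with the closed-form risk gradient.
avg_grad = np.mean([sample_gradient(w) for _ in range(100_000)], axis=0)
true_grad = 2.0 * (w - w_star)
max_err = np.max(np.abs(avg_grad - true_grad))  # small: the estimate is unbiased
```

Each individual $v_t$ is a noisy vector, but its conditional expectation matches $\nabla L_D(\mathbf{w})$, which is exactly what the SGD analysis requires.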
To summarize, the stochastic gradient descent framework for minimizing the
risk is as follows.


Stochastic Gradient Descent (SGD) for minimizing $L_D(\mathbf{w})$

parameters: scalar $\eta > 0$, integer $T > 0$
initialize: $\mathbf{w}^{(1)} = \mathbf{0}$
for $t = 1, 2, \ldots, T$:
    sample $z \sim \mathcal{D}$
    pick $v_t \in \partial \ell(\mathbf{w}^{(t)}, z)$
    update $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta v_t$
output $\bar{\mathbf{w}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)}$
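The framework above can be sketched in a few lines of code. This is a minimal illustration, not the book's implementation: the distribution $\mathcal{D}$ (here $\mathbf{x} \sim N(0, I)$, $y = \langle \mathbf{w}^*, \mathbf{x}\rangle$), the absolute loss $\ell(\mathbf{w}, z) = |\langle \mathbf{w}, \mathbf{x}\rangle - y|$ (convex and nondifferentiable, so the subgradient step is exercised), and the step size $\eta$ are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
w_star = rng.standard_normal(d)  # assumed "true" parameter of the synthetic D

def sample_z():
    """sample z ~ D: a fresh example (x, y) at each call."""
    x = rng.standard_normal(d)
    return x, x @ w_star

def sgd(eta, T):
    w = np.zeros(d)              # initialize: w^(1) = 0
    iterates = []
    for _ in range(T):
        x, y = sample_z()        # sample z ~ D
        # pick v_t in the subdifferential of the absolute loss at w^(t):
        # sign(<w, x> - y) * x  (any choice is valid at a tie).
        v = np.sign(x @ w - y) * x
        iterates.append(w)
        w = w - eta * v          # update: w^(t+1) = w^(t) - eta * v_t
    return np.mean(iterates, axis=0)  # output the averaged iterate w-bar

w_bar = sgd(eta=0.01, T=20_000)
# Estimate the risk of w-bar on fresh samples; for this noiseless D the
# minimal risk is 0, so the estimate should be close to 0.
risk = np.mean([abs(x @ w_bar - y) for x, y in (sample_z() for _ in range(5_000))])
```

Note that, in line with the framework, every iteration draws a fresh example rather than reusing a fixed training set, and the output is the average of the iterates $\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(T)}$, which is what the convergence analysis of Theorem 14.8 bounds.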

We shall now use our analysis of SGD to obtain a sample complexity analysis for learning convex-Lipschitz-bounded problems. Theorem 14.8 yields the following:


corollary 14.12 Consider a convex-Lipschitz-bounded learning problem with parameters $\rho, B$. Then, for every $\epsilon > 0$, if we run the SGD method for minimizing
