Understanding Machine Learning: From Theory to Algorithms

Neural Networks


SGD for Neural Networks

parameters:
  number of iterations τ
  step size sequence η_1, η_2, ..., η_τ
  regularization parameter λ > 0
input:
  layered graph (V, E)
  differentiable activation function σ : R → R
initialize:
  choose w^(1) ∈ R^|E| at random
  (from a distribution s.t. w^(1) is close enough to 0)
for i = 1, 2, ..., τ
  sample (x, y) ∼ D
  calculate gradient v_i = backpropagation(x, y, w^(i), (V, E), σ)
  update w^(i+1) = w^(i) − η_i (v_i + λ w^(i))
output:
  w̄ is the best performing w^(i) on a validation set
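
Each iteration takes a stochastic gradient step on the regularized objective: v_i is the loss gradient returned by backpropagation, and the λw^(i) term is the gradient of the (λ/2)‖w‖² regularizer. Below is a minimal NumPy sketch of this loop, assuming placeholder callables grad_fn (standing in for backpropagation) and sample_fn (drawing one example from D); these names and the 0.01 initialization scale are our choices, not the book's:

    import numpy as np

    def sgd_for_networks(grad_fn, sample_fn, dim, step_sizes, lam, seed=0):
        """Regularized SGD: w^(i+1) = w^(i) - eta_i * (v_i + lam * w^(i))."""
        rng = np.random.default_rng(seed)
        w = 0.01 * rng.standard_normal(dim)   # w^(1) random and close to 0
        iterates = [w.copy()]
        for eta in step_sizes:                # tau iterations, one per step size
            x, y = sample_fn()                # sample (x, y) ~ D
            v = grad_fn(x, y, w)              # loss gradient at the current w
            w = w - eta * (v + lam * w)
            iterates.append(w.copy())
        return iterates                       # keep them all; pick the best later

Returning every iterate mirrors the output step: w̄ is whichever w^(i) performs best on a held-out validation set, not necessarily the last one.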

Backpropagation

input:
  example (x, y), weight vector w, layered graph (V, E),
  activation function σ : R → R
initialize:
  denote the layers of the graph V_0, ..., V_T, where V_t = {v_{t,1}, ..., v_{t,k_t}}
  define W_{t,i,j} as the weight of (v_{t,j}, v_{t+1,i})
  (where we set W_{t,i,j} = 0 if (v_{t,j}, v_{t+1,i}) ∉ E)
forward:
  set o_0 = x
  for t = 1, ..., T
    for i = 1, ..., k_t
      set a_{t,i} = ∑_{j=1}^{k_{t−1}} W_{t−1,i,j} o_{t−1,j}
      set o_{t,i} = σ(a_{t,i})
backward:
  set δ_T = o_T − y
  for t = T−1, T−2, ..., 1
    for i = 1, ..., k_t
      set δ_{t,i} = ∑_{j=1}^{k_{t+1}} W_{t,j,i} δ_{t+1,j} σ′(a_{t+1,j})
output:
  foreach edge (v_{t−1,j}, v_{t,i}) ∈ E
    set the partial derivative to δ_{t,i} σ′(a_{t,i}) o_{t−1,j}
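
Because the sums over j in the forward and backward passes are matrix-vector products, the whole procedure collapses to a few lines of linear algebra for a fully connected layered network: δ_t = W_tᵀ (δ_{t+1} ⊙ σ′(a_{t+1})). Below is a minimal NumPy sketch under that assumption, reading δ_T = o_T − y as the gradient of the squared loss ½‖o_T − y‖², with weights[t] holding W_t as a k_{t+1} × k_t matrix (a missing edge is simply a zero entry, per the initialize step); the function and argument names are ours, not the book's:

    import numpy as np

    def backpropagation(x, y, weights, sigma, sigma_prime):
        """Per-layer gradients of the squared loss 0.5 * ||o_T - y||^2."""
        T = len(weights)
        o = [np.asarray(x, dtype=float)]          # forward: o_0 = x
        a = [None]                                # a_0 is never used
        for t in range(1, T + 1):
            a.append(weights[t - 1] @ o[t - 1])   # a_t = W_{t-1} o_{t-1}
            o.append(sigma(a[t]))                 # o_t = sigma(a_t)
        delta = [None] * (T + 1)
        delta[T] = o[T] - y                       # backward: delta_T = o_T - y
        for t in range(T - 1, 0, -1):             # delta_t = W_t^T (delta_{t+1} * sigma'(a_{t+1}))
            delta[t] = weights[t].T @ (delta[t + 1] * sigma_prime(a[t + 1]))
        # output: d(loss)/dW_{t-1,i,j} = delta_{t,i} * sigma'(a_{t,i}) * o_{t-1,j}
        return [np.outer(delta[t] * sigma_prime(a[t]), o[t - 1]) for t in range(1, T + 1)]

With, say, sigma = np.tanh and sigma_prime = lambda a: 1.0 - np.tanh(a)**2, the returned per-layer matrices can be flattened and concatenated into the gradient vector v_i consumed by the SGD loop above.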