Understanding Machine Learning: From Theory to Algorithms

20.6 SGD and Backpropagation

Explaining How Backpropagation Calculates the Gradient:


We next explain how the backpropagation algorithm calculates the gradient of
the loss function on an example $(x,y)$ with respect to the vector $w$. Let us first
recall a few definitions from vector calculus. Each element of the gradient is
the partial derivative with respect to the variable in $w$ corresponding to one of
the edges of the network. Recall the definition of a partial derivative. Given a
function $f : \mathbb{R}^n \to \mathbb{R}$, the partial derivative with respect to the $i$th variable at $w$
is obtained by fixing the values of $w_1,\ldots,w_{i-1},w_{i+1},\ldots,w_n$, which yields the scalar
function $g : \mathbb{R} \to \mathbb{R}$ defined by $g(a) = f((w_1,\ldots,w_{i-1},w_i+a,w_{i+1},\ldots,w_n))$,
and then taking the derivative of $g$ at $0$. For a function with multiple outputs,
$f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian of $f$ at $w \in \mathbb{R}^n$, denoted $J_w(f)$, is the $m \times n$ matrix
whose $(i,j)$ element is the partial derivative of $f_i : \mathbb{R}^n \to \mathbb{R}$ with respect to its $j$th variable
at $w$. Note that if $m = 1$ then the Jacobian matrix is the gradient of the function
(represented as a row vector). Two examples of Jacobian calculations, which we
will use later, are as follows.



  • Let $f(w) = Aw$ for $A \in \mathbb{R}^{m,n}$. Then $J_w(f) = A$.

  • For every $n$, we use the notation $\sigma$ to denote the function from $\mathbb{R}^n$ to $\mathbb{R}^n$
    which applies the sigmoid function element-wise. That is, $\alpha = \sigma(\theta)$ means
    that for every $i$ we have $\alpha_i = \sigma(\theta_i) = \frac{1}{1+\exp(-\theta_i)}$. It is easy to verify
    that $J_\theta(\sigma)$ is a diagonal matrix whose $(i,i)$ entry is $\sigma'(\theta_i)$, where $\sigma'$ is
    the derivative of the (scalar) sigmoid function, namely, $\sigma'(\theta_i) = \frac{1}{(1+\exp(\theta_i))(1+\exp(-\theta_i))}$.
    We also use the notation $\mathrm{diag}(\sigma'(\theta))$ to denote this matrix.
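
To make these two Jacobian examples concrete, here is a small numerical check; it is a sketch assuming NumPy, and the helper `numerical_jacobian` is a name introduced here for illustration, not part of the text. It approximates $J_w(f)$ by central differences and compares the result against $A$ for the linear map and against $\mathrm{diag}(\sigma'(\theta))$ for the element-wise sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_jacobian(f, w, eps=1e-6):
    """Approximate J_w(f) column by column with central differences."""
    w = np.asarray(w, dtype=float)
    m = np.atleast_1d(f(w)).shape[0]
    J = np.zeros((m, w.shape[0]))
    for j in range(w.shape[0]):
        e = np.zeros_like(w)
        e[j] = eps
        J[:, j] = (f(w + e) - f(w - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))      # A in R^{m,n}
w = rng.normal(size=4)

# Example 1: f(w) = Aw has Jacobian A.
assert np.allclose(numerical_jacobian(lambda v: A @ v, w), A, atol=1e-5)

# Example 2: the element-wise sigmoid has Jacobian diag(sigma'(theta)).
theta = rng.normal(size=4)
sigma_prime = 1.0 / ((1.0 + np.exp(theta)) * (1.0 + np.exp(-theta)))
assert np.allclose(numerical_jacobian(sigmoid, theta), np.diag(sigma_prime), atol=1e-5)
```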

The chain rule for taking the derivative of a composition of functions can be
written in terms of the Jacobian as follows. Given two functions $f : \mathbb{R}^n \to \mathbb{R}^m$
and $g : \mathbb{R}^k \to \mathbb{R}^n$, the Jacobian of the composition function
$(f \circ g) : \mathbb{R}^k \to \mathbb{R}^m$ at $w$ is

$$J_w(f \circ g) = J_{g(w)}(f)\, J_w(g).$$

For example, for $g(w) = Aw$, where $A \in \mathbb{R}^{n,k}$, we have that

$$J_w(\sigma \circ g) = \mathrm{diag}(\sigma'(Aw))\, A.$$
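
This instance of the chain rule can also be checked numerically. The sketch below (again assuming NumPy, with an arbitrarily chosen $A$ and $w$) compares the analytic Jacobian $\mathrm{diag}(\sigma'(Aw))\,A$ with a central-difference approximation of $J_w(\sigma \circ g)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 5))   # A in R^{n,k} with n = 3, k = 5
w = rng.normal(size=5)

# Analytic Jacobian from the chain rule: J_w(sigma o g) = diag(sigma'(Aw)) A.
z = A @ w
sigma_prime = sigmoid(z) * (1.0 - sigmoid(z))
J_analytic = np.diag(sigma_prime) @ A

# Central-difference approximation of the same Jacobian.
eps = 1e-6
J_numeric = np.zeros_like(J_analytic)
for j in range(w.shape[0]):
    e = np.zeros_like(w)
    e[j] = eps
    J_numeric[:, j] = (sigmoid(A @ (w + e)) - sigmoid(A @ (w - e))) / (2 * eps)

assert np.allclose(J_analytic, J_numeric, atol=1e-5)
```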

To describe the backpropagation algorithm, let us first decompose $V$ into the
layers of the graph, $V = \dot\cup_{t=0}^{T} V_t$. For every $t$, let us write $V_t = \{v_{t,1},\ldots,v_{t,k_t}\}$,
where $k_t = |V_t|$. In addition, for every $t$ denote by $W_t \in \mathbb{R}^{k_{t+1},k_t}$ a matrix which
gives a weight to every potential edge between $V_t$ and $V_{t+1}$. If the edge exists in
$E$ then we set $W_{t,i,j}$ to be the weight, according to $w$, of the edge $(v_{t,j}, v_{t+1,i})$.
Otherwise, we add a “phantom” edge and set its weight to be zero, $W_{t,i,j} = 0$.
Since when calculating the partial derivative with respect to the weight of some
edge we fix all other weights, these additional “phantom” edges have no effect
on the partial derivative with respect to existing edges. It follows that we can
assume, without loss of generality, that all edges exist, that is, $E = \cup_t (V_t \times V_{t+1})$.
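
As an illustration of the “phantom edge” convention, the following sketch builds the dense matrices $W_t$ from a sparse edge list, filling every missing entry with zero. The layer sizes, the edge set, and the 0-based indexing are assumptions made for this example only and are not taken from the text.

```python
import numpy as np

# Hypothetical layer sizes k_0, ..., k_T and a sparse edge set:
# edges[t] maps (j, i) -> weight of the edge (v_{t,j}, v_{t+1,i}), using 0-based indices.
layer_sizes = [3, 4, 2]                     # k_0 = 3, k_1 = 4, k_2 = 2 (chosen for illustration)
edges = {
    0: {(0, 0): 0.5, (1, 2): -1.0, (2, 3): 2.0},
    1: {(0, 0): 1.5, (3, 1): -0.3},
}

def build_weight_matrices(layer_sizes, edges):
    """Return W_t in R^{k_{t+1} x k_t}; missing ("phantom") edges get weight zero."""
    Ws = []
    for t in range(len(layer_sizes) - 1):
        W = np.zeros((layer_sizes[t + 1], layer_sizes[t]))
        for (j, i), weight in edges.get(t, {}).items():
            W[i, j] = weight            # row i = target v_{t+1,i}, column j = source v_{t,j}
        Ws.append(W)
    return Ws

Ws = build_weight_matrices(layer_sizes, edges)
for t, W in enumerate(Ws):
    print(f"W_{t} has shape {W.shape}:\n{W}")
```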
