Understanding Machine Learning: From Theory to Algorithms


In particular,
$$
\delta_t = J_{o_t}(\ell_t)
         = J_{\sigma(W_t o_t)}(\ell_{t+1})\,\operatorname{diag}\big(\sigma'(W_t o_t)\big)\,W_t
         = J_{o_{t+1}}(\ell_{t+1})\,\operatorname{diag}\big(\sigma'(a_{t+1})\big)\,W_t
         = \delta_{t+1}\,\operatorname{diag}\big(\sigma'(a_{t+1})\big)\,W_t .
$$
In summary, we can first calculate the vectors {a_t, o_t} from the bottom of
the network to its top. Then, we calculate the vectors {δ_t} from the top of
the network back to its bottom. Once we have all of these vectors, the partial
derivatives are easily obtained using Equation (20.3). We have thus shown that
the pseudocode of backpropagation indeed calculates the gradient.
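
To make the two passes concrete, here is a minimal NumPy sketch of the forward and backward computations (an illustration, not the chapter's pseudocode verbatim). It assumes a fully connected layered network in which W_t maps layer t to layer t+1, so that a_{t+1} = W_t o_t and o_{t+1} = σ(a_{t+1}), and it fixes the sigmoid activation and the squared loss purely for concreteness; the per-edge partial derivatives are then assembled from δ_{t+1}, σ′(a_{t+1}), and o_t.

```python
import numpy as np

def sigma(z):
    # Activation function; the sigmoid is chosen here only for concreteness.
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def forward_backward(x, y, Ws):
    """One forward/backward pass over a fully connected layered network.

    Ws[t] maps layer t to layer t+1: a_{t+1} = Ws[t] @ o_t and o_{t+1} = sigma(a_{t+1}).
    The loss is assumed to be the squared loss 0.5 * ||o_T - y||^2.
    Returns the gradient of the loss with respect to each weight matrix Ws[t].
    """
    T = len(Ws)
    # Forward pass: compute the vectors a_t, o_t from the bottom of the network to its top.
    o = [np.asarray(x, dtype=float)]
    a = [None]  # a[0] is unused; a[t + 1] = Ws[t] @ o[t]
    for t in range(T):
        a.append(Ws[t] @ o[t])
        o.append(sigma(a[t + 1]))

    # Backward pass: compute the vectors delta_t from the top of the network to its
    # bottom, using delta_t = delta_{t+1} diag(sigma'(a_{t+1})) W_t.
    delta = [None] * (T + 1)
    delta[T] = o[T] - np.asarray(y, dtype=float)  # gradient of the squared loss at the output
    for t in range(T - 1, -1, -1):
        delta[t] = (delta[t + 1] * sigma_prime(a[t + 1])) @ Ws[t]

    # Partial derivatives with respect to the edge weights:
    # d(loss)/d(Ws[t][i, j]) = delta_{t+1}[i] * sigma'(a_{t+1}[i]) * o_t[j].
    return [np.outer(delta[t + 1] * sigma_prime(a[t + 1]), o[t]) for t in range(T)]

# Example usage with arbitrary shapes: a 3-4-2 network.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
grads = forward_backward(rng.normal(size=3), rng.normal(size=2), Ws)
```

Comparing grads against finite differences of the loss is a quick way to confirm that the backward pass is consistent with the forward computation.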

20.7 Summary


Neural networks over graphs of size s(n) can be used to describe hypothesis
classes of all predictors that can be implemented in runtime of O(√s(n)). We
have also shown that their sample complexity depends polynomially on s(n)
(specifically, it depends on the number of edges in the network). Therefore, classes
of neural network hypotheses seem to be an excellent choice. Regrettably, the
problem of training the network on the basis of training data is computationally
hard. We have presented the SGD framework as a heuristic approach for training
neural networks and described the backpropagation algorithm which efficiently
calculates the gradient of the loss function with respect to the weights over the
edges.

20.8 Bibliographic Remarks


Neural networks were extensively studied in the 1980s and early 1990s, but with
mixed empirical success. In recent years, a combination of algorithmic advance-
ments, as well as increasing computational power and data size, has led to a
breakthrough in the effectiveness of neural networks. In particular, “deep net-
works” (i.e., networks of more than 2 layers) have shown very impressive practical
performance on a variety of domains. A few examples include convolutional net-
works (LeCun & Bengio 1995), restricted Boltzmann machines (Hinton, Osindero
& Teh 2006), auto-encoders (Ranzato, Huang, Boureau & LeCun 2007, Bengio &
LeCun 2007, Collobert & Weston 2008, Lee, Grosse, Ranganath & Ng 2009, Le,
Ranzato, Monga, Devin, Corrado, Chen, Dean & Ng 2012), and sum-product
networks (Livni, Shalev-Shwartz & Shamir 2013, Poon & Domingos 2011). See
also (Bengio 2009) and the references therein.
The expressive power of neural networks and the relation to circuit complexity
have been extensively studied in (Parberry 1994). For the analysis of the sample
complexity of neural networks we refer the reader to (Anthony & Bartlett 1999).
Our proof technique of Theorem 20.6 is due to the lecture notes of Kakade and Tewari.