Klivans & Sherstov (2006) have shown that for any c > 0, intersections of n^c halfspaces over {±1}^n are not efficiently PAC learnable, even if we allow representation-independent learning. This hardness result relies on the cryptographic assumption that there is no polynomial-time solution to the unique-shortest-vector problem. As we have argued, this implies that there cannot be an efficient algorithm for training neural networks, even if we allow larger networks or other activation functions that can be implemented efficiently.
The backpropagation algorithm was introduced in Rumelhart, Hinton & Williams (1986).
20.9 Exercises
1. Neural Networks are universal approximators: Let f : [−1,1]^n → [−1,1] be a ρ-Lipschitz function. Fix some ε > 0. Construct a neural network N : [−1,1]^n → [−1,1], with the sigmoid activation function, such that for every x ∈ [−1,1]^n it holds that |f(x) − N(x)| ≤ ε.
Hint: Similarly to the proof of Theorem 19.3, partition [−1,1]^n into small boxes. Use the Lipschitzness of f to show that it is approximately constant on each box. Finally, show that a neural network can first decide which box the input vector belongs to, and then predict the average value of f on that box (an illustrative one-dimensional sketch in code follows).
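For intuition only, here is a minimal one-dimensional sketch of the construction the hint describes: each box gets a soft indicator built from a difference of two steep sigmoids, and the network outputs the indicator-weighted value of f at the box centres. The helper names (sigmoid, build_network) and the steepness parameter k are assumptions made for this illustration, not part of the exercise; in higher dimension one would combine such one-dimensional gates coordinate-wise, as the hint suggests.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_network(f, rho, eps, k=300.0):
    """One-hidden-layer sigmoid network approximating a rho-Lipschitz f on [-1, 1].

    Partition [-1, 1] into boxes of width eps/rho.  The gate of box i,
    sigmoid(k*(x - left_i)) - sigmoid(k*(x - right_i)), is close to the
    indicator of that box for large k, and the output is the gate-weighted
    sum of f evaluated at the box centres.
    """
    width = eps / rho
    edges = np.arange(-1.0, 1.0 + width, width)
    centres = (edges[:-1] + edges[1:]) / 2.0
    values = f(centres)          # f is approximately constant on each box

    def N(x):
        gates = sigmoid(k * (x - edges[:-1])) - sigmoid(k * (x - edges[1:]))
        return float(np.dot(gates, values))

    return N

# Approximate the 1-Lipschitz function sin on [-1, 1] up to (roughly) eps.
N = build_network(np.sin, rho=1.0, eps=0.1)
print(abs(np.sin(0.3) - N(0.3)))   # small
```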
2. Prove Theorem 20.5.
Hint: For every f : {−1,1}^n → {−1,1} construct a 1-Lipschitz function g : [−1,1]^n → [−1,1] such that if you can approximate g then you can express f.
3. Growth function of product: For i = 1, 2, let F_i be a set of functions from X to Y_i. Define H = F_1 × F_2 to be the Cartesian product class. That is, for every f_1 ∈ F_1 and f_2 ∈ F_2, there exists h ∈ H such that h(x) = (f_1(x), f_2(x)). Prove that τ_H(m) ≤ τ_{F_1}(m) τ_{F_2}(m). (A toy numerical illustration of these definitions follows.)
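Purely as an illustration of the definitions, and not a proof, the snippet below builds a tiny product class and checks the claimed inequality on one fixed sample; the classes F1, F2 and the sample C are arbitrary choices made for this sketch.

```python
# Toy check of |H_C| <= |F1_C| * |F2_C| on one fixed sample C.
C = [0, 1, 2]                                    # a sample of m = 3 points of X
F1 = [lambda x, a=a: x >= a for a in range(4)]   # threshold functions
F2 = [lambda x, b=b: x % b == 0 for b in (1, 2, 3)]

def restrictions(F):
    """Distinct behaviours of the class F on the sample C."""
    return {tuple(f(x) for x in C) for f in F}

# The product class H: h(x) = (f1(x), f2(x)) for every pair (f1, f2).
H_restrictions = {tuple((f1(x), f2(x)) for x in C) for f1 in F1 for f2 in F2}

print(len(H_restrictions), "<=", len(restrictions(F1)) * len(restrictions(F2)))
```

In this particular example the bound holds with equality; in general the left-hand side can be strictly smaller.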
4. Growth function of composition: Let F_1 be a set of functions from X to Z and let F_2 be a set of functions from Z to Y. Let H = F_2 ∘ F_1 be the composition class. That is, for every f_1 ∈ F_1 and f_2 ∈ F_2, there exists h ∈ H such that h(x) = f_2(f_1(x)). Prove that τ_H(m) ≤ τ_{F_2}(m) τ_{F_1}(m).
5. VC of sigmoidal networks: In this exercise we show that there is a graph (V, E) such that the VC dimension of the class of neural networks over this graph with the sigmoid activation function is Ω(|E|^2). Note that for every ε > 0, the sigmoid activation function can approximate the threshold activation function, 1_{[∑_i x_i > 0]}, up to accuracy ε. To simplify the presentation, throughout the exercise we assume that we can exactly implement the activation function 1_{[∑_i x_i > 0]} using a sigmoid activation function.
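As a quick numerical illustration of this approximation (an aside, not part of the exercise), scaling the sigmoid's argument by a large factor pushes its output toward the threshold value whenever ∑_i x_i ≠ 0; the scaling factor k below is an assumption made only for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid(k * s) tends to 1[s > 0] as k grows, for any fixed s != 0;
# the gap decays exponentially in k (s = 0 is the degenerate case the
# simplifying assumption above sidesteps).
x = np.array([1, -1, 1, 1, -1])          # here sum(x) = 1 > 0
for k in (1.0, 10.0, 100.0):
    print(k, sigmoid(k * x.sum()))
```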
Fix somen.
- Construct a network, N_1, with O(n) weights, which implements a function from ℝ to {0,1}^n and satisfies the following property. For every x ∈ {0,1}^n,