Understanding Machine Learning: From Theory to Algorithms


That is, the parameters specifying a hypothesis in the hypothesis class are the
weights over the edges of the network.
We can now study the approximation error, estimation error, and optimization
error of such hypothesis classes. In Section 20.3 we study the approximation
error of $\mathcal{H}_{V,E,\sigma}$ by studying what type of functions hypotheses in $\mathcal{H}_{V,E,\sigma}$ can
implement, in terms of the size of the underlying graph. In Section 20.4 we
study the estimation error of $\mathcal{H}_{V,E,\sigma}$, for the case of binary classification (i.e.,
$|V_T| = 1$ and $\sigma$ is the sign function), by analyzing its VC dimension. Finally, in
Section 20.5 we show that it is computationally hard to learn the class $\mathcal{H}_{V,E,\sigma}$,
even if the underlying graph is small, and in Section 20.6 we present the most
commonly used heuristic for training $\mathcal{H}_{V,E,\sigma}$.

20.3 The Expressive Power of Neural Networks


In this section we study the expressive power of neural networks, namely, what
type of functions can be implemented using a neural network. More concretely,
we will fix some architecture, $(V,E,\sigma)$, and will study what functions hypotheses
in $\mathcal{H}_{V,E,\sigma}$ can implement, as a function of the size of $V$.
We start the discussion by studying which type of Boolean functions (i.e.,
functions from $\{\pm 1\}^n$ to $\{\pm 1\}$) can be implemented by $\mathcal{H}_{V,E,\mathrm{sign}}$. Observe that
for every computer in which real numbers are stored using $b$ bits, whenever we
calculate a function $f : \mathbb{R}^n \to \mathbb{R}$ on such a computer we in fact calculate a
function $g : \{\pm 1\}^{nb} \to \{\pm 1\}^b$. Therefore, studying which Boolean functions can
be implemented by $\mathcal{H}_{V,E,\mathrm{sign}}$ can tell us which functions can be implemented on
a computer that stores real numbers using $b$ bits.
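
To make this observation concrete, here is a minimal Python sketch (an
illustration, not part of the text; the function names are ours) that exposes
the $b = 64$ bit pattern of an IEEE-754 double as a vector in $\{\pm 1\}^{64}$.
Composing it with any float-valued function shows that what the machine really
computes is a Boolean function of the input bits.

    import struct

    def float_to_pm1(x: float) -> list[int]:
        """View the 64-bit IEEE-754 representation of x as a vector in {+-1}^64."""
        (bits,) = struct.unpack("<Q", struct.pack("<d", x))  # raw 64-bit pattern
        return [1 if (bits >> i) & 1 else -1 for i in range(64)]

    def pm1_to_float(v: list[int]) -> float:
        """Inverse map: decode a {+-1}^64 vector back into a double."""
        bits = sum(1 << i for i in range(64) if v[i] == 1)
        return struct.unpack("<d", struct.pack("<Q", bits))[0]

    # Any f: R -> R the machine computes induces a g: {+-1}^64 -> {+-1}^64.
    g = lambda v: float_to_pm1(pm1_to_float(v) ** 2)  # here f(x) = x^2
    assert pm1_to_float(g(float_to_pm1(3.0))) == 9.0
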
We begin with a simple claim, showing that without restricting the size of the
network, every Boolean function can be implemented using a neural network of
depth 2.

Claim 20.1 For every $n$, there exists a graph $(V,E)$ of depth 2, such that
$\mathcal{H}_{V,E,\mathrm{sign}}$ contains all functions from $\{\pm 1\}^n$ to $\{\pm 1\}$.

Proof We construct a graph with $|V_0| = n+1$, $|V_1| = 2^n + 1$, and $|V_2| = 1$. Let
$E$ be all possible edges between adjacent layers. Now, let $f : \{\pm 1\}^n \to \{\pm 1\}$
be some Boolean function. We need to show that we can adjust the weights so
that the network will implement $f$. Let $u_1, \ldots, u_k$ be all vectors in $\{\pm 1\}^n$ on
which $f$ outputs 1. Observe that for every $i$ and every $x \in \{\pm 1\}^n$, if $x \neq u_i$
then $\langle x, u_i \rangle \le n-2$, and if $x = u_i$ then $\langle x, u_i \rangle = n$ (each coordinate on which
$x$ and $u_i$ agree contributes $1$ to the inner product, and each coordinate on
which they disagree contributes $-1$). It follows that the function
$g_i(x) = \mathrm{sign}(\langle x, u_i \rangle - n + 1)$ equals 1 if and only if $x = u_i$. Hence we can
adapt the weights between $V_0$ and $V_1$ so that for every $i \in [k]$, the neuron $v_{1,i}$
implements the function $g_i(x)$. Next, we observe that $f(x)$ is the disjunction of
the functions $g_i(x)$, and therefore can be written as
$f(x) = \mathrm{sign}\left( \sum_{i=1}^{k} g_i(x) + k - 1 \right)$, which concludes our proof.
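
As a sanity check on this construction, the following NumPy sketch (our
illustration; the names build_depth2_network and predict are ours, not the
book's) builds the weights from the proof for an arbitrary Boolean function
and verifies the resulting network on all inputs. For compactness it
instantiates only the $k$ hidden neurons $v_{1,1}, \ldots, v_{1,k}$ that the
proof actually uses (the remaining neurons of $V_1$ would simply get weight 0
on their outgoing edge), and the constant neurons are replaced by explicit
bias terms.

    import itertools
    import numpy as np

    def build_depth2_network(f, n):
        """Weights for the depth-2 sign network of Claim 20.1.

        f maps tuples in {+-1}^n to +-1.  Returns (W1, b1, w2, b2) such that
        sign(w2 @ sign(W1 @ x + b1) + b2) == f(x) for every x in {+-1}^n.
        """
        # u_1, ..., u_k: all inputs on which f outputs 1.
        us = [u for u in itertools.product([1, -1], repeat=n) if f(u) == 1]
        k = len(us)
        # Hidden neuron i computes g_i(x) = sign(<x, u_i> - n + 1),
        # which equals 1 iff x == u_i.
        W1 = np.array(us, dtype=float).reshape(k, n)
        b1 = np.full(k, -(n - 1.0))
        # The output neuron implements the disjunction of the g_i's:
        # sign(sum_i g_i(x) + k - 1).
        w2 = np.ones(k)
        b2 = k - 1.0
        return W1, b1, w2, b2

    def predict(x, W1, b1, w2, b2):
        hidden = np.sign(W1 @ np.asarray(x, dtype=float) + b1)
        return int(np.sign(w2 @ hidden + b2))

    # Check the construction against every Boolean function on n = 2 inputs
    # (all 16 of them), on all 4 possible inputs.
    n = 2
    inputs = list(itertools.product([1, -1], repeat=n))
    for outputs in itertools.product([1, -1], repeat=len(inputs)):
        f = dict(zip(inputs, outputs)).__getitem__
        params = build_depth2_network(f, n)
        assert all(predict(x, *params) == f(x) for x in inputs)

Note that, as in the proof, no sign argument is ever 0: $\langle x, u_i \rangle - n + 1$
is always odd, and the output pre-activation is $1$ when some $g_i$ fires and
$-1$ otherwise.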