Understanding Machine Learning: From Theory to Algorithms

20.4 The Sample Complexity of Neural Networks 275

Proof To simplify the notation throughout the proof, let us denote the hy-
pothesis class byH. Recall the definition of the growth function,τH(m), from
Section 6.5.1. This function measures maxC⊂X:|C|=m|HC|, whereHCis the re-
striction ofHto functions fromCto{ 0 , 1 }. We can naturally extend the defi-
nition for a set of functions fromXto some finite setY, by lettingHCbe the
restriction ofHto functions fromCtoY, and keeping the definition ofτH(m)
intact.
Our neural network is defined by a layered graph. LetV 0 ,...,VTbe the layers
of the graph. Fix somet∈[T]. By assigning different weights on the edges
betweenVt− 1 andVt, we obtain different functions fromR|Vt−^1 |→{± 1 }|Vt|. Let
H(t)be the class of all possible such mappings fromR|Vt−^1 |→ {± 1 }|Vt|. Then,
Hcan be written as a composition,H=H(T)◦...◦H(1). In Exercise 4 we show
that the growth function of a composition of hypothesis classes is bounded by
the products of the growth functions of the individual classes. Therefore,

τH(m)≤

∏T

t=1

τH(t)(m).

In addition, eachH(t)can be written as a product of function classes,H(t)=
H(t,1)×···×H(t,|Vt|), where eachH(t,j)is all functions from layert−1 to{± 1 }
that thejth neuron of layertcan implement. In Exercise 3 we bound product
classes, and this yields

τH(t)(m)≤

|∏Vt|

i=1

τH(t,i)(m).

Letdt,ibe the number of edges that are headed to theith neuron of layert.
Since the neuron is a homogenous halfspace hypothesis and the VC dimension
of homogenous halfspaces is the dimension of their input, we have by Sauer’s
lemma that

τH(t,i)(m)≤

(

em dt,i

)dt,i ≤(em)dt,i.

Overall, we obtained that

τH(m)≤(em)

∑ t,idt,i= (em)|E|.

Now, assume that there aremshattered points. Then, we must haveτH(m) =
2 m, from which we obtain

2 m≤(em)|E| ⇒ m≤|E|log(em)/log(2).

The claim follows by Lemma A.2.

Next, we considerHV,E,σ, whereσis the sigmoid function. Surprisingly, it
turns out that the VC dimension ofHV,E,σis lower bounded by Ω(|E|^2 ) (see
Exercise 5.) That is, the VC dimension is the number of tunable parameters
squared. It is also possible to upper bound the VC dimension byO(|V|^2 |E|^2 ),
but the proof is beyond the scope of this book. In any case, since in practice

Understanding Machine Learning: From Theory to Algorithms

∏T

(

Get our desktop app

Company

Features

Documentation

Resources