Understanding Machine Learning: From Theory to Algorithms

(Jeff_L) #1

90 Nonuniform Learnability


lemma7.6 (Kraft Inequality) IfS ⊆{ 0 , 1 }∗is a prefix-free set of strings, then

σ∈S

1

2 |σ|

≤ 1.

Proof Define a probability distribution over the members ofSas follows: Re-
peatedly toss an unbiased coin, with faces labeled 0 and 1, until the sequence
of outcomes is a member ofS; at that point, stop. For eachσ∈ S, letP(σ)
be the probability that this process generates the stringσ. Note that sinceSis
prefix-free, for everyσ∈ S, if the coin toss outcomes follow the bits ofσthen
we will stop only once the sequence of outcomes equalsσ. We therefore get that,
for everyσ∈S,P(σ) = 2 |^1 σ|. Since probabilities add up to at most 1, our proof
is concluded.
In light of Kraft’s inequality, any prefix-free description language of a hypoth-
esis class,H, gives rise to a weighting functionwover that hypothesis class – we
will simply setw(h) = 2 |^1 h|. This observation immediately yields the following:
theorem7.7 LetHbe a hypothesis class and letd:H→{ 0 , 1 }∗be a prefix-
free description language forH. Then, for every sample size,m, every confidence
parameter,δ > 0 , and every probability distribution,D, with probability greater
than 1 −δover the choice ofS∼Dmwe have that,

∀h∈H, LD(h)≤LS(h) +


|h|+ ln(2/δ)
2 m

,

where|h|is the length ofd(h).

Proof Choosew(h) = 1/ 2 |h|, apply Theorem 7.4 withn(m,δ) =


ln(2/δ)
2 m , and
note that ln(2|h|) =|h|ln(2)<|h|.

As was the case with Theorem 7.4, this result suggests a learning paradigm
forH– given a training set,S, search for a hypothesish∈ Hthat minimizes
the bound,LS(h) +


|h|+ln(2/δ)
2 m. In particular, it suggests trading off empirical
risk for saving description length. This yields the Minimum Description Length
learning paradigm.

Minimum Description Length (MDL)
prior knowledge:
His a countable hypothesis class
His described by a prefix-free language over{ 0 , 1 }
For everyh∈H,|h|is the length of the representation ofh
input:A training setS∼Dm, confidenceδ
output:h∈argminh∈H

[

LS(h) +


|h|+ln(2/δ)
2 m

]

Example 7.3 LetHbe the class of all predictors that can be implemented using
some programming language, say, C++. Let us represent each program using the
Free download pdf