90 Nonuniform Learnability
lemma7.6 (Kraft Inequality) IfS ⊆{ 0 , 1 }∗is a prefix-free set of strings, then
∑
σ∈S
1
2 |σ|
≤ 1.
Proof Define a probability distribution over the members ofSas follows: Re-
peatedly toss an unbiased coin, with faces labeled 0 and 1, until the sequence
of outcomes is a member ofS; at that point, stop. For eachσ∈ S, letP(σ)
be the probability that this process generates the stringσ. Note that sinceSis
prefix-free, for everyσ∈ S, if the coin toss outcomes follow the bits ofσthen
we will stop only once the sequence of outcomes equalsσ. We therefore get that,
for everyσ∈S,P(σ) = 2 |^1 σ|. Since probabilities add up to at most 1, our proof
is concluded.
In light of Kraft’s inequality, any prefix-free description language of a hypoth-
esis class,H, gives rise to a weighting functionwover that hypothesis class – we
will simply setw(h) = 2 |^1 h|. This observation immediately yields the following:
theorem7.7 LetHbe a hypothesis class and letd:H→{ 0 , 1 }∗be a prefix-
free description language forH. Then, for every sample size,m, every confidence
parameter,δ > 0 , and every probability distribution,D, with probability greater
than 1 −δover the choice ofS∼Dmwe have that,
∀h∈H, LD(h)≤LS(h) +
√
|h|+ ln(2/δ)
2 m
,
where|h|is the length ofd(h).
Proof Choosew(h) = 1/ 2 |h|, apply Theorem 7.4 withn(m,δ) =
√
ln(2/δ)
2 m , and
note that ln(2|h|) =|h|ln(2)<|h|.
As was the case with Theorem 7.4, this result suggests a learning paradigm
forH– given a training set,S, search for a hypothesish∈ Hthat minimizes
the bound,LS(h) +
√
|h|+ln(2/δ)
2 m. In particular, it suggests trading off empirical
risk for saving description length. This yields the Minimum Description Length
learning paradigm.
Minimum Description Length (MDL)
prior knowledge:
His a countable hypothesis class
His described by a prefix-free language over{ 0 , 1 }
For everyh∈H,|h|is the length of the representation ofh
input:A training setS∼Dm, confidenceδ
output:h∈argminh∈H
[
LS(h) +
√
|h|+ln(2/δ)
2 m
]
Example 7.3 LetHbe the class of all predictors that can be implemented using
some programming language, say, C++. Let us represent each program using the