Understanding Machine Learning: From Theory to Algorithms

90 Nonuniform Learnability

lemma7.6 (Kraft Inequality) IfS ⊆{ 0 , 1 }∗is a prefix-free set of strings, then ∑

σ∈S

1

2 |σ|

≤ 1.

Proof Define a probability distribution over the members ofSas follows: Re- peatedly toss an unbiased coin, with faces labeled 0 and 1, until the sequence of outcomes is a member ofS; at that point, stop. For eachσ∈ S, letP(σ) be the probability that this process generates the stringσ. Note that sinceSis prefix-free, for everyσ∈ S, if the coin toss outcomes follow the bits ofσthen we will stop only once the sequence of outcomes equalsσ. We therefore get that, for everyσ∈S,P(σ) = 2 |^1 σ|. Since probabilities add up to at most 1, our proof is concluded. In light of Kraft’s inequality, any prefix-free description language of a hypothesis class,H, gives rise to a weighting functionwover that hypothesis class – we will simply setw(h) = 2 |^1 h|. This observation immediately yields the following: theorem7.7 LetHbe a hypothesis class and letd:H→{ 0 , 1 }∗be a prefix- free description language forH. Then, for every sample size,m, every confidence parameter,δ > 0 , and every probability distribution,D, with probability greater than 1 −δover the choice ofS∼Dmwe have that,

∀h∈H, LD(h)≤LS(h) +

√

|h|+ ln(2/δ) 2 m

,

where|h|is the length ofd(h).

Proof Choosew(h) = 1/ 2 |h|, apply Theorem 7.4 withn(m,δ) =

√

ln(2/δ) 2 m , and note that ln(2|h|) =|h|ln(2)<|h|.

As was the case with Theorem 7.4, this result suggests a learning paradigm forH– given a training set,S, search for a hypothesish∈ Hthat minimizes the bound,LS(h) +

√

|h|+ln(2/δ) 2 m. In particular, it suggests trading off empirical risk for saving description length. This yields the Minimum Description Length learning paradigm.

Minimum Description Length (MDL) prior knowledge: His a countable hypothesis class His described by a prefix-free language over{ 0 , 1 } For everyh∈H,|h|is the length of the representation ofh input:A training setS∼Dm, confidenceδ output:h∈argminh∈H

[

LS(h) +

√

|h|+ln(2/δ) 2 m

]

Example 7.3 LetHbe the class of all predictors that can be implemented using some programming language, say, C++. Let us represent each program using the

Understanding Machine Learning: From Theory to Algorithms

1

≤ 1.

√

,

√

√

[

√

]

Get our desktop app

Company

Features

Documentation

Resources