30.1 Compression Bounds 411
theorem30.2 Letkbe an integer and letB:Zk→ Hbe a mapping from
sequences ofkexamples to the hypothesis class. Letm≥ 2 kbe a training set
size and letA:Zm→ Hbe a learning rule that receives a training sequenceS
of sizemand returns a hypothesis such thatA(S) =B(zi 1 ,...,zik)for some
(i 1 ,...,ik)∈[m]k. LetV={zj:j /∈(i 1 ,...,ik)}be the set of examples which
were not selected for definingA(S). Then, with probability of at least 1 −δover
the choice ofSwe have
LD(A(S)) ≤ LV(A(S)) +
√
LV(A(S))
4 klog(m/δ)
m
+
8 klog(m/δ)
m
.
Proof For anyI∈[m]klethI=B(zi 1 ,...,zik). Letn=m−k. Combining
Lemma 30.1 with the union bound we have
P
[
∃I∈[m]ks.t. LD(hI)−LV(hI)≥
√
2 LV(hI) log(1/δ)
n +
4 log(1/δ)
n
]
≤
∑
I∈[m]k
P
[
LD(hI)−LV(hI)≥
√
2 LV(hI) log(1/δ)
n
+4 log(1/δ)
n
]
≤mkδ.
Denoteδ′ =mkδ. Using the assumptionk ≤m/2, which implies thatn=
m−k≥m/2, the above implies that with probability of at least 1−δ′we have
that
LD(A(S)) ≤LV(A(S)) +
√
LV(A(S))^4 klog(m/δ
′)
m
+^8 klog(m/δ
′)
m
,
which concludes our proof.
As a direct corollary we obtain:
corollary30.3 Assuming the conditions of Theorem 30.2, and further as-
suming thatLV(A(S)) = 0, then, with probability of at least 1 −δover the choice
ofSwe have
LD(A(S)) ≤
8 klog(m/δ)
m
.
These results motivate the following definition:
definition30.4 (Compression Scheme)LetHbe a hypothesis class of
functions fromXtoYand letkbe an integer. We say thatHhas a compression
scheme of sizekif the following holds:
For allmthere existsA:Zm→[m]kandB:Zk→Hsuch that for allh∈H,
if we feed any training set of the form (x 1 ,h(x 1 )),...,(xm,h(xm)) intoAand
then feed (xi 1 ,h(xi 1 )),...,(xik,h(xik)) intoB, where (i 1 ,...,ik) is the output
ofA, then the output ofB, denotedh′, satisfiesLS(h′) = 0.
It is possible to generalize the definition for unrealizable sequences as follows.