Understanding Machine Learning: From Theory to Algorithms


Next, we claim that for all $h$ we have $\mathbb{E}_S\!\left[e^{2(m-1)\Delta(h)^2}\right] \le m$. To do so, recall that
Hoeffding's inequality tells us that

$$\mathbb{P}_S[\Delta(h) \ge \epsilon] \le e^{-2m\epsilon^2}.$$

This implies that $\mathbb{E}_S\!\left[e^{2(m-1)\Delta(h)^2}\right] \le m$ (see Exercise 1). Combining this with
Equation (31.4) and plugging into Equation (31.1) we get

$$\mathbb{P}_S[f(S) \ge \epsilon] \;\le\; m\, e^{-\epsilon}. \tag{31.5}$$
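The moment bound above can be checked numerically. The sketch below is a Monte Carlo sanity check (not a proof) under the added assumption that the loss of a fixed hypothesis $h$ on each example is Bernoulli($p$), so that $\Delta(h)$ is the gap between $p$ and the empirical mean of $m$ draws; the values of $m$, $p$, and the trial count are arbitrary illustrations.

```python
import math
import random

# Monte Carlo check of the claim E_S[exp(2(m-1) * Delta(h)^2)] <= m
# for a fixed hypothesis h, assuming a Bernoulli(p) per-example loss,
# so Delta(h) = |p - empirical mean over m i.i.d. draws|.
random.seed(0)
m, p, trials = 100, 0.3, 20_000

total = 0.0
for _ in range(trials):
    emp = sum(random.random() < p for _ in range(m)) / m
    delta = abs(emp - p)
    total += math.exp(2 * (m - 1) * delta ** 2)

estimate = total / trials
print(f"E[exp(2(m-1)Delta^2)] ~= {estimate:.3f}  (claimed bound: m = {m})")
```

The estimate sits far below $m$ here, because $\Delta(h)$ concentrates at scale $O(1/\sqrt{m})$, making the exponent $2(m-1)\Delta(h)^2$ typically of constant order.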

Denote the right-hand side of the above by $\delta$; thus $\epsilon = \ln(m/\delta)$, and we therefore
obtain that with probability of at least $1-\delta$ we have that for all $Q$,

$$2(m-1)\,\mathbb{E}_{h\sim Q}(\Delta(h))^2 - D(Q\,\|\,P) \;\le\; \epsilon = \ln(m/\delta).$$

Rearranging the inequality and using Jensen's inequality again (the function $x^2$
is convex) we conclude that

$$\left(\mathbb{E}_{h\sim Q}\,\Delta(h)\right)^2 \;\le\; \mathbb{E}_{h\sim Q}(\Delta(h))^2 \;\le\; \frac{\ln(m/\delta) + D(Q\,\|\,P)}{2(m-1)}. \tag{31.6}$$
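To see what the right-hand side of (31.6) looks like concretely, the snippet below evaluates the bound for a finite hypothesis class. The prior $P$, posterior $Q$, and the values of $m$ and $\delta$ are arbitrary choices for illustration.

```python
import math

def kl_divergence(q, p):
    """D(Q||P) for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

m, delta = 10_000, 0.05
P = [0.25, 0.25, 0.25, 0.25]   # uniform prior over 4 hypotheses
Q = [0.70, 0.10, 0.10, 0.10]   # posterior concentrated on one hypothesis

d = kl_divergence(Q, P)
# Bound on E_{h~Q} Delta(h): square root of the right-hand side of (31.6).
bound = math.sqrt((math.log(m / delta) + d) / (2 * (m - 1)))
print(f"D(Q||P) = {d:.4f}, bound on E_Q[Delta] = {bound:.4f}")
```

Note that the $\ln(m/\delta)$ term dominates the KL term here; moving $Q$ further from $P$ widens the bound only through $D(Q\,\|\,P)$.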

Remark 31.1 (Regularization) The PAC-Bayes bound leads to the following
learning rule:

Given a prior $P$, return a posterior $Q$ that minimizes the function

$$L_S(Q) + \sqrt{\frac{D(Q\,\|\,P) + \ln(m/\delta)}{2(m-1)}}. \tag{31.7}$$

This rule is similar to the regularized risk minimization principle. That is, we
jointly minimize the empirical loss of $Q$ on the sample and the Kullback-Leibler
"distance" between $Q$ and $P$.
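A minimal sketch of this learning rule for a finite hypothesis class is given below. Instead of optimizing over all posteriors, it searches a one-parameter Gibbs family $Q_\beta(h) \propto P(h)\,e^{-\beta L_S(h)}$ — this restricted family is a common heuristic and our own assumption, not a prescription from the text; the empirical losses and the grid of $\beta$ values are made up for illustration.

```python
import math

def kl(q, p):
    """D(Q||P) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def gibbs(prior, losses, beta):
    """Posterior Q_beta(h) proportional to P(h) * exp(-beta * L_S(h))."""
    w = [p * math.exp(-beta * l) for p, l in zip(prior, losses)]
    z = sum(w)
    return [wi / z for wi in w]

def objective(q, prior, losses, m, delta):
    """L_S(Q) plus the square-root regularizer of (31.7)."""
    ls_q = sum(qi * li for qi, li in zip(q, losses))   # L_S(Q)
    penalty = math.sqrt((kl(q, prior) + math.log(m / delta)) / (2 * (m - 1)))
    return ls_q + penalty

m, delta = 1000, 0.05
prior = [0.25] * 4
losses = [0.10, 0.30, 0.32, 0.50]   # empirical losses L_S(h), made up

best_beta, best_val = min(
    ((b, objective(gibbs(prior, losses, b), prior, losses, m, delta))
     for b in [0.0, 1.0, 5.0, 20.0, 100.0]),
    key=lambda t: t[1],
)
print(f"best beta = {best_beta}, objective = {best_val:.4f}")
```

Larger $\beta$ concentrates $Q$ on the lowest-loss hypothesis, reducing $L_S(Q)$ at the cost of a larger $D(Q\,\|\,P)$ inside the penalty — the same empirical-fit versus complexity trade-off that regularized risk minimization makes.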


31.2 Bibliographic Remarks


PAC-Bayes bounds were first introduced by McAllester (1998). See also (McAllester
1999, McAllester 2003, Seeger 2003, Langford & Shawe-Taylor 2003, Langford
2006).


31.3 Exercises



  1. Let $X$ be a random variable that satisfies $\mathbb{P}[X \ge \epsilon] \le e^{-2m\epsilon^2}$. Prove that
     $\mathbb{E}\!\left[e^{2(m-1)X^2}\right] \le m$.