Next, we claim that for all $h$ we have $\mathbb{E}_S\big[e^{2(m-1)\Delta(h)^2}\big] \le m$. To do so, recall that Hoeffding's inequality tells us that
\[
\mathbb{P}_S[\Delta(h) \ge \epsilon] \;\le\; e^{-2m\epsilon^2}.
\]
This implies that $\mathbb{E}_S\big[e^{2(m-1)\Delta(h)^2}\big] \le m$ (see Exercise 1). Combining this with Equation (31.4) and plugging into Equation (31.1) we get
\[
\mathbb{P}_S[f(S) \ge \epsilon] \;\le\; \frac{m}{e^{\epsilon}}. \tag{31.5}
\]
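As a quick sanity check on the claim that $\mathbb{E}_S[e^{2(m-1)\Delta(h)^2}] \le m$, one can estimate this expectation by Monte Carlo for a single fixed hypothesis with Bernoulli losses, where $\Delta(h)$ is the absolute gap between the empirical and the true loss. This is a minimal sketch; the values of $m$, the true loss $p$, and the trial count are arbitrary choices of ours:

```python
import math
import random

def estimate_moment(m=50, p=0.3, trials=20000, seed=0):
    """Monte Carlo estimate of E_S[exp(2(m-1) * Delta(h)^2)], where
    Delta(h) = |L_S(h) - L_D(h)| for i.i.d. Bernoulli(p) losses."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample_loss = sum(rng.random() < p for _ in range(m)) / m
        delta = abs(sample_loss - p)
        total += math.exp(2 * (m - 1) * delta ** 2)
    return total / trials

print(estimate_moment())  # a small constant, comfortably below m = 50
```

The estimate is typically far below $m$; the bound $\le m$ is what the exponential-moment argument needs, not a tight value.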
Denote the right-hand side of the above by $\delta$; thus $\epsilon = \ln(m/\delta)$, and we therefore obtain that with probability of at least $1-\delta$ we have that for all $Q$,
\[
2(m-1)\, \mathbb{E}_{h\sim Q}(\Delta(h))^2 - D(Q\|P) \;\le\; \epsilon = \ln(m/\delta).
\]
Rearranging the inequality and using Jensen's inequality again (the function $x^2$ is convex) we conclude that
\[
\Big( \mathbb{E}_{h\sim Q} \Delta(h) \Big)^2 \;\le\; \mathbb{E}_{h\sim Q} (\Delta(h))^2 \;\le\; \frac{\ln(m/\delta) + D(Q\|P)}{2(m-1)}. \tag{31.6}
\]
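Taking the square root in (31.6) gives a bound on $\mathbb{E}_{h\sim Q}\Delta(h)$ itself, which is easy to evaluate numerically. The helper below is our own illustrative sketch (the function name and sample values are assumptions, not from the text):

```python
import math

def pac_bayes_gap(m, delta, kl):
    """Square root of the right-hand side of (31.6): a bound on
    E_{h~Q} Delta(h), the expected gap between true and empirical loss."""
    return math.sqrt((math.log(m / delta) + kl) / (2 * (m - 1)))

# The gap shrinks as O(1/sqrt(m)) and grows with D(Q||P):
print(pac_bayes_gap(m=1000, delta=0.05, kl=0.0))  # ~0.0704 (posterior equals prior)
print(pac_bayes_gap(m=1000, delta=0.05, kl=5.0))  # ~0.0864 (posterior far from prior)
```

Note that the penalty for moving $Q$ away from $P$ enters only under the square root, so a moderate KL divergence inflates the bound relatively mildly.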
Remark 31.1 (Regularization) The PAC-Bayes bound leads to the following learning rule:

Given a prior $P$, return a posterior $Q$ that minimizes the function
\[
L_S(Q) + \sqrt{\frac{D(Q\|P) + \ln(m/\delta)}{2(m-1)}}. \tag{31.7}
\]
This rule is similar to the \emph{regularized risk minimization} principle. That is, we jointly minimize the empirical loss of $Q$ on the sample and the Kullback-Leibler ``distance'' between $Q$ and $P$.
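To make the rule in (31.7) concrete, suppose both distributions are spherical Gaussians over weight vectors, $P = \mathcal{N}(0, \sigma^2 I)$ and $Q = \mathcal{N}(w, \sigma^2 I)$; then $D(Q\|P) = \|w\|^2/(2\sigma^2)$, and the objective becomes the empirical loss plus an $\ell_2$-style penalty under a square root. A minimal sketch, assuming this Gaussian choice (the function names and parameters are ours, not from the text):

```python
import math

def kl_gaussians(w, sigma2=1.0):
    """D(Q||P) for Q = N(w, sigma2*I) and P = N(0, sigma2*I):
    with equal covariances the KL reduces to ||w||^2 / (2*sigma2)."""
    return sum(wi * wi for wi in w) / (2 * sigma2)

def pac_bayes_objective(empirical_loss, w, m, delta, sigma2=1.0):
    """The learning rule of (31.7): L_S(Q) plus the PAC-Bayes penalty."""
    kl = kl_gaussians(w, sigma2)
    return empirical_loss + math.sqrt((kl + math.log(m / delta)) / (2 * (m - 1)))
```

For a fixed empirical loss the objective grows with $\|w\|$, so minimizing it pulls the posterior mean toward the prior, much like an $\ell_2$ regularizer.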
31.2 Bibliographic Remarks
PAC-Bayes bounds were first introduced by McAllester (1998). See also (McAllester
1999, McAllester 2003, Seeger 2003, Langford & Shawe-Taylor 2003, Langford
2006).
31.3 Exercises
1. Let $X$ be a random variable that satisfies $\mathbb{P}[X \ge \epsilon] \le e^{-2m\epsilon^2}$. Prove that $\mathbb{E}\big[e^{2(m-1)X^2}\big] \le m$.