Next, we claim that for all $h$ we have $\mathbb{E}_S[e^{2(m-1)\Delta(h)^2}] \le m$. To do so, recall that Hoeffding's inequality tells us that
$$\mathbb{P}_S[\Delta(h) \ge \epsilon] \;\le\; e^{-2m\epsilon^2}.$$
This implies that $\mathbb{E}_S[e^{2(m-1)\Delta(h)^2}] \le m$ (see Exercise 1). Combining this with Equation (31.4) and plugging into Equation (31.1) we get
$$\mathbb{P}_S[f(S) \ge \epsilon] \;\le\; \frac{m}{e^{\epsilon}}. \qquad (31.5)$$
Denote the right-hand side of the above $\delta$, thus $\epsilon = \ln(m/\delta)$, and we therefore obtain that with probability of at least $1-\delta$ we have that for all $Q$,
$$2(m-1)\,\mathbb{E}_{h\sim Q}(\Delta(h))^2 - D(Q\,\|\,P) \;\le\; \epsilon = \ln(m/\delta).$$
Rearranging the inequality and using Jensen's inequality again (the function $x^2$ is convex) we conclude that
$$\left(\mathbb{E}_{h\sim Q}\,\Delta(h)\right)^2 \;\le\; \mathbb{E}_{h\sim Q}(\Delta(h))^2 \;\le\; \frac{\ln(m/\delta) + D(Q\,\|\,P)}{2(m-1)}. \qquad (31.6)$$
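Taking square roots on both sides, and recalling that $\Delta(h)$ was defined earlier in the proof as the gap between the true and empirical losses of $h$ (so that, in the chapter's notation, $\mathbb{E}_{h\sim Q}\Delta(h) = L_{\mathcal{D}}(Q) - L_S(Q)$; this identification is assumed here), Equation (31.6) gives the bound in its more familiar form:
$$L_{\mathcal{D}}(Q) \;\le\; L_S(Q) + \sqrt{\frac{\ln(m/\delta) + D(Q\,\|\,P)}{2(m-1)}},$$
which is exactly the quantity minimized by the learning rule in Remark 31.1 below.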
Remark 31.1 (Regularization) The PAC-Bayes bound leads to the following
learning rule:
Given a prior $P$, return a posterior $Q$ that minimizes the function
$$L_S(Q) + \sqrt{\frac{D(Q\,\|\,P) + \ln(m/\delta)}{2(m-1)}}. \qquad (31.7)$$
This rule is similar to the regularized risk minimization principle. That is, we jointly minimize the empirical loss of $Q$ on the sample and the Kullback-Leibler "distance" between $Q$ and $P$.
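To make the learning rule concrete, here is a minimal sketch (not from the book) for the case of a finite hypothesis class, where $P$ and $Q$ are discrete distributions over the hypotheses. It restricts the search to the Gibbs family $Q_c(h) \propto P(h)\,e^{-c\,L_S(h)}$ and picks the temperature $c$ giving the smallest value of the objective in Equation (31.7); this family is a common heuristic rather than a guaranteed exact minimizer, and all function names are illustrative.

```python
import numpy as np

def pac_bayes_objective(Q, P, emp_loss, m, delta):
    """Objective of Equation (31.7): L_S(Q) + sqrt((D(Q||P) + ln(m/delta)) / (2(m-1)))."""
    # Empirical loss of the posterior: E_{h~Q} L_S(h).
    ls_q = np.dot(Q, emp_loss)
    # Kullback-Leibler divergence D(Q||P) for discrete distributions
    # (terms with Q(h) = 0 contribute 0 by convention).
    nz = Q > 0
    kl = np.sum(Q[nz] * np.log(Q[nz] / P[nz]))
    return ls_q + np.sqrt((kl + np.log(m / delta)) / (2 * (m - 1)))

def gibbs_posterior(P, emp_loss, c):
    """Posterior proportional to P(h) * exp(-c * L_S(h))."""
    w = P * np.exp(-c * emp_loss)
    return w / w.sum()

def pac_bayes_learning_rule(P, emp_loss, m, delta,
                            temperatures=np.logspace(-2, 3, 200)):
    """Search over a Gibbs family of posteriors and return the one with the
    smallest value of the bound (31.7)."""
    best_Q, best_val = P, pac_bayes_objective(P, P, emp_loss, m, delta)
    for c in temperatures:
        Q = gibbs_posterior(P, emp_loss, c)
        val = pac_bayes_objective(Q, P, emp_loss, m, delta)
        if val < best_val:
            best_Q, best_val = Q, val
    return best_Q, best_val

# Toy usage: 5 hypotheses, uniform prior, empirical losses on a sample of size m.
if __name__ == "__main__":
    P = np.full(5, 0.2)
    emp_loss = np.array([0.40, 0.25, 0.10, 0.30, 0.50])
    Q, bound = pac_bayes_learning_rule(P, emp_loss, m=1000, delta=0.05)
    print("posterior:", np.round(Q, 3), " bound value:", round(bound, 3))
```

Restricting attention to the Gibbs family keeps the search one-dimensional; minimizing (31.7) exactly would require optimizing over the whole probability simplex.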
31.2 Bibliographic Remarks
PAC-Bayes bounds were first introduced by McAllester (1998). See also (McAllester
1999, McAllester 2003, Seeger 2003, Langford & Shawe-Taylor 2003, Langford
2006).
31.3 Exercises
1. Let $X$ be a random variable that satisfies $\mathbb{P}[X \ge \epsilon] \le e^{-2m\epsilon^2}$. Prove that $\mathbb{E}[e^{2(m-1)X^2}] \le m$.