Understanding Machine Learning: From Theory to Algorithms


Next, we claim that for all $h$ we have $\mathbb{E}_S\!\left[e^{2(m-1)\Delta(h)^2}\right] \le m$. To do so, recall that
Hoeffding's inequality tells us that

$$\mathbb{P}_S[\Delta(h) \ge \epsilon] \le e^{-2m\epsilon^2}.$$

This implies that $\mathbb{E}_S\!\left[e^{2(m-1)\Delta(h)^2}\right] \le m$ (see Exercise 1). Combining this with
Equation (31.4) and plugging into Equation (31.1) we get

$$\mathbb{P}_S[f(S) \ge \epsilon] \;\le\; m\, e^{-\epsilon}. \tag{31.5}$$
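The moment bound above can be checked numerically. The sketch below is a Monte Carlo sanity check (not a proof) under the added assumption that the loss of a fixed hypothesis $h$ on each example is Bernoulli($p$), so that $\Delta(h)$ is the gap between $p$ and the empirical mean of $m$ draws; the values of $m$, $p$, and the trial count are arbitrary illustrations.

```python
import math
import random

# Monte Carlo check of the claim E_S[exp(2(m-1) * Delta(h)^2)] <= m
# for a fixed hypothesis h, assuming a Bernoulli(p) per-example loss,
# so Delta(h) = |p - empirical mean over m i.i.d. draws|.
random.seed(0)
m, p, trials = 100, 0.3, 20_000

total = 0.0
for _ in range(trials):
    emp = sum(random.random() < p for _ in range(m)) / m
    delta = abs(emp - p)
    total += math.exp(2 * (m - 1) * delta ** 2)

estimate = total / trials
print(f"E[exp(2(m-1)Delta^2)] ~= {estimate:.3f}  (claimed bound: m = {m})")
```

The estimate sits far below $m$ here, because $\Delta(h)$ concentrates at scale $O(1/\sqrt{m})$, making the exponent $2(m-1)\Delta(h)^2$ typically of constant order.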

Denote the right-hand side of the above by $\delta$; thus $\epsilon = \ln(m/\delta)$, and we therefore
obtain that with probability of at least $1-\delta$ we have that for all $Q$,

$$2(m-1)\,\mathbb{E}_{h\sim Q}(\Delta(h))^2 - D(Q\,\|\,P) \;\le\; \epsilon = \ln(m/\delta).$$

Rearranging the inequality and using Jensen's inequality again (the function $x^2$
is convex) we conclude that

$$\left(\mathbb{E}_{h\sim Q}\,\Delta(h)\right)^2 \;\le\; \mathbb{E}_{h\sim Q}(\Delta(h))^2 \;\le\; \frac{\ln(m/\delta) + D(Q\,\|\,P)}{2(m-1)}. \tag{31.6}$$
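To see what the right-hand side of (31.6) looks like concretely, the snippet below evaluates the bound for a finite hypothesis class. The prior $P$, posterior $Q$, and the values of $m$ and $\delta$ are arbitrary choices for illustration.

```python
import math

def kl_divergence(q, p):
    """D(Q||P) for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

m, delta = 10_000, 0.05
P = [0.25, 0.25, 0.25, 0.25]   # uniform prior over 4 hypotheses
Q = [0.70, 0.10, 0.10, 0.10]   # posterior concentrated on one hypothesis

d = kl_divergence(Q, P)
# Bound on E_{h~Q} Delta(h): square root of the right-hand side of (31.6).
bound = math.sqrt((math.log(m / delta) + d) / (2 * (m - 1)))
print(f"D(Q||P) = {d:.4f}, bound on E_Q[Delta] = {bound:.4f}")
```

Note that the $\ln(m/\delta)$ term dominates the KL term here; moving $Q$ further from $P$ widens the bound only through $D(Q\,\|\,P)$.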

Remark 31.1 (Regularization) The PAC-Bayes bound leads to the following
learning rule:

Given a prior $P$, return a posterior $Q$ that minimizes the function

$$L_S(Q) + \sqrt{\frac{D(Q\,\|\,P) + \ln(m/\delta)}{2(m-1)}}. \tag{31.7}$$

This rule is similar to the regularized risk minimization principle. That is, we
jointly minimize the empirical loss of $Q$ on the sample and the Kullback-Leibler
"distance" between $Q$ and $P$.
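A minimal sketch of this learning rule for a finite hypothesis class is given below. Instead of optimizing over all posteriors, it searches a one-parameter Gibbs family $Q_\beta(h) \propto P(h)\,e^{-\beta L_S(h)}$ — this restricted family is a common heuristic and our own assumption, not a prescription from the text; the empirical losses and the grid of $\beta$ values are made up for illustration.

```python
import math

def kl(q, p):
    """D(Q||P) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def gibbs(prior, losses, beta):
    """Posterior Q_beta(h) proportional to P(h) * exp(-beta * L_S(h))."""
    w = [p * math.exp(-beta * l) for p, l in zip(prior, losses)]
    z = sum(w)
    return [wi / z for wi in w]

def objective(q, prior, losses, m, delta):
    """L_S(Q) plus the square-root regularizer of (31.7)."""
    ls_q = sum(qi * li for qi, li in zip(q, losses))   # L_S(Q)
    penalty = math.sqrt((kl(q, prior) + math.log(m / delta)) / (2 * (m - 1)))
    return ls_q + penalty

m, delta = 1000, 0.05
prior = [0.25] * 4
losses = [0.10, 0.30, 0.32, 0.50]   # empirical losses L_S(h), made up

best_beta, best_val = min(
    ((b, objective(gibbs(prior, losses, b), prior, losses, m, delta))
     for b in [0.0, 1.0, 5.0, 20.0, 100.0]),
    key=lambda t: t[1],
)
print(f"best beta = {best_beta}, objective = {best_val:.4f}")
```

Larger $\beta$ concentrates $Q$ on the lowest-loss hypothesis, reducing $L_S(Q)$ at the cost of a larger $D(Q\,\|\,P)$ inside the penalty — the same empirical-fit versus complexity trade-off that regularized risk minimization makes.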


31.2 Bibliographic Remarks


PAC-Bayes bounds were first introduced by McAllester (1998). See also (McAllester
1999, McAllester 2003, Seeger 2003, Langford & Shawe-Taylor 2003, Langford
2006).


31.3 Exercises



  1. Let $X$ be a random variable that satisfies $\mathbb{P}[X \ge \epsilon] \le e^{-2m\epsilon^2}$. Prove that
     $\mathbb{E}\!\left[e^{2(m-1)X^2}\right] \le m$.