Understanding Machine Learning: From Theory to Algorithms


PAC-Bayes


We will later show how in some cases this idea leads to the regularized risk minimization principle.

Theorem 31.1  Let $\mathcal{D}$ be an arbitrary distribution over an example domain $Z$. Let $\mathcal{H}$ be a hypothesis class and let $\ell : \mathcal{H} \times Z \to [0,1]$ be a loss function. Let $P$ be a prior distribution over $\mathcal{H}$ and let $\delta \in (0,1)$. Then, with probability of at least $1-\delta$ over the choice of an i.i.d. training set $S = \{z_1,\ldots,z_m\}$ sampled according to $\mathcal{D}$, for all distributions $Q$ over $\mathcal{H}$ (even ones that depend on $S$), we have
\[
L_{\mathcal{D}}(Q) \;\le\; L_S(Q) + \sqrt{\frac{D(Q\,\|\,P) + \ln(m/\delta)}{2(m-1)}},
\]
where
\[
D(Q\,\|\,P) \;\stackrel{\mathrm{def}}{=}\; \mathbb{E}_{h\sim Q}\bigl[\ln\bigl(Q(h)/P(h)\bigr)\bigr]
\]
is the Kullback-Leibler divergence.
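
To make the bound concrete, here is a minimal sketch (illustrative, not part of the text) that evaluates the right-hand side of Theorem 31.1 for a finite hypothesis class, assuming the prior $P$, the posterior $Q$, and the empirical losses $L_S(h)$ are given as arrays; the function name pac_bayes_bound and all numbers below are hypothetical.

import numpy as np

def pac_bayes_bound(emp_loss, Q, P, m, delta):
    """Right-hand side of Theorem 31.1 for a finite hypothesis class.

    emp_loss[i] = L_S(h_i) in [0, 1]; Q and P are probability vectors over H.
    """
    emp_loss, Q, P = map(np.asarray, (emp_loss, Q, P))
    L_S_Q = np.dot(Q, emp_loss)                            # L_S(Q) = E_{h~Q}[L_S(h)]
    kl = np.sum(np.where(Q > 0, Q * np.log(Q / P), 0.0))   # D(Q||P)
    return L_S_Q + np.sqrt((kl + np.log(m / delta)) / (2 * (m - 1)))

# Hypothetical example: three hypotheses, uniform prior, posterior peaked on h_2.
print(pac_bayes_bound(emp_loss=[0.20, 0.10, 0.35],
                      Q=[0.1, 0.8, 0.1],
                      P=[1/3, 1/3, 1/3],
                      m=1000, delta=0.05))   # approximately 0.21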

Proof  For any function $f(S)$, using Markov's inequality:
\[
\Pr_S\bigl[f(S) \ge \epsilon\bigr] \;=\; \Pr_S\bigl[e^{f(S)} \ge e^{\epsilon}\bigr] \;\le\; \frac{\mathbb{E}_S\bigl[e^{f(S)}\bigr]}{e^{\epsilon}}. \tag{31.1}
\]
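
For reference (this recap is not part of the original text), Equation (31.1) is the standard Markov inequality applied to the nonnegative random variable $e^{f(S)}$ with threshold $e^{\epsilon}$; exponentiating preserves the event because $t \mapsto e^t$ is increasing:
\[
\Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a} \qquad \text{for } X \ge 0,\ a > 0, \quad \text{here with } X = e^{f(S)},\ a = e^{\epsilon}.
\]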

Let $\Delta(h) = L_{\mathcal{D}}(h) - L_S(h)$. We will apply Equation (31.1) with the function
\[
f(S) \;=\; \sup_{Q}\,\Bigl( 2(m-1)\, \mathbb{E}_{h\sim Q}\bigl[(\Delta(h))^2\bigr] - D(Q\,\|\,P) \Bigr).
\]

We now turn to bound $\mathbb{E}_S\bigl[e^{f(S)}\bigr]$. The main trick is to upper bound $f(S)$ by an expression that does not depend on $Q$ but rather depends on the prior probability $P$. To do so, fix some $S$ and note that from the definition of $D(Q\,\|\,P)$ we get that for all $Q$,
\[
\begin{aligned}
2(m-1)\, \mathbb{E}_{h\sim Q}\bigl[(\Delta(h))^2\bigr] - D(Q\,\|\,P)
&= \mathbb{E}_{h\sim Q}\Bigl[\ln\Bigl(e^{2(m-1)\Delta(h)^2}\, P(h)/Q(h)\Bigr)\Bigr] \\
&\le \ln\, \mathbb{E}_{h\sim Q}\Bigl[e^{2(m-1)\Delta(h)^2}\, P(h)/Q(h)\Bigr] \\
&= \ln\, \mathbb{E}_{h\sim P}\Bigl[e^{2(m-1)\Delta(h)^2}\Bigr],
\end{aligned}
\tag{31.2}
\]
where the inequality follows from Jensen's inequality and the concavity of the log function, and the last equality holds because $\mathbb{E}_{h\sim Q}\bigl[g(h)\, P(h)/Q(h)\bigr] = \mathbb{E}_{h\sim P}\bigl[g(h)\bigr]$ for any function $g$ (a change of measure from $Q$ to $P$). Therefore,

\[
\mathbb{E}_S\bigl[e^{f(S)}\bigr] \;\le\; \mathbb{E}_S\, \mathbb{E}_{h\sim P}\Bigl[e^{2(m-1)\Delta(h)^2}\Bigr]. \tag{31.3}
\]
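
Spelled out (as a reminder, not part of the original text), the inequality used in Equation (31.2) is Jensen's inequality for the concave logarithm,
\[
\mathbb{E}_{h\sim Q}\bigl[\ln X(h)\bigr] \;\le\; \ln\, \mathbb{E}_{h\sim Q}\bigl[X(h)\bigr],
\qquad \text{applied with } X(h) = e^{2(m-1)\Delta(h)^2}\, P(h)/Q(h).
\]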

The advantage of the expression on the right-hand side of Equation (31.3) stems from the fact that we can switch the order of expectations (because $P$ is a prior that does not depend on $S$), which yields
\[
\mathbb{E}_S\bigl[e^{f(S)}\bigr] \;\le\; \mathbb{E}_{h\sim P}\, \mathbb{E}_S\Bigl[e^{2(m-1)\Delta(h)^2}\Bigr]. \tag{31.4}
\]
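
To see what the switch of expectations buys (this illustration is not part of the text), note that the inner expectation in Equation (31.4) now involves a single fixed hypothesis, so it can be studied one $h$ at a time; the proof continues beyond this excerpt by bounding that fixed-$h$ expectation. The sketch below estimates it by Monte Carlo, assuming for concreteness that the per-example losses $\ell(h, z_i)$ are i.i.d. Bernoulli($p$); the value of $p$ and all other numbers are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
m = 100           # training-set size
p = 0.3           # assumed true risk L_D(h) of the fixed hypothesis h
n_trials = 20000  # number of simulated training sets S

# For each simulated S: losses l(h, z_i) ~ Bernoulli(p), Delta(h) = L_D(h) - L_S(h).
losses = rng.binomial(1, p, size=(n_trials, m))
delta = p - losses.mean(axis=1)
inner = np.exp(2 * (m - 1) * delta**2)   # e^{2(m-1) Delta(h)^2}, one value per S

print("Monte Carlo estimate of E_S[e^{2(m-1) Delta(h)^2}]:", inner.mean())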