Understanding Machine Learning: From Theory to Algorithms


PAC-Bayes


We will later show how in some cases this idea leads to the regularized risk minimization principle.

Theorem 31.1  Let $\mathcal{D}$ be an arbitrary distribution over an example domain $Z$. Let $\mathcal{H}$ be a hypothesis class and let $\ell : \mathcal{H} \times Z \to [0,1]$ be a loss function. Let $P$ be a prior distribution over $\mathcal{H}$ and let $\delta \in (0,1)$. Then, with probability of at least $1-\delta$ over the choice of an i.i.d. training set $S = \{z_1,\ldots,z_m\}$ sampled according to $\mathcal{D}$, for all distributions $Q$ over $\mathcal{H}$ (even ones that depend on $S$), we have
\[
L_{\mathcal{D}}(Q) \;\le\; L_S(Q) + \sqrt{\frac{D(Q\,\|\,P) + \ln(m/\delta)}{2(m-1)}},
\]
where
\[
D(Q\,\|\,P) \;\stackrel{\mathrm{def}}{=}\; \mathbb{E}_{h\sim Q}\bigl[\ln\bigl(Q(h)/P(h)\bigr)\bigr]
\]
is the Kullback-Leibler divergence.
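
To make the bound concrete, here is a minimal sketch (illustrative, not part of the text) that evaluates the right-hand side of Theorem 31.1 for a finite hypothesis class, assuming the prior $P$, the posterior $Q$, and the empirical losses $L_S(h)$ are given as arrays; the function name pac_bayes_bound and all numbers below are hypothetical.

import numpy as np

def pac_bayes_bound(emp_loss, Q, P, m, delta):
    """Right-hand side of Theorem 31.1 for a finite hypothesis class.

    emp_loss[i] = L_S(h_i) in [0, 1]; Q and P are probability vectors over H.
    """
    emp_loss, Q, P = map(np.asarray, (emp_loss, Q, P))
    L_S_Q = np.dot(Q, emp_loss)                            # L_S(Q) = E_{h~Q}[L_S(h)]
    kl = np.sum(np.where(Q > 0, Q * np.log(Q / P), 0.0))   # D(Q||P)
    return L_S_Q + np.sqrt((kl + np.log(m / delta)) / (2 * (m - 1)))

# Hypothetical example: three hypotheses, uniform prior, posterior peaked on h_2.
print(pac_bayes_bound(emp_loss=[0.20, 0.10, 0.35],
                      Q=[0.1, 0.8, 0.1],
                      P=[1/3, 1/3, 1/3],
                      m=1000, delta=0.05))   # approximately 0.21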

Proof  For any function $f(S)$, using Markov's inequality:
\[
\Pr_S\bigl[f(S) \ge \epsilon\bigr] \;=\; \Pr_S\bigl[e^{f(S)} \ge e^{\epsilon}\bigr] \;\le\; \frac{\mathbb{E}_S\bigl[e^{f(S)}\bigr]}{e^{\epsilon}}. \tag{31.1}
\]
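
For reference (this recap is not part of the original text), Equation (31.1) is the standard Markov inequality applied to the nonnegative random variable $e^{f(S)}$ with threshold $e^{\epsilon}$; exponentiating preserves the event because $t \mapsto e^t$ is increasing:
\[
\Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a} \qquad \text{for } X \ge 0,\ a > 0, \quad \text{here with } X = e^{f(S)},\ a = e^{\epsilon}.
\]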

Let $\Delta(h) = L_{\mathcal{D}}(h) - L_S(h)$. We will apply Equation (31.1) with the function
\[
f(S) \;=\; \sup_{Q}\,\Bigl( 2(m-1)\, \mathbb{E}_{h\sim Q}\bigl[(\Delta(h))^2\bigr] - D(Q\,\|\,P) \Bigr).
\]

We now turn to bound $\mathbb{E}_S\bigl[e^{f(S)}\bigr]$. The main trick is to upper bound $f(S)$ by an expression that does not depend on $Q$ but rather depends on the prior probability $P$. To do so, fix some $S$ and note that from the definition of $D(Q\,\|\,P)$ we get that for all $Q$,
\[
\begin{aligned}
2(m-1)\, \mathbb{E}_{h\sim Q}\bigl[(\Delta(h))^2\bigr] - D(Q\,\|\,P)
&= \mathbb{E}_{h\sim Q}\Bigl[\ln\Bigl(e^{2(m-1)\Delta(h)^2}\, P(h)/Q(h)\Bigr)\Bigr] \\
&\le \ln\, \mathbb{E}_{h\sim Q}\Bigl[e^{2(m-1)\Delta(h)^2}\, P(h)/Q(h)\Bigr] \\
&= \ln\, \mathbb{E}_{h\sim P}\Bigl[e^{2(m-1)\Delta(h)^2}\Bigr],
\end{aligned}
\tag{31.2}
\]
where the inequality follows from Jensen's inequality and the concavity of the log function, and the last equality holds because $\mathbb{E}_{h\sim Q}\bigl[g(h)\, P(h)/Q(h)\bigr] = \mathbb{E}_{h\sim P}\bigl[g(h)\bigr]$ for any function $g$ (a change of measure from $Q$ to $P$). Therefore,

\[
\mathbb{E}_S\bigl[e^{f(S)}\bigr] \;\le\; \mathbb{E}_S\, \mathbb{E}_{h\sim P}\Bigl[e^{2(m-1)\Delta(h)^2}\Bigr]. \tag{31.3}
\]
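
Spelled out (as a reminder, not part of the original text), the inequality used in Equation (31.2) is Jensen's inequality for the concave logarithm,
\[
\mathbb{E}_{h\sim Q}\bigl[\ln X(h)\bigr] \;\le\; \ln\, \mathbb{E}_{h\sim Q}\bigl[X(h)\bigr],
\qquad \text{applied with } X(h) = e^{2(m-1)\Delta(h)^2}\, P(h)/Q(h).
\]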

The advantage of the expression on the right-hand side of Equation (31.3) stems from the fact that we can switch the order of expectations (because $P$ is a prior that does not depend on $S$), which yields
\[
\mathbb{E}_S\bigl[e^{f(S)}\bigr] \;\le\; \mathbb{E}_{h\sim P}\, \mathbb{E}_S\Bigl[e^{2(m-1)\Delta(h)^2}\Bigr]. \tag{31.4}
\]
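
To see what the switch of expectations buys (this illustration is not part of the text), note that the inner expectation in Equation (31.4) now involves a single fixed hypothesis, so it can be studied one $h$ at a time; the proof continues beyond this excerpt by bounding that fixed-$h$ expectation. The sketch below estimates it by Monte Carlo, assuming for concreteness that the per-example losses $\ell(h, z_i)$ are i.i.d. Bernoulli($p$); the value of $p$ and all other numbers are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
m = 100           # training-set size
p = 0.3           # assumed true risk L_D(h) of the fixed hypothesis h
n_trials = 20000  # number of simulated training sets S

# For each simulated S: losses l(h, z_i) ~ Bernoulli(p), Delta(h) = L_D(h) - L_S(h).
losses = rng.binomial(1, p, size=(n_trials, m))
delta = p - losses.mean(axis=1)
inner = np.exp(2 * (m - 1) * delta**2)   # e^{2(m-1) Delta(h)^2}, one value per S

print("Monte Carlo estimate of E_S[e^{2(m-1) Delta(h)^2}]:", inner.mean())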