where $d(p, q)$ is the relative entropy between Bernoulli distributions with biases $p$ and $q$ respectively. Now $\Delta$ will be chosen to satisfy $\Delta \le 1/4$. It follows from the entropy inequalities in Eq. (14.15) that
\begin{align}
D(P_0, P_j) \le 4\Delta^2\, \mathbb{E}_0[T_j]\,. \tag{38.23}
\end{align}
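For intuition, here is a sketch of where the factor $4\Delta^2$ could come from, assuming (the construction is not shown in this excerpt) that $P_0$ and $P_j$ differ only in Bernoulli transitions with biases $1/2$ and $1/2 + \Delta$, as the constant $4$ suggests:
\begin{align*}
d\left(\tfrac{1}{2}, \tfrac{1}{2} + \Delta\right)
= -\tfrac{1}{2}\log\left(1 - 4\Delta^2\right)
\le \frac{2\Delta^2}{1 - 4\Delta^2}
\le \frac{8\Delta^2}{3}
\le 4\Delta^2\,,
\end{align*}
where the first inequality uses $-\log(1-x) \le x/(1-x)$ and the remaining two use $\Delta \le 1/4$, which is exactly where that assumption earns its keep. Accumulating this per-visit bound over the $\mathbb{E}_0[T_j]$ expected visits (via the usual divergence decomposition) gives Eq. (38.23).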
Using the fact that $0 \le T_\sigma - T_j \le T_\sigma \le n/D$, Exercise 14.4, Pinsker's inequality (Eq. (14.11)) and Eq. (38.23),
\begin{align*}
\mathbb{E}_j[T_\sigma - T_j]
&\ge \mathbb{E}_0[T_\sigma - T_j] - \frac{n}{D}\sqrt{\frac{D(P_0, P_j)}{2}} \\
&\ge \mathbb{E}_0[T_\sigma - T_j] - \frac{n\Delta}{D}\sqrt{2\,\mathbb{E}_0[T_j]}\,.
\end{align*}
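The first inequality above is an instance of a generic pattern: for any random variable $X$ with $0 \le X \le b$ and measures $P, Q$,
\begin{align*}
\mathbb{E}_Q[X] \ge \mathbb{E}_P[X] - b\,\delta(P, Q) \ge \mathbb{E}_P[X] - b\sqrt{\frac{D(P, Q)}{2}}\,,
\end{align*}
where $\delta(P, Q)$ denotes the total variation distance. Here $X = T_\sigma - T_j$ and $b = n/D$; the first step is presumably the content of Exercise 14.4, and the second is Pinsker's inequality.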
Summing over $j$ and applying Cauchy-Schwarz yields
\begin{align}
\sum_{j=1}^k \mathbb{E}_j[T_\sigma - T_j]
&\ge \sum_{j=1}^k \mathbb{E}_0[T_\sigma - T_j] - \frac{n\Delta}{D}\sum_{j=1}^k \sqrt{2\,\mathbb{E}_0[T_j]} \nonumber\\
&\ge (k-1)\,\mathbb{E}_0[T_\sigma] - \frac{n\Delta}{D}\sqrt{2k\,\mathbb{E}_0[T_\sigma]} \nonumber\\
&\ge \frac{c_1 n(k-1)}{D} - \frac{n\Delta}{D}\sqrt{\frac{2 c_2 n k}{D}} \nonumber\\
&\ge \frac{c_1 n(k-1)}{2D}\,, \tag{38.24}
\end{align}
where the second inequality uses $\sum_{j=1}^k T_j \le T_\sigma$ (so that $\sum_{j=1}^k \mathbb{E}_0[T_j] \le \mathbb{E}_0[T_\sigma]$) together with Cauchy-Schwarz, and the last inequality follows by choosing
\begin{align*}
\Delta = \frac{c_1 (k-1)}{2}\sqrt{\frac{D}{2 c_2 n k}}\,.
\end{align*}
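Substituting this choice back into the second term of the previous display confirms that it eats exactly half of the first term, a short calculation worth recording:
\begin{align*}
\frac{n\Delta}{D}\sqrt{\frac{2 c_2 n k}{D}}
= \frac{n}{D}\cdot\frac{c_1(k-1)}{2}\sqrt{\frac{D}{2 c_2 n k}}\cdot\sqrt{\frac{2 c_2 n k}{D}}
= \frac{c_1 n(k-1)}{2D}\,.
\end{align*}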
By Eq. (38.24) there exists a $j \in [k]$ such that
\begin{align*}
\mathbb{E}_j[T_\sigma - T_j] \ge \frac{c_1 n(k-1)}{2Dk}\,,
\end{align*}
since the largest of the $k$ summands is at least their average.
Then for the last step apply Claim 38.11 to show that
\begin{align*}
R_n^j \ge c_3 D \Delta\, \mathbb{E}_j[T_\sigma - T_j]
\ge \frac{c_1^2 c_3\, n(k-1)^2}{4k}\sqrt{\frac{D}{2 c_2 n k}}\,.
\end{align*}
Naive bounding and simplification concludes the proof.
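As one reading of what the naive bounding might look like, assume $k \ge 2$ so that $(k-1)^2 \ge k^2/4$; then
\begin{align*}
R_n^j \ge \frac{c_1^2 c_3\, n(k-1)^2}{4k}\sqrt{\frac{D}{2 c_2 n k}}
\ge \frac{c_1^2 c_3\, n k}{16}\sqrt{\frac{D}{2 c_2 n k}}
= \frac{c_1^2 c_3}{16}\sqrt{\frac{D n k}{2 c_2}}\,,
\end{align*}
a lower bound of order $\sqrt{Dnk}$. One should also check that the chosen $\Delta$ satisfies $\Delta \le 1/4$, which holds once $n$ is sufficiently large relative to $D$ and $k$.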
38.8 Notes
1 MDPs in applications can have millions (or “Billions and Billions”) of states,
which should make the reader worried that the bound in Theorem 38.6 could
be extremely large. The takeaway should be that learning in large MDPs
without additional assumptions is hard, as attested by the lower bound in
Theorem 38.7.