Bandit Algorithms

15.5 Exercises

(a) Use Pinsker’s inequality (Eq. (14.11)), Lemma 15.1 and the result of Exercise 14.4 to show
\[
\mathbb{E}_i[T_i(n)] \le \mathbb{E}_0[T_i(n)] + n\sqrt{\frac{1}{4}\Delta^2\,\mathbb{E}_0[T_i(n)]} = \mathbb{E}_0[T_i(n)] + c\sqrt{nk\,\mathbb{E}_0[T_i(n)]}\,.
\]
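
A minimal sketch of one route to this bound, under the assumption (taken from the setup of the exercise and from part (c)) that the arms are unit-variance Gaussian with $\mu^{(0)} = 0$, $\mu^{(i)}_j = \Delta\,\mathbf{1}\{j = i\}$ and $\Delta = 2c\sqrt{k/n}$: since $T_i(n) \in [0, n]$,
\[
\mathbb{E}_i[T_i(n)] - \mathbb{E}_0[T_i(n)] \le n\sqrt{\tfrac{1}{2}\,\mathrm{D}(\mathbb{P}_0, \mathbb{P}_i)} = n\sqrt{\tfrac{1}{2}\,\mathbb{E}_0[T_i(n)]\,\tfrac{\Delta^2}{2}}\,,
\]
where the inequality combines the bound on expectations of bounded random variables with Pinsker’s inequality, and the equality uses the divergence decomposition (Lemma 15.1) together with $\mathrm{D}(\mathcal{N}(0,1), \mathcal{N}(\Delta,1)) = \Delta^2/2$. Substituting $\Delta = 2c\sqrt{k/n}$ then gives the second form in the display above.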

(b) Using the previous part, Jensen’s inequality and the identity $\sum_{i=1}^k \mathbb{E}_0[T_i(n)] = n$, show that
\[
\sum_{i=1}^k \mathbb{E}_i[T_i(n)] \le n + c\sum_{i=1}^k \sqrt{nk\,\mathbb{E}_0[T_i(n)]} \le n + ckn\,.
\]
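
For the last inequality, one way to carry out the Jensen step (using only the identity above): by concavity of $x \mapsto \sqrt{x}$,
\[
\sum_{i=1}^k \sqrt{\mathbb{E}_0[T_i(n)]} \le k\sqrt{\frac{1}{k}\sum_{i=1}^k \mathbb{E}_0[T_i(n)]} = \sqrt{kn}\,,
\]
so $c\sqrt{nk}\sum_{i=1}^k \sqrt{\mathbb{E}_0[T_i(n)]} \le c\sqrt{nk}\cdot\sqrt{kn} = ckn$.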

(c) Let $R_i = R_n(\pi, G_{\mu^{(i)}})$. Find a choice of $c > 0$ for which
\[
\sum_{i=1}^k R_i = \Delta \sum_{i=1}^k \bigl(n - \mathbb{E}_i[T_i(n)]\bigr) \ge \Delta\bigl(nk - n - ckn\bigr) = 2c\sqrt{\frac{k}{n}}\bigl(nk - n - ckn\bigr) \ge \frac{nk}{8}\sqrt{\frac{k}{n}}\,.
\]
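
A possible sanity check for the constant (a sketch, assuming $k \ge 2$): since $n \le nk/2$,
\[
2c\,(nk - n - ckn) \ge 2c\left(\tfrac{1}{2} - c\right)nk\,,
\]
and the right-hand side equals $nk/8$ for $c = 1/4$, so this choice of $c$ (and hence $\Delta = \tfrac{1}{2}\sqrt{k/n}$) makes the final inequality hold.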

(d) Conclude there exists an $i \in [k]$ such that
\[
R_i \ge \frac{1}{8}\sqrt{kn}\,.
\]
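
One way to see part (d) from part (c) is the usual averaging argument: the maximum is at least the mean, so
\[
\max_{i \in [k]} R_i \ge \frac{1}{k}\sum_{i=1}^k R_i \ge \frac{1}{k}\cdot\frac{nk}{8}\sqrt{\frac{k}{n}} = \frac{1}{8}\sqrt{kn}\,.
\]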

The method used in this exercise is borrowed from Auer et al. [2002b] and
is closely related to the lower bound technique known as Assouad’s method
in statistics [Yu, 1997].

15.3 (Lower bound for small horizons) Let $k > 1$ and $n < k$. Prove that for any policy $\pi$ there exists a Gaussian bandit with unit variance and means $\mu \in [0,1]^k$ such that $R_n(\pi, \nu_\mu) \ge n(2k - n - 1)/(2k) \ge n/2$.
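
A possible construction (a sketch only; this placement argument is an assumption, not necessarily the intended route): hide a single arm with mean $1$ among $k-1$ arms with mean $0$. Until the good arm is pulled for the first time, the learner’s observations are distributed identically for every placement of the good arm among the not-yet-pulled arms, so if the placement is uniformly random, the probability that the $j$-th distinct arm tried is the good one is at most $1/k$. Hence
\[
\mathbb{E}[T_{i^*}(n)] \le \frac{1}{k}\sum_{j=1}^{n} (n - j + 1) = \frac{n(n+1)}{2k}\,,
\qquad
R_n(\pi, \nu_\mu) \ge n - \frac{n(n+1)}{2k} = \frac{n(2k - n - 1)}{2k}\,.
\]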


15.4 (Lower bounds for Bernoulli bandits) Recall from Table 4.1 that $\mathcal{E}_B^k$ is the set of $k$-armed Bernoulli bandits. Show that there exists a universal constant $c > 0$ such that for any $2 \le k \le n$ it holds that:
\[
R_n^*(\mathcal{E}_B^k) = \inf_\pi \sup_{\nu \in \mathcal{E}_B^k} R_n(\pi, \nu) \ge c\sqrt{nk}\,.
\]

Hint Use the fact that the KL divergence is upper bounded by the χ-squared distance (Eq. (14.15)).
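
A hedged sketch of how the hint might be used (the particular means and constants below are illustrative assumptions): mimic the Gaussian argument with a Bernoulli baseline of $1/2$ and a bumped arm of mean $1/2 + \Delta$ for some $\Delta \le 1/4$. The χ-squared bound gives
\[
\mathrm{D}\bigl(\mathrm{Ber}(1/2), \mathrm{Ber}(1/2+\Delta)\bigr) \le \frac{\Delta^2}{(1/2+\Delta)(1/2-\Delta)} = \frac{\Delta^2}{1/4 - \Delta^2} \le \frac{16}{3}\,\Delta^2\,,
\]
which plays the same role as the identity $\mathrm{D}(\mathcal{N}(0,1), \mathcal{N}(\Delta,1)) = \Delta^2/2$ in the Gaussian case, at the price of a larger constant.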

15.5 In Chapter 9 we proved that if $\pi$ is the MOSS policy and $\nu \in \mathcal{E}_{\mathrm{SG}}^k(1)$, then
\[
R_n(\pi, \nu) \le C\left(\sqrt{kn} + \sum_{i:\Delta_i > 0} \Delta_i\right),
\]
