Bandit Algorithms

15.5 Exercises

(a) Use Pinsker’s inequality (Eq. (14.11)), Lemma 15.1 and the result of Exercise 14.4 to show
\[
\mathbb{E}_i[T_i(n)] \le \mathbb{E}_0[T_i(n)] + n\sqrt{\frac{1}{4}\Delta^2\,\mathbb{E}_0[T_i(n)]} = \mathbb{E}_0[T_i(n)] + c\sqrt{nk\,\mathbb{E}_0[T_i(n)]}\,.
\]
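
A minimal sketch of one route to this bound, under the assumption (taken from the setup of the exercise and from part (c)) that the arms are unit-variance Gaussian with $\mu^{(0)} = 0$, $\mu^{(i)}_j = \Delta\,\mathbf{1}\{j = i\}$ and $\Delta = 2c\sqrt{k/n}$: since $T_i(n) \in [0, n]$,
\[
\mathbb{E}_i[T_i(n)] - \mathbb{E}_0[T_i(n)] \le n\sqrt{\tfrac{1}{2}\,\mathrm{D}(\mathbb{P}_0, \mathbb{P}_i)} = n\sqrt{\tfrac{1}{2}\,\mathbb{E}_0[T_i(n)]\,\tfrac{\Delta^2}{2}}\,,
\]
where the inequality combines the bound on expectations of bounded random variables with Pinsker’s inequality, and the equality uses the divergence decomposition (Lemma 15.1) together with $\mathrm{D}(\mathcal{N}(0,1), \mathcal{N}(\Delta,1)) = \Delta^2/2$. Substituting $\Delta = 2c\sqrt{k/n}$ then gives the second form in the display above.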

(b) Using the previous part, Jensen’s inequality and the identity $\sum_{i=1}^k \mathbb{E}_0[T_i(n)] = n$, show that
\[
\sum_{i=1}^k \mathbb{E}_i[T_i(n)] \le n + c\sum_{i=1}^k \sqrt{nk\,\mathbb{E}_0[T_i(n)]} \le n + ckn\,.
\]
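
For the last inequality, one way to carry out the Jensen step (using only the identity above): by concavity of $x \mapsto \sqrt{x}$,
\[
\sum_{i=1}^k \sqrt{\mathbb{E}_0[T_i(n)]} \le k\sqrt{\frac{1}{k}\sum_{i=1}^k \mathbb{E}_0[T_i(n)]} = \sqrt{kn}\,,
\]
so $c\sqrt{nk}\sum_{i=1}^k \sqrt{\mathbb{E}_0[T_i(n)]} \le c\sqrt{nk}\cdot\sqrt{kn} = ckn$.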

(c) Let $R_i = R_n(\pi, G_{\mu^{(i)}})$. Find a choice of $c > 0$ for which
\[
\sum_{i=1}^k R_i = \Delta \sum_{i=1}^k \bigl(n - \mathbb{E}_i[T_i(n)]\bigr) \ge \Delta\bigl(nk - n - ckn\bigr) = 2c\sqrt{\frac{k}{n}}\bigl(nk - n - ckn\bigr) \ge \frac{nk}{8}\sqrt{\frac{k}{n}}\,.
\]
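
A possible sanity check for the constant (a sketch, assuming $k \ge 2$): since $n \le nk/2$,
\[
2c\,(nk - n - ckn) \ge 2c\left(\tfrac{1}{2} - c\right)nk\,,
\]
and the right-hand side equals $nk/8$ for $c = 1/4$, so this choice of $c$ (and hence $\Delta = \tfrac{1}{2}\sqrt{k/n}$) makes the final inequality hold.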

(d) Conclude there exists an $i \in [k]$ such that
\[
R_i \ge \frac{1}{8}\sqrt{kn}\,.
\]
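
One way to see part (d) from part (c) is the usual averaging argument: the maximum is at least the mean, so
\[
\max_{i \in [k]} R_i \ge \frac{1}{k}\sum_{i=1}^k R_i \ge \frac{1}{k}\cdot\frac{nk}{8}\sqrt{\frac{k}{n}} = \frac{1}{8}\sqrt{kn}\,.
\]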

The method used in this exercise is borrowed from Auer et al. [2002b] and
is closely related to the lower bound technique known as Assouad’s method
in statistics [Yu, 1997].

15.3 (Lower bound for small horizons) Let $k > 1$ and $n < k$. Prove that for any policy $\pi$ there exists a Gaussian bandit with unit variance and means $\mu \in [0,1]^k$ such that $R_n(\pi, \nu_\mu) \ge n(2k - n - 1)/(2k) \ge n/2$.
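
A possible construction (a sketch only; this placement argument is an assumption, not necessarily the intended route): hide a single arm with mean $1$ among $k-1$ arms with mean $0$. Until the good arm is pulled for the first time, the learner’s observations are distributed identically for every placement of the good arm among the not-yet-pulled arms, so if the placement is uniformly random, the probability that the $j$-th distinct arm tried is the good one is at most $1/k$. Hence
\[
\mathbb{E}[T_{i^*}(n)] \le \frac{1}{k}\sum_{j=1}^{n} (n - j + 1) = \frac{n(n+1)}{2k}\,,
\qquad
R_n(\pi, \nu_\mu) \ge n - \frac{n(n+1)}{2k} = \frac{n(2k - n - 1)}{2k}\,.
\]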


15.4 (Lower bounds for Bernoulli bandits) Recall from Table 4.1 that $\mathcal{E}_B^k$ is the set of $k$-armed Bernoulli bandits. Show that there exists a universal constant $c > 0$ such that for any $2 \le k \le n$ it holds that:
\[
R_n^*(\mathcal{E}_B^k) = \inf_\pi \sup_{\nu \in \mathcal{E}_B^k} R_n(\pi, \nu) \ge c\sqrt{nk}\,.
\]

Hint Use the fact that the KL divergence is upper bounded by the χ-squared distance (Eq. (14.15)).
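
A hedged sketch of how the hint might be used (the particular means and constants below are illustrative assumptions): mimic the Gaussian argument with a Bernoulli baseline of $1/2$ and a bumped arm of mean $1/2 + \Delta$ for some $\Delta \le 1/4$. The χ-squared bound gives
\[
\mathrm{D}\bigl(\mathrm{Ber}(1/2), \mathrm{Ber}(1/2+\Delta)\bigr) \le \frac{\Delta^2}{(1/2+\Delta)(1/2-\Delta)} = \frac{\Delta^2}{1/4 - \Delta^2} \le \frac{16}{3}\,\Delta^2\,,
\]
which plays the same role as the identity $\mathrm{D}(\mathcal{N}(0,1), \mathcal{N}(\Delta,1)) = \Delta^2/2$ in the Gaussian case, at the price of a larger constant.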

15.5 In Chapter 9 we proved that if $\pi$ is the MOSS policy and $\nu \in \mathcal{E}_{\mathrm{SG}}^k(1)$, then
\[
R_n(\pi, \nu) \le C\left(\sqrt{kn} + \sum_{i:\Delta_i > 0} \Delta_i\right),
\]
