Bandit Algorithms


[Uchibe and Doya, 2004, Wawrzyński and Pacut, 2007, Ionides, 2008, Bottou et al., 2013]. All of these papers are based on truncating the estimators, which makes the resulting estimator less smooth. Surprisingly, the variance reduction technique used in this chapter seems to be recent [Kocák et al., 2014].

12.5 Exercises


12.1 (Exp3.P) In this exercise we ask you to analyze the Exp3.P algorithm, which, as we mentioned in the notes, is another way to obtain high-probability bounds. The idea is to modify Exp3 by biasing the estimators and introducing some forced exploration. Let $\hat Y_{ti} = A_{ti} y_{ti} / P_{ti} - \beta / P_{ti}$ be a biased version of the loss-based importance-weighted estimator that was used in the previous chapter. Define $\hat L_{ti} = \sum_{s=1}^t \hat Y_{si}$ and consider the policy that samples $A_t \sim P_t$, where
$$
P_{ti} = (1-\gamma) \tilde P_{ti} + \frac{\gamma}{k}
\qquad \text{with} \qquad
\tilde P_{ti} = \frac{\exp\left(-\eta \hat L_{t-1,i}\right)}{\sum_{j=1}^k \exp\left(-\eta \hat L_{t-1,j}\right)}\,.
$$
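As a concrete illustration of the sampling scheme above, the following is a minimal NumPy sketch of one round of Exp3.P: form the exponential-weights distribution from the cumulative biased estimates, mix in forced exploration, sample an arm, and update with the biased importance-weighted estimator. The function name, parameter values, and the driver loop are illustrative, not part of the text.

```python
import numpy as np

def exp3p_round(rng, L_hat, y, eta, gamma, beta):
    """One round of Exp3.P: sample an arm and update the loss estimates.

    L_hat : cumulative biased loss estimates hat-L_{t-1}, shape (k,)
    y     : this round's losses in [0, 1], shape (k,)
            (the learner only ever uses the chosen arm's loss)
    """
    k = len(L_hat)
    # Exponential-weights distribution tilde-P_t; shift by the minimum
    # cumulative estimate for numerical stability (does not change the ratio).
    w = np.exp(-eta * (L_hat - L_hat.min()))
    P_tilde = w / w.sum()
    # Mix with uniform exploration: P_t = (1 - gamma) tilde-P_t + gamma / k.
    P = (1 - gamma) * P_tilde + gamma / k
    A = rng.choice(k, p=P)
    # Biased estimator: hat-Y_{ti} = A_{ti} y_{ti} / P_{ti} - beta / P_{ti}.
    Y_hat = -beta / P
    Y_hat[A] += y[A] / P[A]
    return A, L_hat + Y_hat

rng = np.random.default_rng(0)
L_hat = np.zeros(3)
for _ in range(100):
    y = rng.uniform(size=3)
    A, L_hat = exp3p_round(rng, L_hat, y, eta=0.1, gamma=0.1, beta=0.05)
```

Note that $P_t$ is a valid distribution by construction, since the mixture weights $(1-\gamma)$ and $\gamma$ sum to one.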


(a) Let $\delta \in (0,1)$ and $i \in [k]$. Show that with probability $1-\delta$, the random regret $\hat R_{ni}$ against $i$ (cf. (12.2)) satisfies
$$
\hat R_{ni} < n\gamma + (1-\gamma) \sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \left(\hat Y_{tj} - y_{ti}\right) + \sum_{t=1}^n \frac{\beta}{P_{t A_t}} + \sqrt{\frac{n \log(1/\delta)}{2}}\,.
$$


(b) Show that
$$
\sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \left(\hat Y_{tj} - y_{ti}\right) = \sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \left(\hat Y_{tj} - \hat Y_{ti}\right) + \sum_{t=1}^n \left(\hat Y_{ti} - y_{ti}\right)\,.
$$

(c) Show that
$$
\sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \left(\hat Y_{tj} - \hat Y_{ti}\right) \le \frac{\log(k)}{\eta} + \eta \sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \hat Y_{tj}^2\,.
$$

(d) Show that
$$
\sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \hat Y_{tj}^2 \le \frac{n k^2 \beta^2}{\gamma} + \sum_{t=1}^n \frac{1}{P_{t A_t}}\,.
$$


(e) Apply the result of Exercise 5.15 to show that for any $\delta \in (0,1)$, the following hold:
$$
\mathbb{P}\left(\sum_{t=1}^n \frac{1}{P_{t A_t}} \ge 2nk + \frac{k}{\gamma} \log\left(\frac{1}{\delta}\right)\right) \le \delta\,,
$$
$$
\mathbb{P}\left(\sum_{t=1}^n \hat Y_{ti} - y_{ti} \ge \frac{1}{\beta} \log\left(\frac{1}{\delta}\right)\right) \le \delta\,.
$$
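The second tail bound rests on the downward bias that the $-\beta/P_{ti}$ term builds into $\hat Y_{ti}$: conditionally on $P_t$, $\mathbb{E}[\hat Y_{ti}] = y_{ti} - \beta/P_{ti} \le y_{ti}$. A quick Monte Carlo sanity check of this bias, with arbitrary illustrative values for $P_{ti}$, $y_{ti}$ and $\beta$ (not taken from the exercise):

```python
import numpy as np

# With A_i ~ Bernoulli(P_i) held fixed across rounds,
#   E[Y_hat_i] = E[A_i] * y_i / P_i - beta / P_i = y_i - beta / P_i,
# so the estimator under-estimates the loss by exactly beta / P_i.
rng = np.random.default_rng(1)
P_i, y_i, beta = 0.3, 0.6, 0.05          # illustrative values
A = rng.random(1_000_000) < P_i          # indicator that arm i was played
Y_hat = A * y_i / P_i - beta / P_i       # biased importance-weighted estimates
empirical = Y_hat.mean()
expected = y_i - beta / P_i
```

The empirical mean matches $y_i - \beta/P_i$ up to Monte Carlo error, which is the fact the concentration argument in (e) exploits.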