Bandit Algorithms


[Uchibe and Doya, 2004, Wawrzyński and Pacut, 2007, Ionides, 2008, Bottou et al., 2013]. All of these papers are based on truncating the estimators, which makes the resulting estimator less smooth. Surprisingly, the variance reduction technique used in this chapter seems to be recent [Kocák et al., 2014].

12.5 Exercises


12.1 (Exp3.P) In this exercise we ask you to analyze the Exp3.P algorithm, which, as we mentioned in the notes, is another way to obtain high-probability bounds. The idea is to modify Exp3 by biasing the estimators and introducing some forced exploration. Let $\hat Y_{ti} = A_{ti} y_{ti} / P_{ti} - \beta / P_{ti}$ be a biased version of the loss-based importance-weighted estimator that was used in the previous chapter. Define $\hat L_{ti} = \sum_{s=1}^t \hat Y_{si}$ and consider the policy that samples $A_t \sim P_t$, where
$$
P_{ti} = (1-\gamma) \tilde P_{ti} + \frac{\gamma}{k}
\qquad \text{with} \qquad
\tilde P_{ti} = \frac{\exp\left(-\eta \hat L_{t-1,i}\right)}{\sum_{j=1}^k \exp\left(-\eta \hat L_{t-1,j}\right)}\,.
$$
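As a concrete illustration of the sampling scheme above, the following is a minimal NumPy sketch of one round of Exp3.P: form the exponential-weights distribution from the cumulative biased estimates, mix in forced exploration, sample an arm, and update with the biased importance-weighted estimator. The function name, parameter values, and the driver loop are illustrative, not part of the text.

```python
import numpy as np

def exp3p_round(rng, L_hat, y, eta, gamma, beta):
    """One round of Exp3.P: sample an arm and update the loss estimates.

    L_hat : cumulative biased loss estimates hat-L_{t-1}, shape (k,)
    y     : this round's losses in [0, 1], shape (k,)
            (the learner only ever uses the chosen arm's loss)
    """
    k = len(L_hat)
    # Exponential-weights distribution tilde-P_t; shift by the minimum
    # cumulative estimate for numerical stability (does not change the ratio).
    w = np.exp(-eta * (L_hat - L_hat.min()))
    P_tilde = w / w.sum()
    # Mix with uniform exploration: P_t = (1 - gamma) tilde-P_t + gamma / k.
    P = (1 - gamma) * P_tilde + gamma / k
    A = rng.choice(k, p=P)
    # Biased estimator: hat-Y_{ti} = A_{ti} y_{ti} / P_{ti} - beta / P_{ti}.
    Y_hat = -beta / P
    Y_hat[A] += y[A] / P[A]
    return A, L_hat + Y_hat

rng = np.random.default_rng(0)
L_hat = np.zeros(3)
for _ in range(100):
    y = rng.uniform(size=3)
    A, L_hat = exp3p_round(rng, L_hat, y, eta=0.1, gamma=0.1, beta=0.05)
```

Note that $P_t$ is a valid distribution by construction, since the mixture weights $(1-\gamma)$ and $\gamma$ sum to one.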


(a) Let $\delta \in (0,1)$ and $i \in [k]$. Show that with probability $1-\delta$, the random regret $\hat R_{ni}$ against $i$ (cf. (12.2)) satisfies
$$
\hat R_{ni} < n\gamma + (1-\gamma) \sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \left(\hat Y_{tj} - y_{ti}\right) + \sum_{t=1}^n \frac{\beta}{P_{t A_t}} + \sqrt{\frac{n \log(1/\delta)}{2}}\,.
$$


(b) Show that
$$
\sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \left(\hat Y_{tj} - y_{ti}\right) = \sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \left(\hat Y_{tj} - \hat Y_{ti}\right) + \sum_{t=1}^n \left(\hat Y_{ti} - y_{ti}\right)\,.
$$

(c) Show that
$$
\sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \left(\hat Y_{tj} - \hat Y_{ti}\right) \le \frac{\log(k)}{\eta} + \eta \sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \hat Y_{tj}^2\,.
$$

(d) Show that
$$
\sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj} \hat Y_{tj}^2 \le \frac{n k^2 \beta^2}{\gamma} + \sum_{t=1}^n \frac{1}{P_{t A_t}}\,.
$$


(e) Apply the result of Exercise 5.15 to show that for any $\delta \in (0,1)$, the following hold:
$$
\mathbb{P}\left(\sum_{t=1}^n \frac{1}{P_{t A_t}} \ge 2nk + \frac{k}{\gamma} \log\left(\frac{1}{\delta}\right)\right) \le \delta\,,
$$
$$
\mathbb{P}\left(\sum_{t=1}^n \hat Y_{ti} - y_{ti} \ge \frac{1}{\beta} \log\left(\frac{1}{\delta}\right)\right) \le \delta\,.
$$
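The second tail bound rests on the downward bias that the $-\beta/P_{ti}$ term builds into $\hat Y_{ti}$: conditionally on $P_t$, $\mathbb{E}[\hat Y_{ti}] = y_{ti} - \beta/P_{ti} \le y_{ti}$. A quick Monte Carlo sanity check of this bias, with arbitrary illustrative values for $P_{ti}$, $y_{ti}$ and $\beta$ (not taken from the exercise):

```python
import numpy as np

# With A_i ~ Bernoulli(P_i) held fixed across rounds,
#   E[Y_hat_i] = E[A_i] * y_i / P_i - beta / P_i = y_i - beta / P_i,
# so the estimator under-estimates the loss by exactly beta / P_i.
rng = np.random.default_rng(1)
P_i, y_i, beta = 0.3, 0.6, 0.05          # illustrative values
A = rng.random(1_000_000) < P_i          # indicator that arm i was played
Y_hat = A * y_i / P_i - beta / P_i       # biased importance-weighted estimates
empirical = Y_hat.mean()
expected = y_i - beta / P_i
```

The empirical mean matches $y_i - \beta/P_i$ up to Monte Carlo error, which is the fact the concentration argument in (e) exploits.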