Bandit Algorithms


straightforward, but for adversarial bandits there is an additional step. Notice
that under $Q_i$ it holds that $X_{ti} - X_{tA_t} \ge 0$, and that if $X_{ti}, X_{tA_t} \in (0,1)$ and
$A_t \ne i$, then $X_{ti} - X_{tA_t} \ge \Delta$. From this we conclude that

\[
\hat R_n \ge \Delta\Bigl(n - T_i(n) - \sum_{t=1}^n \mathbb{I}\{\exists\, j \in [k] : X_{tj} \in \{0,1\}\}\Bigr). \tag{17.8}
\]
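
One way to make the step explicit (a sketch using only the two observations above and the usual definitions, with $T_i(n)$ denoting the number of rounds in which arm $i$ is played):
\[
\begin{aligned}
\hat R_n
  &\ge \sum_{t=1}^n \bigl(X_{ti} - X_{tA_t}\bigr)
   \ge \sum_{t=1}^n \bigl(X_{ti} - X_{tA_t}\bigr)\,
       \mathbb{I}\{A_t \ne i \text{ and } X_{tj} \notin \{0,1\} \text{ for all } j\} \\
  &\ge \Delta \sum_{t=1}^n
       \mathbb{I}\{A_t \ne i \text{ and } X_{tj} \notin \{0,1\} \text{ for all } j\}
   \ge \Delta\Bigl(n - T_i(n) - \sum_{t=1}^n \mathbb{I}\{\exists\, j \in [k] : X_{tj} \in \{0,1\}\}\Bigr),
\end{aligned}
\]
where the second inequality uses that every dropped term is nonnegative under $Q_i$, the third uses the gap of at least $\Delta$ on unclipped rounds with $A_t \ne i$, and the last simply counts the discarded rounds.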


The following claim upper bounds the number of rounds in which clipping occurs
with high probability.

Claim 17.7. If $\sigma = 1/10$ and $\Delta < 1/8$ and $n \ge 32 \log(1/\delta)$, then

\[
\mathbb{P}_{Q_i}\Bigl(\sum_{t=1}^n \mathbb{I}\{\exists\, j \in [k] : X_{tj} \in \{0,1\}\} \ge \frac{n}{4}\Bigr) \le \delta.
\]
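
The proof is not reproduced here, but a heuristic calculation suggests where the constants come from. Suppose, as an assumption for this sketch only, that the clipping events are independent across rounds under $Q_i$ and that the per-round clipping probability $p = \mathbb{P}_{Q_i}(\exists\, j \in [k] : X_{tj} \in \{0,1\})$ is at most $1/8$ (plausible when the unclipped rewards are Gaussian with standard deviation $\sigma = 1/10$ and means bounded away from $0$ and $1$). Hoeffding's inequality then gives
\[
\mathbb{P}_{Q_i}\Bigl(\sum_{t=1}^n \mathbb{I}\{\exists\, j \in [k] : X_{tj} \in \{0,1\}\} \ge \frac{n}{4}\Bigr)
\le \exp\bigl(-2n(1/4 - p)^2\bigr)
\le \exp(-n/32) \le \delta
\]
whenever $n \ge 32 \log(1/\delta)$, which matches the condition in the claim.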

Combining Claim 17.6 and Claim 17.7 with Eq. (17.8) shows there exists an
arm $i$ such that

\[
\mathbb{P}_{Q_i}\Bigl(\hat R_n \ge \frac{n\Delta}{4}\Bigr) \ge \delta\,,
\]

which by the definition of $\Delta$ and Claim 17.5 implies Theorem 17.4.
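
The arithmetic behind the threshold $n\Delta/4$ is short. Suppose (this is an assumption about the statement of Claim 17.6, which is not reproduced here) that it supplies an arm $i$ with $\mathbb{P}_{Q_i}(T_i(n) \le n/2) \ge 2\delta$. On the event that $T_i(n) \le n/2$ and at most $n/4$ rounds are clipped, Eq. (17.8) yields
\[
\hat R_n \ge \Delta\Bigl(n - \frac{n}{2} - \frac{n}{4}\Bigr) = \frac{n\Delta}{4}\,,
\]
and by Claim 17.7 and a union bound this event has $Q_i$-probability at least $2\delta - \delta = \delta$.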

17.3 Notes


1 The adversarial bandits used in Section 17.2 have the interesting property that
the same arm has the best reward in every round (not just the best mean).
This cannot be exploited by an algorithm, however, because it only observes a
single reward in each round.
2 In Theorem 17.4 we did not make any assumptions on the algorithm. If we
had assumed the algorithm enjoyed an expected regret bound of $R_n \le B\sqrt{kn}$,
then we could conclude that for each sufficiently small $\delta \in (0,1)$ there exists
an adversarial bandit such that

\[
\mathbb{P}\Bigl(\hat R_n \ge \frac{c}{B}\sqrt{kn}\,\log\Bigl(\frac{1}{2\delta}\Bigr)\Bigr) \ge \delta\,,
\]

which shows that our high probability upper bounds for Exp3-IX are nearly
tight.
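
To read the trade-off (an interpretation rather than a statement from the text): for an algorithm whose expected regret is minimax optimal up to a constant factor, so that $B = O(1)$, the display above asserts that with probability at least $\delta$,
\[
\hat R_n \ge c'\sqrt{kn}\,\log\Bigl(\frac{1}{2\delta}\Bigr)
\]
for some constant $c' > 0$, so the linear dependence on $\log(1/\delta)$ in the high-probability bound for Exp3-IX cannot in general be improved to $\sqrt{\log(1/\delta)}$ without sacrificing expected regret.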

17.4 Bibliographic remarks


The results in this chapter are by Gerchinovitz and Lattimore [2016], who also
provide lower bounds on what is achievable when the loss matrix exhibits nice
structure such as low variance or similarity between losses of the arms.