Bandit Algorithms


straightforward, but for adversarial bandits there is an additional step. Notice
that under $Q_i$ it holds that $X_{ti} - X_{tA_t} \ge 0$, and that if $X_{ti}, X_{tA_t} \in (0,1)$ and
$A_t \ne i$, then $X_{ti} - X_{tA_t} \ge \Delta$. From this we conclude that

\[
\hat R_n \ge \Delta\Bigl(n - T_i(n) - \sum_{t=1}^n \mathbb{I}\{\exists\, j \in [k] : X_{tj} \in \{0,1\}\}\Bigr). \tag{17.8}
\]
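
One way to make the step explicit (a sketch using only the two observations above and the usual definitions, with $T_i(n)$ denoting the number of rounds in which arm $i$ is played):
\[
\begin{aligned}
\hat R_n
  &\ge \sum_{t=1}^n \bigl(X_{ti} - X_{tA_t}\bigr)
   \ge \sum_{t=1}^n \bigl(X_{ti} - X_{tA_t}\bigr)\,
       \mathbb{I}\{A_t \ne i \text{ and } X_{tj} \notin \{0,1\} \text{ for all } j\} \\
  &\ge \Delta \sum_{t=1}^n
       \mathbb{I}\{A_t \ne i \text{ and } X_{tj} \notin \{0,1\} \text{ for all } j\}
   \ge \Delta\Bigl(n - T_i(n) - \sum_{t=1}^n \mathbb{I}\{\exists\, j \in [k] : X_{tj} \in \{0,1\}\}\Bigr),
\end{aligned}
\]
where the second inequality uses that every dropped term is nonnegative under $Q_i$, the third uses the gap of at least $\Delta$ on unclipped rounds with $A_t \ne i$, and the last simply counts the discarded rounds.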


The following claim upper bounds the number of rounds in which clipping occurs
with high probability.

Claim 17.7. If $\sigma = 1/10$ and $\Delta < 1/8$ and $n \ge 32 \log(1/\delta)$, then

\[
\mathbb{P}_{Q_i}\Bigl(\sum_{t=1}^n \mathbb{I}\{\exists\, j \in [k] : X_{tj} \in \{0,1\}\} \ge \frac{n}{4}\Bigr) \le \delta.
\]
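
The proof is not reproduced here, but a heuristic calculation suggests where the constants come from. Suppose, as an assumption for this sketch only, that the clipping events are independent across rounds under $Q_i$ and that the per-round clipping probability $p = \mathbb{P}_{Q_i}(\exists\, j \in [k] : X_{tj} \in \{0,1\})$ is at most $1/8$ (plausible when the unclipped rewards are Gaussian with standard deviation $\sigma = 1/10$ and means bounded away from $0$ and $1$). Hoeffding's inequality then gives
\[
\mathbb{P}_{Q_i}\Bigl(\sum_{t=1}^n \mathbb{I}\{\exists\, j \in [k] : X_{tj} \in \{0,1\}\} \ge \frac{n}{4}\Bigr)
\le \exp\bigl(-2n(1/4 - p)^2\bigr)
\le \exp(-n/32) \le \delta
\]
whenever $n \ge 32 \log(1/\delta)$, which matches the condition in the claim.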

Combining Claim 17.6 and Claim 17.7 with Eq. (17.8) shows there exists an
arm $i$ such that

\[
\mathbb{P}_{Q_i}\Bigl(\hat R_n \ge \frac{n\Delta}{4}\Bigr) \ge \delta\,,
\]

which by the definition of $\Delta$ and Claim 17.5 implies Theorem 17.4.
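
The arithmetic behind the threshold $n\Delta/4$ is short. Suppose (this is an assumption about the statement of Claim 17.6, which is not reproduced here) that it supplies an arm $i$ with $\mathbb{P}_{Q_i}(T_i(n) \le n/2) \ge 2\delta$. On the event that $T_i(n) \le n/2$ and at most $n/4$ rounds are clipped, Eq. (17.8) yields
\[
\hat R_n \ge \Delta\Bigl(n - \frac{n}{2} - \frac{n}{4}\Bigr) = \frac{n\Delta}{4}\,,
\]
and by Claim 17.7 and a union bound this event has $Q_i$-probability at least $2\delta - \delta = \delta$.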

17.3 Notes


1 The adversarial bandits used in Section 17.2 have the interesting property that
the same arm has the best reward in every round (not just the best mean).
This cannot be exploited by an algorithm, however, because it only observes a
single reward in each round.
2 In Theorem 17.4 we did not make any assumptions on the algorithm. If we
had assumed the algorithm enjoyed an expected regret bound of $R_n \le B\sqrt{kn}$,
then we could conclude that for each sufficiently small $\delta \in (0,1)$ there exists
an adversarial bandit such that

\[
\mathbb{P}\Bigl(\hat R_n \ge \frac{c}{B}\sqrt{kn}\,\log\Bigl(\frac{1}{2\delta}\Bigr)\Bigr) \ge \delta\,,
\]

which shows that our high probability upper bounds for Exp3-IX are nearly
tight.
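
To read the trade-off (an interpretation rather than a statement from the text): for an algorithm whose expected regret is minimax optimal up to a constant factor, so that $B = O(1)$, the display above asserts that with probability at least $\delta$,
\[
\hat R_n \ge c'\sqrt{kn}\,\log\Bigl(\frac{1}{2\delta}\Bigr)
\]
for some constant $c' > 0$, so the linear dependence on $\log(1/\delta)$ in the high-probability bound for Exp3-IX cannot in general be improved to $\sqrt{\log(1/\delta)}$ without sacrificing expected regret.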

17.4 Bibliographic remarks


The results in this chapter are by Gerchinovitz and Lattimore [2016], who also
provide lower bounds on what is achievable when the loss matrix exhibits nice
structure such as low variance or similarity between losses of the arms.