Bandit Algorithms

37.4 Lower bounds

where $C_u$ is a suitably large constant. We note that $u \in C_a \cap C_b$ is not on the
boundary of $\mathcal P_{d-1}$, so $u_i > 0$ for all $i$. Therefore
\[
D(\mathbb P_{u_a}, \mathbb P_{u_b}) \le C_u \sum_{c \in [k]} \mathbb E[T_c(n)] \, \Delta^2 \,. \tag{37.9}
\]

Step 3: Comparing the regret
By Eq. (37.5) and Hölder's inequality, for $c \notin N_{ab}$ we have
\[
\langle \ell_c - \ell_a, u_a \rangle \ge \varepsilon - \langle \ell_c - \ell_a, \Delta q \rangle \ge \varepsilon - \Delta \|q\|_1
\quad \text{and} \quad
\langle \ell_c - \ell_b, u_b \rangle \ge \varepsilon - \Delta \|q\|_1 \,,
\]
where, for simplicity and without loss of generality, we assumed that the losses lie in $[0,1]$. Define
$\tilde T(n)$ to be the number of times an arm not in $N_{ab}$ is played:
\[
\tilde T(n) = \sum_{c \notin N_{ab}} T_c(n) \,.
\]
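The Hölder step can be spelled out: since the losses lie in $[0,1]$, each coordinate of $\ell_c - \ell_a$ lies in $[-1,1]$, so

```latex
\langle \ell_c - \ell_a, \Delta q \rangle
  \le \Delta \, \| \ell_c - \ell_a \|_\infty \, \| q \|_1
  \le \Delta \, \| q \|_1 ,
```

which is exactly the bound applied twice above.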

By Lemma 37.7, for each action $c \in N_{ab}$ there exists an $\alpha \in [0,1]$ such that
$\ell_c = \alpha \ell_a + (1-\alpha) \ell_b$. Therefore, by Eq. (37.7),
\[
\langle \ell_c - \ell_a, u_a \rangle + \langle \ell_c - \ell_b, u_b \rangle
= (1-\alpha) \langle \ell_b - \ell_a, u_a \rangle + \alpha \langle \ell_a - \ell_b, u_b \rangle
= \Delta \,, \tag{37.10}
\]
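To see the chain in Eq. (37.10), substitute the convex combination from Lemma 37.7 into both inner products:

```latex
\ell_c - \ell_a = (1 - \alpha)(\ell_b - \ell_a), \qquad
\ell_c - \ell_b = \alpha (\ell_a - \ell_b),
```

so the left-hand side is a convex combination of $\langle \ell_b - \ell_a, u_a \rangle$ and $\langle \ell_a - \ell_b, u_b \rangle$. Eq. (37.7), which is not shown in this excerpt, supplies the final equality: it makes both of these inner products equal to $\Delta$, and a convex combination of two quantities equal to $\Delta$ is $\Delta$ itself.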

which means that $\max(\langle \ell_c - \ell_a, u_a \rangle, \langle \ell_c - \ell_b, u_b \rangle) \ge \Delta/2$. Define $\bar T(n)$ as the
number of times an arm in $N_{ab}$ is played that is at least $\Delta/2$ suboptimal under $u_a$:
\[
\bar T(n) = \sum_{c \in N_{ab}} \mathbb I\left\{ \langle \ell_c - \ell_a, u_a \rangle \ge \frac{\Delta}{2} \right\} T_c(n) \,.
\]

It also follows from Eq. (37.10) that if $c \in N_{ab}$ and $\langle \ell_c - \ell_a, u_a \rangle < \Delta/2$, then
$\langle \ell_c - \ell_b, u_b \rangle \ge \Delta/2$. Hence, under $u_b$ the random pseudo-regret,
\[
\sum_c T_c(n) \langle \ell_c - \ell_b, u_b \rangle \,,
\]
is at least $(n - \bar T(n)) \Delta / 2$. Assume that $\Delta$ is chosen sufficiently small so that
$\Delta \|q\|_1 \le \varepsilon/2$. By the above,


\begin{align*}
R_n(\pi, u_a) + R_n(\pi, u_b)
&= \mathbb E_{u_a}\left[ \sum_{c \in [k]} T_c(n) \langle \ell_c - \ell_a, u_a \rangle \right]
 + \mathbb E_{u_b}\left[ \sum_{c \in [k]} T_c(n) \langle \ell_c - \ell_b, u_b \rangle \right] \\
&\ge \frac{\varepsilon}{2} \, \mathbb E_{u_a}\big[ \tilde T(n) \big]
 + \frac{n \Delta}{4} \left( \mathbb P_{u_a}\big( \bar T(n) \ge n/2 \big) + \mathbb P_{u_b}\big( \bar T(n) < n/2 \big) \right) \\
&\ge \frac{\varepsilon}{2} \, \mathbb E_{u_a}\big[ \tilde T(n) \big]
 + \frac{n \Delta}{8} \exp\big( - D(\mathbb P_{u_a}, \mathbb P_{u_b}) \big) \\
&\ge \frac{\varepsilon}{2} \, \mathbb E_{u_a}\big[ \tilde T(n) \big]
 + \frac{n \Delta}{8} \exp\big( - C_u \Delta^2 \, \mathbb E_{u_a}\big[ \tilde T(n) \big] \big) \,,
\end{align*}


where the second inequality follows from Theorem 14.2 and the third from
Eqs. (37.8) and (37.9). The bound is completed by choosing $\Delta = \varepsilon / (2 \|q\|_1 n^{1/3})$
(which is finite since $q \ne 0$) and straightforward optimization (Exercise 37.5).
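As a sketch of that final optimization (the details are Exercise 37.5), write $x = \mathbb E_{u_a}[\tilde T(n)]$, so the display above lower-bounds the sum of regrets by $f(x) = \frac{\varepsilon}{2} x + \frac{n\Delta}{8} e^{-C_u \Delta^2 x}$. With $\Delta = \varepsilon/(2\|q\|_1 n^{1/3})$, a case split on $x$ shows $f(x) = \Omega(n^{2/3})$ no matter how the policy allocates plays:

```latex
x \ge n^{2/3}:\quad
  \tfrac{\varepsilon}{2} x \ge \tfrac{\varepsilon}{2}\, n^{2/3} ;
\qquad
x < n^{2/3}:\quad
  C_u \Delta^2 x \le \tfrac{C_u \varepsilon^2}{4 \|q\|_1^2}
  \;\Rightarrow\;
  \tfrac{n\Delta}{8}\, e^{-C_u \Delta^2 x}
  \ge \tfrac{\varepsilon\, n^{2/3}}{16 \|q\|_1}
      \exp\!\Big( {-\tfrac{C_u \varepsilon^2}{4 \|q\|_1^2}} \Big).
```

In either case the bound is at least a constant (depending only on $\varepsilon$, $C_u$ and $\|q\|_1$) times $n^{2/3}$.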


We leave the following theorems as exercises for the reader (Exercises 37.6
and 37.7).
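The $n^{2/3}$ rate in the proof above can also be checked numerically. The sketch below uses illustrative constants $\varepsilon = \|q\|_1 = C_u = 1$ (these values are my choices, not from the text): it minimizes the final lower bound $\frac{\varepsilon}{2} x + \frac{n\Delta}{8} e^{-C_u \Delta^2 x}$ over a grid of $x \ge 0$ and confirms that the minimum grows proportionally to $n^{2/3}$.

```python
import numpy as np

def min_regret_bound(n, eps=1.0, q_norm=1.0, c_u=1.0):
    """Minimize f(x) = (eps/2) * x + (n * delta / 8) * exp(-c_u * delta**2 * x)
    over a grid of x >= 0, where x stands for E[T~(n)] and
    delta = eps / (2 * q_norm * n**(1/3)) as chosen in the proof.
    The constants eps, q_norm, c_u are illustrative placeholders."""
    delta = eps / (2.0 * q_norm * n ** (1.0 / 3.0))
    x = np.linspace(0.0, 10.0 * n ** (2.0 / 3.0), 100_001)  # grid includes x = 0
    f = (eps / 2.0) * x + (n * delta / 8.0) * np.exp(-c_u * delta ** 2 * x)
    return float(f.min())

# The ratio bound / n^{2/3} stays constant as n grows, confirming the rate.
for n in (10 ** 4, 10 ** 6, 10 ** 8):
    print(n, min_regret_bound(n) / n ** (2.0 / 3.0))
```

With these particular constants the minimum sits at $x = 0$ and the ratio is $1/16$ for every $n$; other constants move the constant factor but not the $n^{2/3}$ scaling.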