Bandit Algorithms

37.4 Lower bounds

where $C_u$ is a suitably large constant. We note that $u \in C_a \cap C_b$ is not on the
boundary of $\mathcal P_{d-1}$, so $u_i > 0$ for all $i$. Therefore
\[
D(\mathbb P_{u_a}, \mathbb P_{u_b}) \le C_u \sum_{c \in [k]} \mathbb E[T_c(n)] \, \Delta^2 \,. \tag{37.9}
\]

Step 3: Comparing the regret
By Eq. (37.5) and Hölder's inequality, for $c \notin N_{ab}$ we have
\[
\langle \ell_c - \ell_a, u_a \rangle \ge \varepsilon - \langle \ell_c - \ell_a, \Delta q \rangle \ge \varepsilon - \Delta \|q\|_1
\quad \text{and} \quad
\langle \ell_c - \ell_b, u_b \rangle \ge \varepsilon - \Delta \|q\|_1 \,,
\]
where, for simplicity and without loss of generality, we assumed that the losses lie in $[0,1]$. Define
$\tilde T(n)$ to be the number of times an arm not in $N_{ab}$ is played:
\[
\tilde T(n) = \sum_{c \notin N_{ab}} T_c(n) \,.
\]
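The Hölder step can be spelled out: since the losses lie in $[0,1]$, each coordinate of $\ell_c - \ell_a$ lies in $[-1,1]$, so

```latex
\langle \ell_c - \ell_a, \Delta q \rangle
  \le \Delta \, \| \ell_c - \ell_a \|_\infty \, \| q \|_1
  \le \Delta \, \| q \|_1 ,
```

which is exactly the bound applied twice above.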

By Lemma 37.7, for each action $c \in N_{ab}$ there exists an $\alpha \in [0,1]$ such that
$\ell_c = \alpha \ell_a + (1-\alpha) \ell_b$. Therefore, by Eq. (37.7),
\[
\langle \ell_c - \ell_a, u_a \rangle + \langle \ell_c - \ell_b, u_b \rangle
= (1-\alpha) \langle \ell_b - \ell_a, u_a \rangle + \alpha \langle \ell_a - \ell_b, u_b \rangle
= \Delta \,, \tag{37.10}
\]
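To see the chain in Eq. (37.10), substitute the convex combination from Lemma 37.7 into both inner products:

```latex
\ell_c - \ell_a = (1 - \alpha)(\ell_b - \ell_a), \qquad
\ell_c - \ell_b = \alpha (\ell_a - \ell_b),
```

so the left-hand side is a convex combination of $\langle \ell_b - \ell_a, u_a \rangle$ and $\langle \ell_a - \ell_b, u_b \rangle$. Eq. (37.7), which is not shown in this excerpt, supplies the final equality: it makes both of these inner products equal to $\Delta$, and a convex combination of two quantities equal to $\Delta$ is $\Delta$ itself.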

which means that $\max(\langle \ell_c - \ell_a, u_a \rangle, \langle \ell_c - \ell_b, u_b \rangle) \ge \Delta/2$. Define $\bar T(n)$ as the
number of times an arm in $N_{ab}$ is played that is at least $\Delta/2$ suboptimal under $u_a$:
\[
\bar T(n) = \sum_{c \in N_{ab}} \mathbb I\left\{ \langle \ell_c - \ell_a, u_a \rangle \ge \frac{\Delta}{2} \right\} T_c(n) \,.
\]

It also follows from Eq. (37.10) that if $c \in N_{ab}$ and $\langle \ell_c - \ell_a, u_a \rangle < \Delta/2$, then
$\langle \ell_c - \ell_b, u_b \rangle \ge \Delta/2$. Hence, under $u_b$ the random pseudo-regret,
\[
\sum_c T_c(n) \langle \ell_c - \ell_b, u_b \rangle \,,
\]
is at least $(n - \bar T(n)) \Delta / 2$. Assume that $\Delta$ is chosen sufficiently small so that
$\Delta \|q\|_1 \le \varepsilon/2$. By the above,


\begin{align*}
R_n(\pi, u_a) + R_n(\pi, u_b)
&= \mathbb E_{u_a}\left[ \sum_{c \in [k]} T_c(n) \langle \ell_c - \ell_a, u_a \rangle \right]
 + \mathbb E_{u_b}\left[ \sum_{c \in [k]} T_c(n) \langle \ell_c - \ell_b, u_b \rangle \right] \\
&\ge \frac{\varepsilon}{2} \, \mathbb E_{u_a}\big[ \tilde T(n) \big]
 + \frac{n \Delta}{4} \left( \mathbb P_{u_a}\big( \bar T(n) \ge n/2 \big) + \mathbb P_{u_b}\big( \bar T(n) < n/2 \big) \right) \\
&\ge \frac{\varepsilon}{2} \, \mathbb E_{u_a}\big[ \tilde T(n) \big]
 + \frac{n \Delta}{8} \exp\big( - D(\mathbb P_{u_a}, \mathbb P_{u_b}) \big) \\
&\ge \frac{\varepsilon}{2} \, \mathbb E_{u_a}\big[ \tilde T(n) \big]
 + \frac{n \Delta}{8} \exp\big( - C_u \Delta^2 \, \mathbb E_{u_a}\big[ \tilde T(n) \big] \big) \,,
\end{align*}


where the second inequality follows from Theorem 14.2 and the third from
Eqs. (37.8) and (37.9). The bound is completed by choosing $\Delta = \varepsilon / (2 \|q\|_1 n^{1/3})$
(which is finite since $q \ne 0$) and straightforward optimization (Exercise 37.5).
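As a sketch of that final optimization (the details are Exercise 37.5), write $x = \mathbb E_{u_a}[\tilde T(n)]$, so the display above lower-bounds the sum of regrets by $f(x) = \frac{\varepsilon}{2} x + \frac{n\Delta}{8} e^{-C_u \Delta^2 x}$. With $\Delta = \varepsilon/(2\|q\|_1 n^{1/3})$, a case split on $x$ shows $f(x) = \Omega(n^{2/3})$ no matter how the policy allocates plays:

```latex
x \ge n^{2/3}:\quad
  \tfrac{\varepsilon}{2} x \ge \tfrac{\varepsilon}{2}\, n^{2/3} ;
\qquad
x < n^{2/3}:\quad
  C_u \Delta^2 x \le \tfrac{C_u \varepsilon^2}{4 \|q\|_1^2}
  \;\Rightarrow\;
  \tfrac{n\Delta}{8}\, e^{-C_u \Delta^2 x}
  \ge \tfrac{\varepsilon\, n^{2/3}}{16 \|q\|_1}
      \exp\!\Big( {-\tfrac{C_u \varepsilon^2}{4 \|q\|_1^2}} \Big).
```

In either case the bound is at least a constant (depending only on $\varepsilon$, $C_u$ and $\|q\|_1$) times $n^{2/3}$.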


We leave the following theorems as exercises for the reader (Exercises 37.6
and 37.7).
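The $n^{2/3}$ rate in the proof above can also be checked numerically. The sketch below uses illustrative constants $\varepsilon = \|q\|_1 = C_u = 1$ (these values are my choices, not from the text): it minimizes the final lower bound $\frac{\varepsilon}{2} x + \frac{n\Delta}{8} e^{-C_u \Delta^2 x}$ over a grid of $x \ge 0$ and confirms that the minimum grows proportionally to $n^{2/3}$.

```python
import numpy as np

def min_regret_bound(n, eps=1.0, q_norm=1.0, c_u=1.0):
    """Minimize f(x) = (eps/2) * x + (n * delta / 8) * exp(-c_u * delta**2 * x)
    over a grid of x >= 0, where x stands for E[T~(n)] and
    delta = eps / (2 * q_norm * n**(1/3)) as chosen in the proof.
    The constants eps, q_norm, c_u are illustrative placeholders."""
    delta = eps / (2.0 * q_norm * n ** (1.0 / 3.0))
    x = np.linspace(0.0, 10.0 * n ** (2.0 / 3.0), 100_001)  # grid includes x = 0
    f = (eps / 2.0) * x + (n * delta / 8.0) * np.exp(-c_u * delta ** 2 * x)
    return float(f.min())

# The ratio bound / n^{2/3} stays constant as n grows, confirming the rate.
for n in (10 ** 4, 10 ** 6, 10 ** 8):
    print(n, min_regret_bound(n) / n ** (2.0 / 3.0))
```

With these particular constants the minimum sits at $x = 0$ and the ratio is $1/16$ for every $n$; other constants move the constant factor but not the $n^{2/3}$ scaling.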