37.4 Lower bounds 459whereCuis a suitably large constant. We note thatu∈Ca∩Cbis not on the
boundary ofPd− 1 , soui>0 for alli. Therefore
D(Pua,Pub)≤Cu∑
c∈[k]E[Tc(n)]∆^2. (37.9)Step 3: Comparing the regret
By Eq. (37.5) and H ̈older’s inequality, forc /∈ Nabwe have〈`c−`a,ua〉 ≥
ε−〈`c−`a,∆q〉≥ε−∆‖q‖ 1 and〈`c−`b,ub〉≥ε−∆‖q‖ 1 , where, for simplicity,
and without the loss of generality, we assumed that the losses lie in [0,1]. Define
T ̃(n) to be the number of times an arm not inNabis played:T ̃(n) =∑
c/∈NabTc(n).By Lemma 37.7, for each actionc∈ Nabthere exists anα∈[0,1] such that
`c=α`a+ (1−α)`b. Therefore by Eq. (37.7),〈`c−`a,ua〉+〈`c−`b,ub〉= (1−α)〈`b−`a,ua〉+α〈`a−`b,ub〉= ∆,
(37.10)which means thatmax(〈c−a,ua〉,〈c−b,ub〉)≥∆/2. DefineT ̄(n) as the
number of times an arm inNabis played that is at least ∆/2 suboptimal inua:
T ̄(n) =∑
c∈NabI
{
〈`c−`a,ua〉≥∆
2
}
Tc(n).It also follows from(37.10) that if c ∈ Nab and 〈`c−`a,ua〉 < ∆ 2 then
〈`c−`b,ub〉≥∆ 2. Hence, underubthe random pseudo-regret,∑
cTc(n)〈c−b,ub〉,
is at least (n−T ̄(n))∆/2. Assume that ∆ is chosen sufficiently small so that
∆‖q‖ 1 ≤ε/2. By the above,
Rn(π,ua) +Rn(π,ub)=Eua
∑
c∈[k]Tc(n)〈`c−`a,ua〉
+Eub
∑
c∈[k]Tc(n)〈`c−`b,ub〉
≥
ε
2Eua[ ̃
T(n)]
+
n∆
4(
Pua(T ̄(n)≥n/2) +Pub(T ̄(n)< n/2))
≥
ε
2Eua[ ̃
T(n)]
+
n∆
8exp (−D(Pua,Pub))≥ε
2 Eua[ ̃
T(n)]
+
n∆
8 exp(
−Cu∆^2 Eua[ ̃
T(n)])
,
where the second inequality follows from Theorem 14.2 and the third from
Eqs. (37.8) and (37.9). The bound is completed by choosing ∆ =ε/(2‖q‖ 1 n^1 /^3 )
(which is finite sinceq 6 = 0) and straightforward optimization (Exercise 37.5).
We leave the following theorems as exercises for the reader (Exercises 37.6
and 37.7).