37.4 Lower bounds 459
whereCuis a suitably large constant. We note thatu∈Ca∩Cbis not on the
boundary ofPd− 1 , soui>0 for alli. Therefore
D(Pua,Pub)≤Cu
∑
c∈[k]
E[Tc(n)]∆^2. (37.9)
Step 3: Comparing the regret
By Eq. (37.5) and H ̈older’s inequality, forc /∈ Nabwe have〈`c−`a,ua〉 ≥
ε−〈`c−`a,∆q〉≥ε−∆‖q‖ 1 and〈`c−`b,ub〉≥ε−∆‖q‖ 1 , where, for simplicity,
and without the loss of generality, we assumed that the losses lie in [0,1]. Define
T ̃(n) to be the number of times an arm not inNabis played:
T ̃(n) =
∑
c/∈Nab
Tc(n).
By Lemma 37.7, for each actionc∈ Nabthere exists anα∈[0,1] such that
`c=α`a+ (1−α)`b. Therefore by Eq. (37.7),
〈`c−`a,ua〉+〈`c−`b,ub〉= (1−α)〈`b−`a,ua〉+α〈`a−`b,ub〉= ∆,
(37.10)
which means thatmax(〈c−
a,ua〉,〈c−
b,ub〉)≥∆/2. DefineT ̄(n) as the
number of times an arm inNabis played that is at least ∆/2 suboptimal inua:
T ̄(n) =
∑
c∈Nab
I
{
〈`c−`a,ua〉≥
∆
2
}
Tc(n).
It also follows from(37.10) that if c ∈ Nab and 〈`c−`a,ua〉 < ∆ 2 then
〈`c−`b,ub〉≥∆ 2. Hence, underubthe random pseudo-regret,
∑
cTc(n)〈c−
b,ub〉,
is at least (n−T ̄(n))∆/2. Assume that ∆ is chosen sufficiently small so that
∆‖q‖ 1 ≤ε/2. By the above,
Rn(π,ua) +Rn(π,ub)
=Eua
∑
c∈[k]
Tc(n)〈`c−`a,ua〉
+Eub
∑
c∈[k]
Tc(n)〈`c−`b,ub〉
≥
ε
2
Eua
[ ̃
T(n)
]
+
n∆
4
(
Pua(T ̄(n)≥n/2) +Pub(T ̄(n)< n/2)
)
≥
ε
2
Eua
[ ̃
T(n)
]
+
n∆
8
exp (−D(Pua,Pub))
≥
ε
2 Eua
[ ̃
T(n)
]
+
n∆
8 exp
(
−Cu∆^2 Eua
[ ̃
T(n)
])
,
where the second inequality follows from Theorem 14.2 and the third from
Eqs. (37.8) and (37.9). The bound is completed by choosing ∆ =ε/(2‖q‖ 1 n^1 /^3 )
(which is finite sinceq 6 = 0) and straightforward optimization (Exercise 37.5).
We leave the following theorems as exercises for the reader (Exercises 37.6
and 37.7).