17.1 Stochastic bandits 206
and Pinsker’s inequality (Theorem 14.2) and the divergence decomposition
(Lemma 15.1) we have
P
(
R ̄n≥∆n
2
)
+P′
(
R ̄′n≥∆n
2
)
≥P
(
Ti(n)≥
n
2
)
+P′
(
Ti(n)<
n
2
)
≥
1
2
exp (−D(P,P′))≥
1
2
exp
(
− 2 B∆
√
n
k− 1
)
≥ 2 δ ,
where the last line follows by choosing
∆ = min
{
1
2 ,
1
2 B
√
k− 1
n log
(
1
4 δ
)}
.
The result follows since max{a,b}≥(a+b)/2.
Corollary17.2. Letn≥ 1 andk≥ 2. Then for any policyπandδ∈(0,1)
such that
nδ≤
√
n(k−1) log
(
1
4 δ
)
(17.6)
there exists a bandit problemν∈Eksuch that
P
(
R ̄n(π,ν)≥^1
4
min
{
n,
√
n(k−1)
2
log
(
1
4 δ
)})
≥δ. (17.7)
Proof We prove the result by contradiction. Assume that the conclusion does
not hold forπand letδ∈(0,1) satisfy(17.6). Then for any bandit problem
ν∈Ekthe expected regret ofπis bounded by
Rn(π,ν)≤nδ+
√
n(k−1)
2
log
(
1
4 δ
)
≤
√
2 n(k−1) log
(
1
4 δ
)
.
Thereforeπsatisfies the conditions of Theorem 17.1 withB=
√
2 log(1/(4δ)),
which implies that there exists some bandit problemν∈ Eksuch that(17.7)
holds, contradicting our assumption.
Corollary17.3.Letk≥ 2 andp∈(0,1)andB > 0. Then there does not exist
a policyπsuch that for al ln≥ 1 ,δ∈(0,1)andν∈Ek,
P
(
R ̄n(π,ν)≥B√(k−1)nlogp
(
1
δ
))
< δ
Proof We proceed by contradiction. Suppose that such a policy exists. Choosing
δsufficiently small andnsufficiently large ensures that
1
B
log
(
1
4 δ
)
≥Blogp
(
1
δ
)
and
1
B
√
n(k−1) log
(
1
4 δ
)
≤n.