17.1 Stochastic bandits 206and Pinsker’s inequality (Theorem 14.2) and the divergence decomposition
(Lemma 15.1) we have
P
(
R ̄n≥∆n
2)
+P′
(
R ̄′n≥∆n
2)
≥P
(
Ti(n)≥n
2)
+P′
(
Ti(n)<n
2)
≥
1
2
exp (−D(P,P′))≥1
2
exp(
− 2 B∆
√
n
k− 1)
≥ 2 δ ,where the last line follows by choosing
∆ = min{
1
2 ,
1
2 B
√
k− 1
n log(
1
4 δ)}
.
The result follows since max{a,b}≥(a+b)/2.
Corollary17.2. Letn≥ 1 andk≥ 2. Then for any policyπandδ∈(0,1)
such that
nδ≤√
n(k−1) log(
1
4 δ)
(17.6)
there exists a bandit problemν∈Eksuch that
P
(
R ̄n(π,ν)≥^1
4min{
n,√
n(k−1)
2log(
1
4 δ)})
≥δ. (17.7)Proof We prove the result by contradiction. Assume that the conclusion does
not hold forπand letδ∈(0,1) satisfy(17.6). Then for any bandit problem
ν∈Ekthe expected regret ofπis bounded byRn(π,ν)≤nδ+√
n(k−1)
2log(
1
4 δ)
≤
√
2 n(k−1) log(
1
4 δ)
.
Thereforeπsatisfies the conditions of Theorem 17.1 withB=
√
2 log(1/(4δ)),
which implies that there exists some bandit problemν∈ Eksuch that(17.7)
holds, contradicting our assumption.
Corollary17.3.Letk≥ 2 andp∈(0,1)andB > 0. Then there does not exist
a policyπsuch that for al ln≥ 1 ,δ∈(0,1)andν∈Ek,
P
(
R ̄n(π,ν)≥B√(k−1)nlogp(
1
δ))
< δProof We proceed by contradiction. Suppose that such a policy exists. Choosing
δsufficiently small andnsufficiently large ensures that
1
Blog(
1
4 δ)
≥Blogp(
1
δ)
and1
B
√
n(k−1) log(
1
4 δ)
≤n.