On the other hand, if the high-probability bound only holds for a single $\delta$ as in (17.2), then it seems hard to do much better than
\[
R_n \leq n\delta + O\left(\sqrt{kn \log\left(\frac{k}{\delta}\right)}\right),
\]
which with the best choice of $\delta$ leads to a bound of $O(\sqrt{kn\log(n)})$.
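To see how the stated rate arises, one simple (illustrative, not necessarily optimal) choice is $\delta = 1/n$, for which
\[
R_n \leq n\cdot\frac{1}{n} + O\left(\sqrt{kn\log(kn)}\right) = 1 + O\left(\sqrt{kn\log(kn)}\right) = O\left(\sqrt{kn\log(n)}\right),
\]
where the last equality assumes $k \leq n$ so that $\log(kn) \leq 2\log(n)$.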
17.1 Stochastic bandits
For simplicity we start with the stochastic setting before explaining how to
convert the arguments to the adversarial model. There is no randomness in the
expected regret, so in order to derive a high probability bound we define the
random pseudo-regret by
\[
\bar R_n = \sum_{i=1}^k T_i(n)\,\Delta_i\,,
\]
which is a random variable through the pull counts $T_i(n)$.
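Taking expectations and using the regret decomposition of Lemma 4.5, the random pseudo-regret has the expected regret as its mean,
\[
\mathbb E\big[\bar R_n\big] = \sum_{i=1}^k \Delta_i\, \mathbb E[T_i(n)] = R_n\,,
\]
which is the sense in which high-probability statements about $\bar R_n$ complement bounds on $R_n$.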
For all results in this section we let $\mathcal E_k \subset \mathcal E^k_{\mathcal N}$ denote the set of $k$-armed Gaussian bandits with suboptimality gaps bounded by one. For $\mu \in [0,1]^k$ we let $\nu_\mu \in \mathcal E_k$ be the Gaussian bandit with means $\mu$.
Theorem 17.1. Let $n \geq 1$ and $k \geq 2$ and $B > 0$ and $\pi$ be a policy such that for any $\nu \in \mathcal E_k$,
\[
R_n(\pi,\nu) \leq B\sqrt{(k-1)n}\,. \tag{17.4}
\]
Let $\delta \in (0,1)$. Then there exists a bandit $\nu$ in $\mathcal E_k$ such that
\[
\mathbb P\left(\bar R_n(\pi,\nu) \geq \frac{1}{4}\min\left\{n,\, \frac{1}{B}\sqrt{(k-1)n}\,\log\left(\frac{1}{4\delta}\right)\right\}\right) \geq \delta\,.
\]
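To illustrate what the theorem rules out (the substitution $\delta = 1/n$ below is chosen here only to connect with the earlier discussion), suppose (17.4) holds with a constant $B$ not depending on $n$ or $k$. Taking $\delta = 1/n$ and assuming $n$ is large enough that the second term attains the minimum, there is a bandit $\nu \in \mathcal E_k$ with
\[
\mathbb P\left(\bar R_n(\pi,\nu) \geq \frac{1}{4B}\sqrt{(k-1)n}\,\log\left(\frac{n}{4}\right)\right) \geq \frac{1}{n}\,.
\]
In other words, a policy with expected regret $O(\sqrt{kn})$ on every instance in $\mathcal E_k$ must occasionally suffer random pseudo-regret of order $\sqrt{kn}\log(n)$, which is larger than the $O(\sqrt{kn\log(n)})$ rate discussed above.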
Proof Let $\Delta \in (0, 1/2]$ be a constant to be tuned subsequently and $\nu = \nu_\mu$, where the mean vector $\mu \in \mathbb R^k$ is defined by $\mu_1 = \Delta$ and $\mu_i = 0$ for $i > 1$. Abbreviate $R_n = R_n(\pi,\nu)$, $\mathbb P = \mathbb P_{\nu\pi}$ and $\mathbb E = \mathbb E_{\nu\pi}$. Let $i = \operatorname{argmin}_{i > 1} \mathbb E[T_i(n)]$. Then by Lemma 4.5 and the assumption in Eq. (17.4),
\[
\mathbb E[T_i(n)] \leq \frac{R_n}{\Delta(k-1)} \leq \frac{B}{\Delta}\sqrt{\frac{n}{k-1}}\,. \tag{17.5}
\]
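To spell out this step (a short expansion; the only ingredient beyond the assumptions already stated is the regret decomposition of Lemma 4.5): since every suboptimal arm has gap $\Delta$,
\[
R_n = \sum_{j=1}^k \Delta_j\, \mathbb E[T_j(n)] = \Delta \sum_{j>1} \mathbb E[T_j(n)] \geq \Delta(k-1)\, \mathbb E[T_i(n)]\,,
\]
where the inequality uses that $i$ minimises $\mathbb E[T_j(n)]$ over $j > 1$. Rearranging and bounding $R_n$ with (17.4) yields (17.5).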
Define the alternative bandit $\nu' = \nu_{\mu'}$, where $\mu' \in \mathbb R^k$ is equal to $\mu$ except $\mu'_i = \mu_i + 2\Delta$. Abbreviate $\mathbb P' = \mathbb P_{\nu'\pi}$, $\bar R_n = \bar R_n(\pi,\nu)$ and $\bar R_n' = \bar R_n(\pi,\nu')$. By Lemma 4.5