6.1 Algorithm and regret analysis

is written formally as
$$\hat\mu_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^{t} \mathbb{I}\{A_s = i\}\, X_s\,,$$
where $T_i(t) = \sum_{s=1}^{t} \mathbb{I}\{A_s = i\}$ is the number of times action $i$ has been played after round $t$. The explore-then-commit policy is given in Algorithm 1 below.

1: Input $m$.
2: In round $t$ choose action
$$A_t = \begin{cases} (t \bmod k) + 1, & \text{if } t \le mk; \\ \operatorname{argmax}_i \hat\mu_i(mk), & t > mk \end{cases}$$
(ties in the argmax are broken arbitrarily).
Algorithm 1: Explore-then-commit.

Theorem 6.1. When ETC is interacting with any 1-subgaussian bandit and $1 \le m \le n/k$,
$$R_n \le m \sum_{i=1}^{k} \Delta_i + (n - mk) \sum_{i=1}^{k} \Delta_i \exp\left(-\frac{m \Delta_i^2}{4}\right)\,.$$
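To see the theorem in action, the following sketch simulates ETC on a Gaussian bandit with unit variance and compares the empirical regret with the bound above. The arm means, gap, horizon, and choice of $m$ are illustrative assumptions, not values from the text:

```python
import numpy as np

def etc(mu, m, n, rng):
    """One run of explore-then-commit on a Gaussian bandit with unit variance.

    mu -- true mean reward of each arm; m -- exploration pulls per arm;
    n  -- horizon. Returns the (random) regret of the run.
    """
    k = len(mu)
    counts = np.zeros(k)   # T_i(t): number of pulls of each arm so far
    sums = np.zeros(k)     # running reward sum per arm
    reward = 0.0
    for t in range(1, n + 1):
        if t <= m * k:
            a = (t - 1) % k                    # round-robin exploration
        else:
            a = int(np.argmax(sums / counts))  # commit to the empirically best arm
        x = rng.normal(mu[a], 1.0)
        counts[a] += 1
        sums[a] += x
        reward += x
    return n * max(mu) - reward  # regret relative to always playing the best arm

# Hypothetical instance: two arms with gap Delta = 0.5.
rng = np.random.default_rng(0)
mu, n, m = [0.0, -0.5], 1000, 50
runs = [etc(mu, m, n, rng) for _ in range(200)]
bound = m * 0.5 + (n - m * len(mu)) * 0.5 * np.exp(-m * 0.5**2 / 4)
print(np.mean(runs), bound)  # empirical R_n should sit below the theorem's bound
```

The empirical regret is typically well below the bound, since the subgaussian tail inequality used in the proof is not tight for Gaussian noise.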
Proof. Assume without loss of generality that the first arm is optimal, which means that $\mu_1 = \mu^* = \max_i \mu_i$. By the decomposition given in Lemma 4.5 the regret can be written as
$$R_n = \sum_{i=1}^{k} \Delta_i\, \mathbb{E}[T_i(n)]\,. \qquad (6.1)$$
In the first $mk$ rounds the policy is deterministic, choosing each action exactly $m$ times. Subsequently it chooses a single action maximizing the average reward during exploration. Thus,
$$\mathbb{E}[T_i(n)] \le m + (n - mk)\, \mathbb{P}(A_{mk+1} = i) \le m + (n - mk)\, \mathbb{P}\Big(\hat\mu_i(mk) \ge \max_{j \ne i} \hat\mu_j(mk)\Big)\,. \qquad (6.2)$$
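The commit probability appearing in (6.2) can be checked by direct simulation: after the exploration phase each empirical mean is Gaussian with variance $1/m$, so sampling them and counting how often the suboptimal arm wins gives an estimate to compare against the tail bound $\exp(-m\Delta_i^2/4)$ derived below. The gap and $m$ here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, -0.5])  # arm 1 optimal, gap Delta_2 = 0.5 (assumed values)
m, trials = 50, 20000

# Empirical means after exploration: mu_hat_i ~ N(mu_i, 1/m) for Gaussian rewards.
mu_hat = rng.normal(mu, np.sqrt(1.0 / m), size=(trials, 2))

# Fraction of trials where ETC commits to the suboptimal arm 2.
p_commit_wrong = np.mean(np.argmax(mu_hat, axis=1) == 1)

tail_bound = np.exp(-m * 0.5**2 / 4)  # exp(-m Delta^2 / 4) from Theorem 6.1
print(p_commit_wrong, tail_bound)     # the estimate should lie below the bound
```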
The probability on the right-hand side is bounded by
$$\mathbb{P}\Big(\hat\mu_i(mk) \ge \max_{j \ne i} \hat\mu_j(mk)\Big) \le \mathbb{P}\big(\hat\mu_i(mk) \ge \hat\mu_1(mk)\big) = \mathbb{P}\big(\hat\mu_i(mk) - \mu_i - (\hat\mu_1(mk) - \mu_1) \ge \Delta_i\big)\,.$$
The next step is to check that $\hat\mu_i(mk) - \mu_i - (\hat\mu_1(mk) - \mu_1)$ is $\sqrt{2/m}$-subgaussian, which by the properties of subgaussian random variables follows from the