Bandit Algorithms

6.1 Algorithm and regret analysis

The empirical mean reward of action $i$ after round $t$ is written formally as

$$\hat{\mu}_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^{t} \mathbb{I}\{A_s = i\}\, X_s\,,$$

where $T_i(t) = \sum_{s=1}^{t} \mathbb{I}\{A_s = i\}$ is the number of times action $i$ has been played after round $t$. The explore-then-commit policy is given in Algorithm 1 below.

1: Input $m$.
2: In round $t$ choose action
$$A_t = \begin{cases} (t \bmod k) + 1 & \text{if } t \le mk; \\ \operatorname{argmax}_i \hat{\mu}_i(mk) & \text{if } t > mk \end{cases}$$
(ties in the argmax are broken arbitrarily).

Algorithm 1: Explore-then-commit.
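For concreteness, here is a minimal Python sketch of Algorithm 1 run on a simulated Gaussian bandit (unit-variance Gaussian noise is 1-subgaussian, matching the assumption of Theorem 6.1 below). The environment model, the function name `etc`, and the instance at the end are illustrative assumptions, not part of the original text; the sketch also tracks $T_i(t)$ and $\hat{\mu}_i(t)$ exactly as defined above.

```python
import numpy as np

def etc(mu, m, n, rng=None):
    """Explore-then-commit on a simulated Gaussian bandit (illustrative sketch).

    mu  -- true mean reward of each arm (unit-variance Gaussian noise,
           hence 1-subgaussian)
    m   -- number of exploration rounds per arm
    n   -- horizon
    Returns the realized pseudo-regret sum_i Delta_i * T_i(n), cf. (6.1).
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(mu)
    totals = np.zeros(k)   # running sum of rewards per arm
    counts = np.zeros(k)   # T_i(t): number of plays per arm

    for t in range(1, n + 1):
        if t <= m * k:
            a = t % k                            # round-robin exploration;
                                                 # 0-indexed form of (t mod k) + 1
        else:
            a = int(np.argmax(totals / counts))  # commit to the empirically best arm
        x = rng.normal(mu[a], 1.0)               # 1-subgaussian reward
        totals[a] += x
        counts[a] += 1

    gaps = np.max(mu) - np.asarray(mu)           # suboptimality gaps Delta_i
    return float(np.dot(gaps, counts))

# Hypothetical two-armed instance with gap 0.1: average 50 runs to
# estimate the expected regret.
print(np.mean([etc([0.0, -0.1], m=100, n=1000) for _ in range(50)]))
```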

Theorem 6.1. When ETC is interacting with any 1-subgaussian bandit and $1 \le m \le n/k$,

$$R_n \le m \sum_{i=1}^{k} \Delta_i + (n - mk) \sum_{i=1}^{k} \Delta_i \exp\left(-\frac{m \Delta_i^2}{4}\right).$$
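The bound makes the exploration trade-off explicit: the first term grows linearly in $m$ while the second decays exponentially, so the bound is minimized at an intermediate $m$. The following sketch evaluates the right-hand side over a range of $m$ for a hypothetical two-armed instance; the numbers and the function name are illustrative assumptions only.

```python
import numpy as np

def etc_regret_bound(gaps, m, n):
    """Right-hand side of Theorem 6.1 for gap vector `gaps`,
    m plays per arm during exploration, and horizon n."""
    gaps = np.asarray(gaps, dtype=float)
    k = len(gaps)
    return m * gaps.sum() + (n - m * k) * np.sum(gaps * np.exp(-m * gaps**2 / 4))

# Hypothetical instance: k = 2 arms, gap 0.5, horizon n = 1000.
n, gaps = 1000, [0.0, 0.5]
best_m = min(range(1, n // 2 + 1), key=lambda m: etc_regret_bound(gaps, m, n))
print(best_m, etc_regret_bound(gaps, best_m, n))
```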


Proof Assume without loss of generality that the first arm is optimal, which means that $\mu_1 = \mu^* = \max_i \mu_i$. By the decomposition given in Lemma 4.5 the regret can be written as

$$R_n = \sum_{i=1}^{k} \Delta_i\, \mathbb{E}[T_i(n)]. \tag{6.1}$$

In the first $mk$ rounds the policy is deterministic, choosing each action exactly $m$ times. Subsequently it chooses a single action maximizing the average reward during exploration. Thus,

$$\begin{aligned}
\mathbb{E}[T_i(n)] &\le m + (n - mk)\, \mathbb{P}(A_{mk+1} = i) \\
&\le m + (n - mk)\, \mathbb{P}\left(\hat{\mu}_i(mk) \ge \max_{j \ne i} \hat{\mu}_j(mk)\right). \tag{6.2}
\end{aligned}$$


The probability on the right-hand side is bounded by

$$\begin{aligned}
\mathbb{P}\left(\hat{\mu}_i(mk) \ge \max_{j \ne i} \hat{\mu}_j(mk)\right) &\le \mathbb{P}\left(\hat{\mu}_i(mk) \ge \hat{\mu}_1(mk)\right) \\
&= \mathbb{P}\left(\hat{\mu}_i(mk) - \mu_i - (\hat{\mu}_1(mk) - \mu_1) \ge \Delta_i\right).
\end{aligned}$$

The next step is to check that $\hat{\mu}_i(mk) - \mu_i - (\hat{\mu}_1(mk) - \mu_1)$ is $\sqrt{2/m}$-subgaussian, which by the properties of subgaussian random variables follows from the fact that $\hat{\mu}_i(mk) - \mu_i$ and $\hat{\mu}_1(mk) - \mu_1$ are independent and each $\sqrt{1/m}$-subgaussian, being averages of $m$ independent 1-subgaussian rewards.
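For completeness, the remaining step is a sketch using the standard subgaussian tail bound $\mathbb{P}(X \ge \varepsilon) \le \exp(-\varepsilon^2 / (2\sigma^2))$ with $\sigma^2 = 2/m$:

$$\mathbb{P}\left(\hat{\mu}_i(mk) - \mu_i - (\hat{\mu}_1(mk) - \mu_1) \ge \Delta_i\right) \le \exp\left(-\frac{\Delta_i^2}{2 \cdot (2/m)}\right) = \exp\left(-\frac{m \Delta_i^2}{4}\right),$$

which, substituted into (6.2) and then (6.1), yields the bound of Theorem 6.1.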
