6.1 Algorithm and regret analysis

is written formally as
$$\hat\mu_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^{t} \mathbb{I}\{A_s = i\}\, X_s\,,$$
where $T_i(t) = \sum_{s=1}^{t} \mathbb{I}\{A_s = i\}$ is the number of times action $i$ has been played after round $t$. The explore-then-commit policy is given in Algorithm 1 below.

1: Input $m$.
2: In round $t$ choose action
$$A_t = \begin{cases} (t \bmod k) + 1, & \text{if } t \le mk; \\ \operatorname{argmax}_i \hat\mu_i(mk), & t > mk \end{cases}$$
(ties in the argmax are broken arbitrarily).
Algorithm 1: Explore-then-commit.

Theorem 6.1. When ETC is interacting with any 1-subgaussian bandit and $1 \le m \le n/k$,
$$R_n \le m \sum_{i=1}^{k} \Delta_i + (n - mk) \sum_{i=1}^{k} \Delta_i \exp\left(-\frac{m \Delta_i^2}{4}\right)\,.$$
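To see the theorem in action, the following sketch simulates ETC on a Gaussian bandit with unit variance and compares the empirical regret with the bound above. The arm means, gap, horizon, and choice of $m$ are illustrative assumptions, not values from the text:

```python
import numpy as np

def etc(mu, m, n, rng):
    """One run of explore-then-commit on a Gaussian bandit with unit variance.

    mu -- true mean reward of each arm; m -- exploration pulls per arm;
    n  -- horizon. Returns the (random) regret of the run.
    """
    k = len(mu)
    counts = np.zeros(k)   # T_i(t): number of pulls of each arm so far
    sums = np.zeros(k)     # running reward sum per arm
    reward = 0.0
    for t in range(1, n + 1):
        if t <= m * k:
            a = (t - 1) % k                    # round-robin exploration
        else:
            a = int(np.argmax(sums / counts))  # commit to the empirically best arm
        x = rng.normal(mu[a], 1.0)
        counts[a] += 1
        sums[a] += x
        reward += x
    return n * max(mu) - reward  # regret relative to always playing the best arm

# Hypothetical instance: two arms with gap Delta = 0.5.
rng = np.random.default_rng(0)
mu, n, m = [0.0, -0.5], 1000, 50
runs = [etc(mu, m, n, rng) for _ in range(200)]
bound = m * 0.5 + (n - m * len(mu)) * 0.5 * np.exp(-m * 0.5**2 / 4)
print(np.mean(runs), bound)  # empirical R_n should sit below the theorem's bound
```

The empirical regret is typically well below the bound, since the subgaussian tail inequality used in the proof is not tight for Gaussian noise.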
Proof. Assume without loss of generality that the first arm is optimal, which means that $\mu_1 = \mu^* = \max_i \mu_i$. By the decomposition given in Lemma 4.5 the regret can be written as
$$R_n = \sum_{i=1}^{k} \Delta_i\, \mathbb{E}[T_i(n)]\,. \qquad (6.1)$$
In the first $mk$ rounds the policy is deterministic, choosing each action exactly $m$ times. Subsequently it chooses a single action maximizing the average reward during exploration. Thus,
$$\mathbb{E}[T_i(n)] \le m + (n - mk)\, \mathbb{P}(A_{mk+1} = i) \le m + (n - mk)\, \mathbb{P}\Big(\hat\mu_i(mk) \ge \max_{j \ne i} \hat\mu_j(mk)\Big)\,. \qquad (6.2)$$
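The commit probability appearing in (6.2) can be checked by direct simulation: after the exploration phase each empirical mean is Gaussian with variance $1/m$, so sampling them and counting how often the suboptimal arm wins gives an estimate to compare against the tail bound $\exp(-m\Delta_i^2/4)$ derived below. The gap and $m$ here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, -0.5])  # arm 1 optimal, gap Delta_2 = 0.5 (assumed values)
m, trials = 50, 20000

# Empirical means after exploration: mu_hat_i ~ N(mu_i, 1/m) for Gaussian rewards.
mu_hat = rng.normal(mu, np.sqrt(1.0 / m), size=(trials, 2))

# Fraction of trials where ETC commits to the suboptimal arm 2.
p_commit_wrong = np.mean(np.argmax(mu_hat, axis=1) == 1)

tail_bound = np.exp(-m * 0.5**2 / 4)  # exp(-m Delta^2 / 4) from Theorem 6.1
print(p_commit_wrong, tail_bound)     # the estimate should lie below the bound
```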
The probability on the right-hand side is bounded by
$$\mathbb{P}\Big(\hat\mu_i(mk) \ge \max_{j \ne i} \hat\mu_j(mk)\Big) \le \mathbb{P}\big(\hat\mu_i(mk) \ge \hat\mu_1(mk)\big) = \mathbb{P}\big(\hat\mu_i(mk) - \mu_i - (\hat\mu_1(mk) - \mu_1) \ge \Delta_i\big)\,.$$
The next step is to check that $\hat\mu_i(mk) - \mu_i - (\hat\mu_1(mk) - \mu_1)$ is $\sqrt{2/m}$-subgaussian, which by the properties of subgaussian random variables follows from the