Bandit Algorithms

9.1 The MOSS algorithm

but $\Delta$ is sufficiently small in expectation that this price is small. Using the basic
regret decomposition (Lemma 4.5) and splitting the actions according to whether
their suboptimality gap is larger than $2\Delta$ leads to

\begin{align*}
R_n &= \sum_{i:\Delta_i>0} \Delta_i \mathbb{E}[T_i(n)] \\
&\leq \mathbb{E}\Bigg[2n\Delta + \sum_{i:\Delta_i>2\Delta} \Delta_i T_i(n)\Bigg] \\
&\leq \mathbb{E}\Bigg[2n\Delta + 8\sqrt{kn} + \sum_{i:\Delta_i>\max\{2\Delta,\,8\sqrt{k/n}\}} \Delta_i T_i(n)\Bigg],
\end{align*}
where the last inequality holds because the arms with $\Delta_i \leq 8\sqrt{k/n}$ contribute
at most $8\sqrt{k/n}\sum_i T_i(n) = 8\sqrt{kn}$ in total.


The first term is easily bounded using Proposition 2.8 and Lemma 9.3.


\begin{align*}
\mathbb{E}[2n\Delta] = 2n\mathbb{E}[\Delta] = 2n\int_0^\infty \mathbb{P}(\Delta \geq x)\,dx
\leq 2n\int_0^\infty \min\left\{1,\ \frac{15k}{nx^2}\right\} dx = 4\sqrt{15kn}.
\end{align*}
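The closed-form value of that integral can be checked numerically: splitting at the crossover point $x_0 = \sqrt{15k/n}$ gives $\int_0^\infty \min\{1, 15k/(nx^2)\}\,dx = 2\sqrt{15k/n}$. The snippet below is an illustrative verification with arbitrary values of $n$ and $k$ (not taken from the book).

```python
import math

# Numerical check of  ∫₀^∞ min{1, 15k/(n x²)} dx = 2·sqrt(15k/n),
# which multiplied by 2n gives the 4·sqrt(15kn) bound on E[2nΔ].
# n and k are arbitrary illustrative values.
n, k = 10_000, 10
c = 15 * k / n  # integrand is min{1, c/x²}

def integrand(x):
    return min(1.0, c / (x * x)) if x > 0 else 1.0

# Midpoint rule on [0, X], plus the exact tail ∫_X^∞ c/x² dx = c/X.
X, m = 100.0 * math.sqrt(c), 200_000
h = X / m
numeric = sum(integrand((j + 0.5) * h) for j in range(m)) * h + c / X
closed_form = 2.0 * math.sqrt(c)
print(numeric, closed_form)  # the two agree to several decimal places
```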

For suboptimal arm $i$, define
\begin{align*}
\kappa_i = \sum_{s=1}^n \mathbb{I}\left\{\hat\mu_{is} + \sqrt{\frac{4}{s}\log^+\!\left(\frac{n}{ks}\right)} \geq \mu_i + \frac{\Delta_i}{2}\right\}.
\end{align*}
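The quantity $\kappa_i$ counts the rounds in which arm $i$'s inflated empirical mean would still clear the threshold $\mu_i + \Delta_i/2$. A hedged simulation (arbitrary instance and constants, not from the book) shows this count is small relative to $n$ for a well-separated arm:

```python
import math
import random

# Hedged Monte Carlo estimate of κ_i for a single unit-variance Gaussian arm.
# The instance (n, k, Δ_i) is an arbitrary illustration with Δ_i well above
# the 8·sqrt(k/n) threshold used in the proof.
random.seed(0)

def log_plus(x):
    return max(0.0, math.log(x))

def kappa(n, k, mu, delta, rng):
    s_sum, count = 0.0, 0
    for s in range(1, n + 1):
        s_sum += rng.gauss(mu, 1.0)  # s-th reward of arm i
        mu_hat = s_sum / s           # empirical mean after s plays
        bonus = math.sqrt(4.0 / s * log_plus(n / (k * s)))
        if mu_hat + bonus >= mu + delta / 2:
            count += 1
    return count

n, k, delta = 10_000, 10, 0.5  # delta >> 8*sqrt(k/n) ≈ 0.25
est = sum(kappa(n, k, 0.0, delta, random) for _ in range(20)) / 20
print(est)  # typically a few hundred rounds, far smaller than n
```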


The reason for choosing $\kappa_i$ in this way is that for arms $i$ with $\Delta_i > 2\Delta$ it holds
that the index of the optimal arm is always larger than $\mu_i + \Delta_i/2$, so $\kappa_i$ is an
upper bound on the number of times arm $i$ is played, $T_i(n)$. If $\Delta_i \geq 8\sqrt{k/n}$,
then the expectation of $\Delta_i\kappa_i$ is bounded using Lemma 8.2 by


\begin{align*}
\Delta_i\mathbb{E}[\kappa_i]
&\leq \frac{1}{\Delta_i} + \Delta_i\mathbb{E}\left[\sum_{s=1}^n \mathbb{I}\left\{\hat\mu_{is} + \sqrt{\frac{4}{s}\log^+\!\left(\frac{n\Delta_i^2}{k}\right)} \geq \mu_i + \frac{\Delta_i}{2}\right\}\right] \\
&\leq \frac{1}{\Delta_i} + \Delta_i + \frac{8}{\Delta_i}\left(2\log^+\!\left(\frac{n\Delta_i^2}{k}\right) + \sqrt{2\pi\log^+\!\left(\frac{n\Delta_i^2}{k}\right)} + 1\right) \\
&\leq \frac{1}{8}\sqrt{\frac{n}{k}} + \Delta_i + \sqrt{\frac{n}{k}}\left(4\log 8 + 2\sqrt{\pi\log 8} + 1\right) \\
&\leq \Delta_i + 15\sqrt{\frac{n}{k}},
\end{align*}


where the first inequality follows by replacing the $s$ in the logarithm with $1/\Delta_i^2$
and adding the $\Delta_i \times 1/\Delta_i^2$ correction term to compensate for the first $\Delta_i^{-2}$
rounds where this fails to hold. Then we use Lemma 8.2 and the monotonicity of
$x \mapsto x^{-1}\log_+^p(ax^2)$ for $p \in [0,1]$, positive $a$ and $x \geq e/\sqrt{a}$. The last inequality
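The final numerical step in the display above, collapsing the constants into $15$, can be verified directly; the check below is illustrative and not part of the book's proof.

```python
import math

# Verify the constant in the last inequality of the Δ_i·E[κ_i] chain:
# 1/8 + 4·log 8 + 2·sqrt(π·log 8) + 1 must be at most 15, so the bound
# Δ_i + 15·sqrt(n/k) follows.
const = 1 / 8 + 4 * math.log(8) + 2 * math.sqrt(math.pi * math.log(8)) + 1
print(const)  # ≈ 14.55, comfortably below 15
```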