9.1 The MOSS algorithm

Proof Let $S_t = t\hat\mu_t$. Then

$$
P\left(\exists s \ge 1 : \hat\mu_s + \sqrt{\frac{4}{s}\log^+\left(\frac{1}{s\delta}\right)} + \Delta \le 0\right)
= P\left(\exists s \ge 1 : S_s + \sqrt{4s\log^+\left(\frac{1}{s\delta}\right)} + s\Delta \le 0\right)
$$
$$
\le \sum_{j=0}^{\infty} P\left(\exists s \in [2^j, 2^{j+1}] : S_s + \sqrt{4s\log^+\left(\frac{1}{s\delta}\right)} + s\Delta \le 0\right)
$$
$$
\le \sum_{j=0}^{\infty} P\left(\exists s \le 2^{j+1} : S_s + \sqrt{4 \cdot 2^j \log^+\left(\frac{1}{2^{j+1}\delta}\right)} + 2^j\Delta \le 0\right)
$$
$$
\le \sum_{j=0}^{\infty} \exp\left(-\frac{\left(\sqrt{2^{j+2}\log^+\left(\frac{1}{2^{j+1}\delta}\right)} + 2^j\Delta\right)^2}{2^{j+2}}\right).
$$


In the first inequality we used the union bound, but rather than applying it to every time step as we did in the proof of Theorem 8.1, we apply it on a geometric grid. The second step is straightforward but important, because it sets up the application of Theorem 9.2; the details are spelled out below. The rest is purely algebraic.
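In more detail, for $s \in [2^j, 2^{j+1}]$ we have $s\Delta \ge 2^j\Delta$ and, since $\log^+(1/(s\delta))$ is nonincreasing in $s$,

$$
\sqrt{4s\log^+\left(\frac{1}{s\delta}\right)} \ge \sqrt{4 \cdot 2^j \log^+\left(\frac{1}{2^{j+1}\delta}\right)} = \sqrt{2^{j+2}\log^+\left(\frac{1}{2^{j+1}\delta}\right)},
$$

so each event in the second line implies the corresponding event in the third. The last step then follows from Theorem 9.2, taking it here to be the maximal inequality $P(\exists s \le n : S_s \ge \varepsilon) \le \exp(-\varepsilon^2/(2n))$ for sums of independent $1$-subgaussian random variables, applied to $-S_s$ with $n = 2^{j+1}$ and $\varepsilon = \sqrt{2^{j+2}\log^+(1/(2^{j+1}\delta))} + 2^j\Delta$, noting that $2 \cdot 2^{j+1} = 2^{j+2}$.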

$$
\sum_{j=0}^{\infty} \exp\left(-\frac{\left(\sqrt{2^{j+2}\log^+\left(\frac{1}{2^{j+1}\delta}\right)} + 2^j\Delta\right)^2}{2^{j+2}}\right)
\le \delta \sum_{j=0}^{\infty} 2^{j+1}\exp\left(-\Delta^2 2^{j-2}\right)
\le \frac{8\delta}{e\Delta^2} + \delta\int_0^{\infty} 2^{s+1}\exp\left(-\Delta^2 2^{s-2}\right) ds
\le \frac{15\delta}{\Delta^2}\,,
$$


where the first inequality follows since $(a+b)^2 \ge a^2 + b^2$ for $a, b \ge 0$, and the second-to-last step follows by noting that the summand is unimodal as a function of $j$ and has maximum value $8\delta/(e\Delta^2)$. For such functions $f$ one may bound $\sum_{j=a}^{b} f(j) \le \max_{s \in [a,b]} f(s) + \int_a^b f(s)\, ds$.
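To see where the constant $15$ comes from, one can evaluate the maximum and the integral explicitly; the following substitution $u = 2^{s-2}$ (so that $ds = du/(u\log 2)$ and $2^{s+1} = 8u$) is one way to carry out the computation:

$$
\delta\int_0^{\infty} 2^{s+1}\exp\left(-\Delta^2 2^{s-2}\right) ds
= \frac{8\delta}{\log 2}\int_{1/4}^{\infty} e^{-\Delta^2 u}\, du
= \frac{8\delta\, e^{-\Delta^2/4}}{\Delta^2 \log 2}
\le \frac{8\delta}{\Delta^2 \log 2},
$$

and since $8/e + 8/\log 2 \approx 2.94 + 11.54 < 15$, the claimed bound follows.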
Proof of Theorem 9.1 As usual, we assume without loss of generality that the first arm is optimal, so $\mu_1 = \mu^*$. Arguing that the optimal arm is sufficiently optimistic with high probability is no longer satisfactory, because in this refined analysis the probability that an arm is played linearly often needs to depend on its suboptimality gap. A way around this difficulty is to make an argument in terms of the expected amount of optimism. Define a random variable $\Delta$ that measures how far the index of the optimal arm drops below its true mean:

$$
\Delta = \left(\mu_1 - \min_{s \le n}\left(\hat\mu_{1s} + \sqrt{\frac{4}{s}\log^+\left(\frac{n}{ks}\right)}\right)\right)^{\!+} .
$$
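The quantity inside the minimum is the MOSS index itself, evaluated along the sample path of the optimal arm. To make the algorithm concrete, here is a minimal simulation sketch in Python; the reward model (unit-variance Gaussian arms), the function names and the example parameters are illustrative assumptions, not the book's pseudocode:

```python
import math
import random

def log_plus(x):
    """log^+(x) = log(max(1, x))."""
    return math.log(max(1.0, x))

def moss(means, n):
    """Simulate n rounds of MOSS on unit-variance Gaussian arms.

    After T_i plays, arm i with empirical mean mu_i has index
        mu_i + sqrt((4 / T_i) * log^+(n / (k * T_i))).
    """
    k = len(means)
    counts = [0] * k    # T_i(t-1): number of plays of arm i so far
    totals = [0.0] * k  # cumulative reward of arm i
    for t in range(n):
        if t < k:
            i = t  # play each arm once so every index is defined
        else:
            i = max(
                range(k),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt((4.0 / counts[a]) * log_plus(n / (k * counts[a]))),
            )
        counts[i] += 1
        totals[i] += random.gauss(means[i], 1.0)  # 1-subgaussian reward
    return counts

# Example: with gaps 0.5 and 1.0 the first (optimal) arm
# should receive the vast majority of the 10,000 plays.
print(moss([0.0, -0.5, -1.0], n=10_000))
```

Note the design difference from UCB: the confidence radius uses $\log^+(n/(kT_i))$ rather than a $\log t$ term, so the exploration bonus vanishes once an arm has been played roughly $n/k$ times.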


Arms with suboptimality gaps much larger than $\Delta$ will not be played too often, while arms with suboptimality gaps smaller than $\Delta$ may be played linearly often, but the expectation of $\Delta$ is small enough that the regret incurred this way remains controlled.
