Bandit Algorithms

9.2 Two problems 122

follows by naively bounding 1/8 + 4 log 8 + 2

√

πlog 8 + 1≤15. Then

E



 ∑

i:∆i>max

{

2∆, 8

√

k/n

}

∆iTi(n)



≤E



 ∑

i:∆i> 8

√

k/n

∆iκi





≤

∑

i:∆i> 8

√

k/n

(

∆i+ 9

√

n k

)

≤ 15

√

nk+

∑k

i=1

∆i.

Combining all the results we haveRn≤ 38

√

kn+

∑k i=1∆i.

9.2 Two problems

MOSS is not the ultimate algorithm. Here we highlight two drawbacks.

Suboptimality relative to UCB Although MOSS is nearly asymptotically optimal (Note 1), all versions of MOSS can be arbitrarily worse than UCB in some regimes. This unpleasantness is hidden by both the minimax and asymptotic optimality criteria, which highlights the importance of fully finite-time upper and lower bounds. The counter-example witnessing the failure is quite simple. Let the rewards for all arms be Gaussian with unit variance andn=k^3 ,μ 1 = 0,μ 2 =−

√

k/nandμi=−1 for alli >2. From Theorem 8.1 we have that

RUCBn =O(klogk),

while it turns out that MOSS has a regret of

RMOSSn = Ω(

√

kn) = Ω(k^2 ).

A rigorous proof of this claim is quite delicate, but we encourage readers to try to understand why it holds intuitively.

Variance There is a hidden cost of pushing too hard to reduce the expected regret, which is that the variance of the regret can grow significantly. We analyze this tradeoff formally in a future chapter, but sketch the intuition here. Consider the two-armed case with suboptimality gap ∆ and Gaussian noise. Then the regret of a carefully tuned algorithm is approximately

Rn=O

(

n∆δ+

1

∆

log

(

1

δ

))

,

Bandit Algorithms

√

E



 ∑

{

√

}



≤E



 ∑

√





≤

∑

√

(

√

)

≤ 15

√

√

9.2 Two problems

√

√

(

1

∆

(

1

))

,

Get our desktop app

Company

Features

Documentation

Resources