Bandit Algorithms

(Jeff_L) #1
9.2 Two problems 122

follows by naively bounding 1/8 + 4 log 8 + 2


πlog 8 + 1≤15. Then

E



 ∑


i:∆i>max

{


2∆, 8


k/n

}


∆iTi(n)


≤E



 ∑


i:∆i> 8


k/n

∆iκi







i:∆i> 8


k/n

(


∆i+ 9


n
k

)


≤ 15



nk+

∑k

i=1

∆i.

Combining all the results we haveRn≤ 38


kn+

∑k
i=1∆i.

9.2 Two problems


MOSS is not the ultimate algorithm. Here we highlight two drawbacks.

Suboptimality relative to UCB
Although MOSS is nearly asymptotically optimal (Note 1), all versions of MOSS
can be arbitrarily worse than UCB in some regimes. This unpleasantness is hidden
by both the minimax and asymptotic optimality criteria, which highlights the
importance of fully finite-time upper and lower bounds. The counter-example
witnessing the failure is quite simple. Let the rewards for all arms be Gaussian
with unit variance andn=k^3 ,μ 1 = 0,μ 2 =−


k/nandμi=−1 for alli >2.
From Theorem 8.1 we have that

RUCBn =O(klogk),

while it turns out that MOSS has a regret of

RMOSSn = Ω(


kn) = Ω(k^2 ).

A rigorous proof of this claim is quite delicate, but we encourage readers to try
to understand why it holds intuitively.

Variance
There is a hidden cost of pushing too hard to reduce the expected regret, which
is that the variance of the regret can grow significantly. We analyze this tradeoff
formally in a future chapter, but sketch the intuition here. Consider the two-armed
case with suboptimality gap ∆ and Gaussian noise. Then the regret of a carefully
tuned algorithm is approximately

Rn=O

(


n∆δ+

1



log

(


1


δ

))


,

Free download pdf