25.2 Clouds looming for optimism
The constraints mean that
\[
\frac{1}{\alpha(a_3)\varepsilon^2\gamma^2 + \alpha(a_2)}
= \lim_{\alpha(a_1)\to\infty} \|a_2\|^2_{H(\alpha)^{-1}} \le \frac{1}{2}
\qquad\text{and}\qquad
\frac{\gamma^2\varepsilon^2}{\alpha(a_3)\varepsilon^2\gamma^2 + \alpha(a_2)}
= \lim_{\alpha(a_1)\to\infty} \|a_3\|^2_{H(\alpha)^{-1}} \le \frac{\varepsilon^2}{2}\,.
\]
The first inequality requires that $\alpha(a_3)\varepsilon^2\gamma^2 + \alpha(a_2) \ge 2$ and the second that this quantity be at least $2\gamma^2$, so provided that $\gamma \ge 1$ the pair reduces to the single constraint
\[
\alpha(a_3)\varepsilon^2\gamma^2 + \alpha(a_2) \ge 2\gamma^2\,.
\]
Since we are minimizing $\alpha(a_2) + \varepsilon\alpha(a_3)$, we can easily see that $\alpha(a_2) = 2\gamma^2$ and $\alpha(a_3) = 0$, provided that $2\gamma^2 \le 2/\varepsilon$. Therefore, if $\varepsilon$ is chosen sufficiently small relative to $\gamma$, then the optimal rate of the regret is $c(\mathcal{A}, \theta) = 2\gamma^2$, and so by Theorem 25.3 there exists a policy such that
\[
\limsup_{n\to\infty} \frac{R_n(\mathcal{A}, \theta)}{\log(n)} = 2\gamma^2\,.
\]
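The minimization above is a linear program in two variables with a single linear constraint, so its optimum sits at one of two vertices: all mass on $a_2$ ($\alpha(a_2) = 2\gamma^2$) or all mass on $a_3$ ($\alpha(a_3) = 2/\varepsilon^2$). A minimal numeric sketch of this comparison; the function name `optimal_rate` and the example values are ours, not the book's:

```python
# Sketch (not from the text): check the claimed optimal allocation for
#   minimize  alpha2 + eps * alpha3
#   subject to alpha3 * eps^2 * gamma^2 + alpha2 >= 2 * gamma^2.
# A two-variable LP with one constraint attains its optimum at a vertex:
# either alpha2 = 2*gamma^2 (alpha3 = 0) or alpha3 = 2/eps^2 (alpha2 = 0).

def optimal_rate(eps: float, gamma: float) -> float:
    """Return c(A, theta): the cheaper of the two vertex solutions."""
    cost_a2_only = 2 * gamma**2        # alpha2 = 2*gamma^2, objective = alpha2
    cost_a3_only = eps * (2 / eps**2)  # alpha3 = 2/eps^2, objective = eps*alpha3 = 2/eps
    return min(cost_a2_only, cost_a3_only)

# With eps small relative to gamma (2*gamma^2 <= 2/eps), a_2 alone is optimal:
print(optimal_rate(eps=0.01, gamma=2.0))  # 8.0, i.e. 2*gamma^2
```

For $\varepsilon = 0.01$ and $\gamma = 2$, playing $a_2$ alone costs $2\gamma^2 = 8$ while playing $a_3$ alone costs $2/\varepsilon = 200$, matching the claim that $a_2$ carries all the exploration once $2\gamma^2 \le 2/\varepsilon$.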
Now we argue that, for $\gamma$ sufficiently large and $\varepsilon$ arbitrarily small, the regret of any consistent optimistic algorithm is at least
\[
\limsup_{n\to\infty} \frac{R_n(\mathcal{A}, \theta)}{\log(n)} = \Omega(1/\varepsilon)\,,
\]
which can be arbitrarily worse than the optimal rate! So why is this so? Recall that optimistic algorithms choose
\[
A_t = \operatorname*{argmax}_{a \in \mathcal{A}} \max_{\tilde\theta \in \mathcal{C}_t} \langle a, \tilde\theta \rangle\,,
\]
where $\mathcal{C}_t \subset \mathbb{R}^d$ is a confidence set that we assume contains the true $\theta$ with high probability. So far this does not greatly restrict the class of algorithms that we might call optimistic. We now assume that there exists a constant $c > 0$ such that
\[
\mathcal{C}_t \subseteq \left\{ \tilde\theta : \|\hat\theta_t - \tilde\theta\|_{V_t} \le c\sqrt{\log(n)} \right\},
\]
where $V_t = \sum_{s=1}^t A_s A_s^\top$. So now we ask how often we can expect the optimistic algorithm to choose action $a_2 = e_2$ in the example described above. Since we have assumed that $\theta \in \mathcal{C}_t$ with high probability, we have that
\[
\max_{\tilde\theta \in \mathcal{C}_t} \langle a_1, \tilde\theta \rangle \ge 1\,.
\]
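For an ellipsoidal confidence set of this form, the inner maximum has a standard closed form, $\max_{\tilde\theta \in \mathcal{C}_t} \langle a, \tilde\theta \rangle = \langle a, \hat\theta_t \rangle + c\sqrt{\log(n)}\,\|a\|_{V_t^{-1}}$, and since $\theta \in \mathcal{C}_t$ and $\langle a_2, \theta \rangle = 0$, the index of $a_2$ is at most $2c\sqrt{\log(n)}\,\|a_2\|_{V_t^{-1}}$. A numeric sketch of this bound on the chapter's example; the helper `index_upper_bound` and the specific counts are our own illustration, not the book's code:

```python
import math
import numpy as np

# Sketch (our illustration): with the ellipsoidal confidence set C_t, once
# theta lies in C_t the optimistic index of an action a with <a, theta> = 0
# is at most 2 * c * sqrt(log n) * ||a||_{V_t^{-1}}.

def index_upper_bound(a, counts, actions, c, n):
    """Worst-case optimistic index of action a when <a, theta> = 0."""
    V = sum(T * np.outer(b, b) for T, b in zip(counts, actions))
    norm = math.sqrt(a @ np.linalg.solve(V, a))  # ||a||_{V^{-1}}
    return 2 * c * math.sqrt(math.log(n)) * norm

eps, gamma, c, n = 0.1, 10.0, 1.0, 10_000
actions = [np.array([1.0, 0.0]),               # a_1 = e_1
           np.array([0.0, 1.0]),               # a_2 = e_2
           np.array([1 - eps, gamma * eps])]   # a_3 = (1 - eps, gamma * eps)
counts = [9000, 100, 900]  # T_{a_2} = 100 > 4 * c^2 * log(n) ~ 36.8

print(index_upper_bound(actions[1], counts, actions, c, n) < 1)  # True
```

Here the upper bound on the index of $a_2$ has dropped below $\langle a_1, \theta \rangle = 1$, so an optimistic algorithm stops choosing $a_2$, exactly as the argument below requires.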
On the other hand, if $T_{a_2}(t-1) > 4c^2\log(n)$, then
\[
\max_{\tilde\theta \in \mathcal{C}_t} \langle a_2, \tilde\theta \rangle
= \max_{\tilde\theta \in \mathcal{C}_t} \langle a_2, \tilde\theta - \theta \rangle
\le 2c\,\|a_2\|_{V_t^{-1}} \sqrt{\log(n)}
\le 2c\,\sqrt{\frac{\log(n)}{T_{a_2}(t-1)}} < 1\,,
\]
which means that $a_2$ will not be chosen more than $1 + 4c^2\log(n)$ times. So if $\gamma = \Omega(c^2)$, then the optimistic algorithm will not choose $a_2$ sufficiently often, and a simple computation shows it must choose $a_3$ at least $\Omega(\log(n)/\varepsilon^2)$ times