Bandit Algorithms


Proof of Theorem 27.3  Choosing γ = dη ensures that |η⟨a, Ŷ_t⟩| ≤ 1 for all a ∈ A.
The standard argument (Exercise 27.9) shows that

\[
R_n \le \frac{1}{\eta}\,\mathbb{E}\left[\log\left(\frac{\operatorname{vol}(\mathcal{A})}{\int_{\mathcal{A}} \exp\Bigl(-\eta \sum_{t=1}^{n} \bigl(\hat Y_t(a) - \hat Y_t(a^*)\bigr)\Bigr)\, da}\right)\right] + 3\eta d n\,. \tag{27.5}
\]

Using again that η|⟨a, Ŷ_t⟩| ≤ 1 and Proposition 27.4 with u = η∑_{t=1}^n Ŷ_t shows
that

\[
R_n \le \frac{1 + d\log(n)}{\eta} + 3\eta d n \le 2\sqrt{3dn\bigl(1 + d\log(2n)\bigr)}.
\]
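
The final inequality is the usual tuning of the learning rate. A minimal sketch of the calculation (the precise value of η is an assumption here, not quoted from the theorem statement): for a, b > 0 the map η ↦ a/η + bη is minimized at η = √(a/b), where it equals 2√(ab). Taking a = 1 + d log(2n) ≥ 1 + d log(n) and b = 3dn, the choice η = √(a/b) yields

\[
\frac{1 + d\log(n)}{\eta} + 3\eta d n \;\le\; \frac{1 + d\log(2n)}{\eta} + 3\eta d n \;=\; 2\sqrt{3dn\bigl(1 + d\log(2n)\bigr)}.
\]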

27.4 Notes


1 A naive implementation of Algorithm 15 has computation complexity O(kd + d^3)
per round. There is also the one-off cost of computing the exploration
distribution, the complexity of which was discussed in Chapter 21. The real
problem is that k can be extremely large. This is especially true when the
action set is combinatorial. For example, when A = {a ∈ R^d : a_i = ±1} is
the set of corners of the hypercube, then |A| = 2^d, which is much too large unless the
dimension is small. Such problems call for a different approach that we present
in the next chapter and in Chapter 30.
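
To get a feel for these numbers, a small back-of-the-envelope script; the cost model kd + d^3 is just the count quoted above with all constant factors dropped, so this is an illustration rather than a benchmark:

```python
# Rough per-round cost of a naive implementation: O(kd) to update the weights
# over the k actions plus O(d^3) for the d x d linear algebra of the estimator.
# For the hypercube action set, k = 2^d, so the kd term explodes with d.

def naive_round_cost(k: int, d: int) -> int:
    """Crude operation count for a single round (constants ignored)."""
    return k * d + d ** 3

for d in (5, 10, 20, 40):
    k = 2 ** d  # number of corners of the hypercube {-1, +1}^d
    print(f"d={d:>2}  k=2^d={k:,}  per-round cost ~ {naive_round_cost(k, d):,}")
```

Already at d = 40 there are about 10^12 actions, so enumerating A is hopeless and the different approach mentioned above is needed.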
2 It is not important to find exactly the optimal exploration distribution. All
that is needed is a bound on Eq. (27.3), which for the exploration distribution
based on the Kiefer-Wolfowitz theorem is justd. However, unlike in the finite
case, exploration is crucial and cannot be removed (Exercise 27.8).
3 The O(√n) dependence of the regret on the horizon is not improvable, but the
linear dependence on the dimension is suboptimal for certain action sets and
optimal for others. An example where improvement is possible occurs when A
is the unit ball, which is analyzed in the next chapter.
4 A slight modification of the setup allows the action set to change in each
round, but where actions have identities. Suppose that k ∈ {1, 2, ...} and A_t =
{a_1(t), ..., a_k(t)}, and the adversary chooses losses so that max_{a∈A_t} |⟨a, y_t⟩| ≤ 1
for all t. Then a straightforward adaptation of Algorithm 15 and Theorem 27.1
leads to an algorithm for which

\[
R_n = \max_{i \in [k]} \mathbb{E}\left[\sum_{t=1}^{n} \langle A_t - a_i(t), y_t\rangle\right] \le 2\sqrt{3 d n \log(k)}.
\]

The definition of the regret still compares the learner to the best single action
in hindsight, which makes it less meaningful than the definition of the regret
in Chapter 19 for stochastic linear bandits with changing action sets. These
differences are discussed in more detail in Chapter 29. See also Exercise 27.5.
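
To make the adaptation concrete, here is a minimal numpy sketch of how the bookkeeping could look when the feature vectors change but identities persist. It is not Algorithm 15 verbatim: uniform exploration over the k identities stands in for the optimal-design exploration of Chapter 21, and the callbacks get_actions(t) and get_loss_vector(t) are hypothetical placeholders for the environment.

```python
import numpy as np

def exp3_changing_actions(get_actions, get_loss_vector, n, d, k, eta, gamma, rng=None):
    """Exponential weights over k action identities whose feature vectors may
    change each round (a sketch under the assumptions stated above)."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.zeros(k)                       # cumulative estimated losses, one per identity
    total_loss = 0.0
    for t in range(n):
        A = get_actions(t)                # (k, d) array with rows a_1(t), ..., a_k(t)
        w = np.exp(-eta * (S - S.min()))  # shift by the minimum for numerical stability
        p = (1.0 - gamma) * w / w.sum() + gamma / k   # mix in uniform exploration
        i = rng.choice(k, p=p)            # play identity i, i.e. the action A[i]
        y = get_loss_vector(t)            # (d,) loss vector chosen by the adversary
        loss = float(A[i] @ y)            # only this scalar is observed
        total_loss += loss
        # Least-squares loss estimator y_hat = Q^{-1} a_i(t) * <a_i(t), y_t>, where
        # Q = sum_j p_j a_j(t) a_j(t)^T; assumes the a_j(t) span R^d so Q is invertible.
        # Forming Q and solving the linear system are the expensive steps each round.
        Q = (A * p[:, None]).T @ A
        y_hat = np.linalg.solve(Q, A[i]) * loss
        S += A @ y_hat                    # update every identity's estimated loss
    return total_loss
```

The only change from the fixed-action case is that the features A are re-read in every round, while the cumulative estimated losses S stay indexed by identity; this is exactly what makes the comparison against a fixed identity i in the regret definition above well defined.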