Bandit Algorithms


Proof of Theorem 27.3  Choosing γ = dη ensures that |η⟨a, Ŷ_t⟩| ≤ 1 for all a ∈ A.
The standard argument (Exercise 27.9) shows that

\[
R_n \le \frac{1}{\eta}\,\mathbb{E}\left[\log\left(\frac{\operatorname{vol}(\mathcal{A})}{\int_{\mathcal{A}} \exp\Bigl(-\eta \sum_{t=1}^{n} \bigl(\hat Y_t(a) - \hat Y_t(a^*)\bigr)\Bigr)\, da}\right)\right] + 3\eta d n\,. \tag{27.5}
\]

Using again that η|⟨a, Ŷ_t⟩| ≤ 1 and Proposition 27.4 with u = η∑_{t=1}^n Ŷ_t shows
that

\[
R_n \le \frac{1 + d\log(n)}{\eta} + 3\eta d n \le 2\sqrt{3dn\bigl(1 + d\log(2n)\bigr)}.
\]
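
The final inequality is the usual tuning of the learning rate. A minimal sketch of the calculation (the precise value of η is an assumption here, not quoted from the theorem statement): for a, b > 0 the map η ↦ a/η + bη is minimized at η = √(a/b), where it equals 2√(ab). Taking a = 1 + d log(2n) ≥ 1 + d log(n) and b = 3dn, the choice η = √(a/b) yields

\[
\frac{1 + d\log(n)}{\eta} + 3\eta d n \;\le\; \frac{1 + d\log(2n)}{\eta} + 3\eta d n \;=\; 2\sqrt{3dn\bigl(1 + d\log(2n)\bigr)}.
\]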

27.4 Notes


1 A naive implementation of Algorithm 15 has computation complexity O(kd + d^3)
per round. There is also the one-off cost of computing the exploration
distribution, the complexity of which was discussed in Chapter 21. The real
problem is that k can be extremely large. This is especially true when the
action set is combinatorial. For example, when A = {a ∈ R^d : a_i = ±1} is
the set of corners of the hypercube, then |A| = 2^d, which is much too large unless the
dimension is small. Such problems call for a different approach that we present
in the next chapter and in Chapter 30.
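
To get a feel for these numbers, a small back-of-the-envelope script; the cost model kd + d^3 is just the count quoted above with all constant factors dropped, so this is an illustration rather than a benchmark:

```python
# Rough per-round cost of a naive implementation: O(kd) to update the weights
# over the k actions plus O(d^3) for the d x d linear algebra of the estimator.
# For the hypercube action set, k = 2^d, so the kd term explodes with d.

def naive_round_cost(k: int, d: int) -> int:
    """Crude operation count for a single round (constants ignored)."""
    return k * d + d ** 3

for d in (5, 10, 20, 40):
    k = 2 ** d  # number of corners of the hypercube {-1, +1}^d
    print(f"d={d:>2}  k=2^d={k:,}  per-round cost ~ {naive_round_cost(k, d):,}")
```

Already at d = 40 there are about 10^12 actions, so enumerating A is hopeless and the different approach mentioned above is needed.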
2 It is not important to find exactly the optimal exploration distribution. All
that is needed is a bound on Eq. (27.3), which for the exploration distribution
based on the Kiefer-Wolfowitz theorem is justd. However, unlike in the finite
case, exploration is crucial and cannot be removed (Exercise 27.8).
3 The O(√n) dependence of the regret on the horizon is not improvable, but the
linear dependence on the dimension is suboptimal for certain action sets and
optimal for others. An example where improvement is possible occurs when A
is the unit ball, which is analyzed in the next chapter.
4 A slight modification of the setup allows the action set to change in each
round, but where actions have identities. Suppose that k ∈ {1, 2, ...} and A_t =
{a_1(t), ..., a_k(t)}, and the adversary chooses losses so that max_{a∈A_t} |⟨a, y_t⟩| ≤ 1
for all t. Then a straightforward adaptation of Algorithm 15 and Theorem 27.1
leads to an algorithm for which

\[
R_n = \max_{i \in [k]} \mathbb{E}\left[\sum_{t=1}^{n} \langle A_t - a_i(t), y_t\rangle\right] \le 2\sqrt{3 d n \log(k)}.
\]

The definition of the regret still compares the learner to the best single action
in hindsight, which makes it less meaningful than the definition of the regret
in Chapter 19 for stochastic linear bandits with changing action sets. These
differences are discussed in more detail in Chapter 29. See also Exercise 27.5.
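
To make the adaptation concrete, here is a minimal numpy sketch of how the bookkeeping could look when the feature vectors change but identities persist. It is not Algorithm 15 verbatim: uniform exploration over the k identities stands in for the optimal-design exploration of Chapter 21, and the callbacks get_actions(t) and get_loss_vector(t) are hypothetical placeholders for the environment.

```python
import numpy as np

def exp3_changing_actions(get_actions, get_loss_vector, n, d, k, eta, gamma, rng=None):
    """Exponential weights over k action identities whose feature vectors may
    change each round (a sketch under the assumptions stated above)."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.zeros(k)                       # cumulative estimated losses, one per identity
    total_loss = 0.0
    for t in range(n):
        A = get_actions(t)                # (k, d) array with rows a_1(t), ..., a_k(t)
        w = np.exp(-eta * (S - S.min()))  # shift by the minimum for numerical stability
        p = (1.0 - gamma) * w / w.sum() + gamma / k   # mix in uniform exploration
        i = rng.choice(k, p=p)            # play identity i, i.e. the action A[i]
        y = get_loss_vector(t)            # (d,) loss vector chosen by the adversary
        loss = float(A[i] @ y)            # only this scalar is observed
        total_loss += loss
        # Least-squares loss estimator y_hat = Q^{-1} a_i(t) * <a_i(t), y_t>, where
        # Q = sum_j p_j a_j(t) a_j(t)^T; assumes the a_j(t) span R^d so Q is invertible.
        # Forming Q and solving the linear system are the expensive steps each round.
        Q = (A * p[:, None]).T @ A
        y_hat = np.linalg.solve(Q, A[i]) * loss
        S += A @ y_hat                    # update every identity's estimated loss
    return total_loss
```

The only change from the fixed-action case is that the features A are re-read in every round, while the cumulative estimated losses S stay indexed by identity; this is exactly what makes the comparison against a fixed identity i in the regret definition above well defined.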