19.4 Notes

made in Theorem 19.2 depends on the dimension $d$ and becomes vacuous when $d$ is large or infinite. This dependence arises from Lemma 19.4. It is possible to modify this result by replacing $d$ with a data-dependent quantity that measures the ‘effective dimension’ of the image of the data under $\phi$. The final challenge is to define an appropriate confidence set. These issues have not yet been resolved in a complete way. See the bibliographic remarks for further details and references.
2 The bound given in Theorem 19.2 is essentially a worst-case style of bound, with little dependence on the parameter $\theta_*$ or the geometry of the action set. Instance-dependent bounds for linear bandits are still an open topic of research, and the asymptotics are only understood in the special case where the action set is finite and unchanging (Chapter 25).
3 In the worst case, the bound in Theorem 19.2 is tight up to logarithmic factors. More details are in Chapter 24, which is devoted to lower bounds for stochastic linear bandits. The environments for which the lower bound nearly matches the upper bound have action sets that are either infinite or exponentially large in the dimension. When $|\mathcal{A}_t| \le k$ for all rounds $t$, there are algorithms for which the regret is
$$R_n = O\left(\sqrt{dn \log^3(nk)}\right).$$
The special case where the action set does not change with time is treated in Chapter 22, where references to the literature are also provided.
4 The calculation in Eq. (19.8) shows that LinUCB has more than just a passing resemblance to the UCB algorithm introduced in Chapter 7. The term $\langle a, \hat\theta_t\rangle$ may be interpreted as an empirical estimate of the reward from choosing action $a$, and $\sqrt{\beta_t}\,\|a\|_{V_{t-1}^{-1}}$ is a bonus term that ensures sufficient exploration. If the penalty term vanishes ($\lambda = 0$) and $\mathcal{A}_t = \{e_1,\dots,e_d\}$ for all $t \in [n]$, then $\hat\theta_i$ becomes the empirical mean of action $e_i$ and the matrix $V_t$ is diagonal with its $i$th diagonal entry being the number of times action $e_i$ is used up to and including round $t$. Then the bonus term has order
$$\sqrt{\beta_t}\,\|e_i\|_{V_{t-1}^{-1}} = \sqrt{\frac{\beta_t}{T_i(t-1)}}\,,$$
where $T_i(t-1)$ is the number of times action $e_i$ has been chosen before the $t$th round. So UCB for finite-armed bandits is recovered by choosing $\beta_t = 2\log(\cdot)$, where the term inside the logarithm can be chosen in a variety of ways, as discussed in earlier chapters. Notice now that the simple analysis given in this chapter leads to a regret bound of $O(\sqrt{dn\log(\cdot)})$, which is quite close to the highly specialized analysis given in Chapters 7 to 9.
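To make the reduction in the previous note concrete, here is a minimal sketch (not code from the book) of the LinUCB index $\langle a, \hat\theta_t\rangle + \sqrt{\beta_t}\,\|a\|_{V^{-1}}$ and of how it collapses to the classical UCB index when $\lambda = 0$ and the actions are the standard basis vectors. The value of $\beta$, the pull counts and the reward sums below are illustrative assumptions.

```python
import numpy as np

def linucb_index(a, V, b, beta):
    """Upper confidence index for action a, where V = lambda*I + sum_s A_s A_s^T
    and b = sum_s A_s X_s, so that theta_hat = V^{-1} b."""
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    # empirical reward estimate plus exploration bonus sqrt(beta) * ||a||_{V^{-1}}
    return a @ theta_hat + np.sqrt(beta) * np.sqrt(a @ V_inv @ a)

# Reduction described in the note: lambda = 0 and A_t = {e_1, ..., e_d}, so V is
# diagonal with the pull counts T_i on the diagonal and theta_hat_i is the
# empirical mean of arm i; the bonus becomes sqrt(beta / T_i).
d = 3
counts = np.array([5.0, 2.0, 8.0])   # T_i(t-1): times each basis action was chosen (made up)
sums = np.array([3.0, 1.5, 4.0])     # cumulative reward of each action (made up)
beta = 2 * np.log(100.0)             # e.g. beta_t = 2 log(.), as in the note

V = np.diag(counts)                  # lambda = 0 with standard-basis actions
b = sums

for i in range(d):
    e_i = np.eye(d)[i]
    lin_idx = linucb_index(e_i, V, b, beta)
    ucb_idx = sums[i] / counts[i] + np.sqrt(beta / counts[i])  # classical UCB index
    print(f"arm {i}: LinUCB index = {lin_idx:.4f}, UCB index = {ucb_idx:.4f}")
```

Under these assumptions the two printed indices coincide for every arm, which is exactly the correspondence the note describes.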
5 An extension of the linear model that is of considerable interest is the generalized linear model, where the reward is
$$X_t = g^{-1}\left(\langle A_t, \theta_*\rangle + \eta_t\right), \qquad (19.9)$$
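As a minimal illustration of Eq. (19.9), the sketch below samples a single reward $X_t = g^{-1}(\langle A_t, \theta_*\rangle + \eta_t)$ for an assumed inverse link $g^{-1}$. The logistic choice of link, the parameter vector and the action are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def glm_reward(action, theta_star, g_inv, noise_std=0.1):
    """Draw one reward X_t = g_inv(<action, theta_star> + eta_t) as in Eq. (19.9)."""
    eta = rng.normal(0.0, noise_std)              # noise eta_t
    return g_inv(action @ theta_star + eta)

theta_star = np.array([0.5, -0.2, 0.3])           # made-up unknown parameter theta_*
action = np.array([1.0, 0.0, 1.0])                # made-up action A_t

identity = lambda z: z                            # identity link: recovers the linear model
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))     # an illustrative inverse link g^{-1}

print(glm_reward(action, theta_star, identity))   # linear-model reward
print(glm_reward(action, theta_star, logistic))   # reward passed through g^{-1}
```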