
22.2 Bibliographic remarks


The algorithms achieving Eq. (22.1) for changing action sets are SupLinRel [Auer, 2002] and BaseLinUCB [Chu et al., 2011]. Both introduce phases to decouple the dependence of the design on the outcomes. Unfortunately, the analysis of these algorithms is long and technical, which prevented us from presenting the ideas here. These algorithms are also less practical than LinUCB (Chapter 19) or Thompson sampling (Chapter 36). Of course, this does not diminish the theoretical breakthrough. Phased elimination algorithms have appeared in many places, but the most similar to the algorithm presented here is the work on spectral bandits by Valko et al. [2014]. None of these works used the Kiefer–Wolfowitz theorem. The idea of using an optimal design for exploration is taken from the literature on adversarial linear bandits, where John’s ellipsoid has been used to define exploration policies [Bubeck et al., 2012]. For more details on adversarial linear bandits, read on to Part VI.

22.3 Exercises


22.1 In this exercise you will prove Theorem 22.1.
(a) Use Theorem 21.1 to show that the length of the $\ell$th phase is bounded by
$$
T_\ell \le \frac{2d}{\varepsilon_\ell^2} \log\left(\frac{k\ell(\ell+1)}{\delta}\right) + \frac{d(d+1)}{2}.
$$
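One way to see where the two terms come from (a sketch, assuming the phase structure of the algorithm in this chapter: in phase $\ell$ each arm $a$ in the support of the G-optimal design $\pi_\ell$ is played $\lceil \pi_\ell(a)\, m_\ell \rceil$ times with $m_\ell = \frac{2d}{\varepsilon_\ell^2} \log\left(\frac{k\ell(\ell+1)}{\delta}\right)$):
$$
T_\ell = \sum_{a \in \operatorname{Supp}(\pi_\ell)} \left\lceil \pi_\ell(a)\, m_\ell \right\rceil \le m_\ell \sum_{a} \pi_\ell(a) + \left|\operatorname{Supp}(\pi_\ell)\right| \le \frac{2d}{\varepsilon_\ell^2} \log\left(\frac{k\ell(\ell+1)}{\delta}\right) + \frac{d(d+1)}{2},
$$
where the support-size bound $|\operatorname{Supp}(\pi_\ell)| \le d(d+1)/2$ is exactly the one supplied by Theorem 21.1.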
(b) Let $a^* \in \operatorname{argmax}_{a \in \mathcal{A}} \langle a, \theta_* \rangle$ be the optimal arm and use Theorem 21.1 to show that
$$
\mathbb{P}\left(\text{there exists a phase } \ell \text{ such that } a^* \notin \mathcal{A}_\ell\right) \le \frac{\delta}{k}.
$$
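A sketch of the intended union bound, assuming (as the $\log(k\ell(\ell+1)/\delta)$ term in part (a) suggests) that the confidence guarantee of Theorem 21.1 is invoked in phase $\ell$ with failure probability $\delta/(k\ell(\ell+1))$, and that on its success the elimination rule cannot discard $a^*$:
$$
\mathbb{P}\left(\exists\, \ell : a^* \notin \mathcal{A}_\ell\right) \le \sum_{\ell=1}^{\infty} \frac{\delta}{k\,\ell(\ell+1)} = \frac{\delta}{k} \sum_{\ell=1}^{\infty} \left(\frac{1}{\ell} - \frac{1}{\ell+1}\right) = \frac{\delta}{k}.
$$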
(c) For action $a$ define $\ell_a = \min\{\ell : \Delta_a < 2\varepsilon_\ell\}$ to be the first phase where the suboptimality gap of arm $a$ is smaller than $2\varepsilon_\ell$. Show that
$$
\mathbb{P}\left(a \in \mathcal{A}_{\ell_a}\right) \le \frac{\delta}{k}.
$$
(d) Show that with probability at least $1 - \delta$ the regret is bounded by
$$
R_n \le C \sqrt{dn \log\left(\frac{k \log(n)}{\delta}\right)},
$$
where $C > 0$ is a universal constant.
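A sketch of the summation behind this bound, assuming $\varepsilon_\ell = 2^{-\ell}$ and that on the high-probability event from parts (b) and (c) every arm still played in phase $\ell$ has suboptimality gap $O(\varepsilon_\ell)$: the regret incurred in phase $\ell$ is at most $O(\varepsilon_\ell T_\ell) = O\left(\frac{d}{\varepsilon_\ell} \log\frac{k\ell(\ell+1)}{\delta}\right)$ by part (a), and summing this geometric series over phases $\ell \le L$ leaves the last term dominant:
$$
R_n \lesssim \frac{d}{\varepsilon_L} \log\left(\frac{kL(L+1)}{\delta}\right).
$$
Since phase $L$ must fit within the horizon, part (a) with $T_L \le n$ forces $1/\varepsilon_L \lesssim \sqrt{n/(d \log(\cdot))}$, and substituting this, with $L \le \log_2(n)$ accounting for the $\log(n)$ inside the logarithm, recovers the stated $\sqrt{dn \log(k\log(n)/\delta)}$ rate.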
(e) Show that this implies Theorem 22.1 for the given choice of $\delta$.
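Parts (a)–(e) analyse phased elimination with G-optimal exploration. The following is a minimal simulation sketch of that scheme, not a reference implementation from the book: it assumes Gaussian noise, approximates the G-optimal design by a Frank–Wolfe (Fedorov–Wynn) iteration on the D-optimal objective (equivalent by the Kiefer–Wolfowitz theorem), and all function names and parameter choices are illustrative.

import numpy as np

def g_optimal_design(arms, iters=500):
    """Approximate the G-optimal design over the rows of `arms` ((k, d) array)
    via Fedorov-Wynn / Frank-Wolfe steps on the D-optimal objective, which by
    the Kiefer-Wolfowitz theorem has the same maximisers."""
    k, d = arms.shape
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        V = arms.T @ (arms * pi[:, None])                # V(pi) = sum_a pi(a) a a^T
        g = np.einsum('ij,jk,ik->i', arms, np.linalg.pinv(V), arms)
        j = int(np.argmax(g))                            # most uncertain arm
        if g[j] <= d + 1e-9:                             # Kiefer-Wolfowitz optimality
            break
        gamma = (g[j] / d - 1.0) / (g[j] - 1.0)          # exact line-search step
        pi *= 1.0 - gamma
        pi[j] += gamma
    return pi

def phased_elimination(arms, theta, n, delta=0.1, rng=None):
    """Simulate phased elimination with G-optimal exploration on a fixed
    finite arm set; returns the cumulative pseudo-regret over n rounds."""
    if rng is None:
        rng = np.random.default_rng(0)
    k, d = arms.shape
    active = np.arange(k)
    best = float(np.max(arms @ theta))
    t, ell, regret = 0, 1, 0.0
    while t < n:
        eps = 2.0 ** (-ell)
        A = arms[active]
        pi = g_optimal_design(A)
        # Pull counts as in part (a): ceil(pi(a) * m) with
        # m = (2 d / eps^2) * log(k * ell * (ell + 1) / delta).
        m = 2.0 * d / eps**2 * np.log(k * ell * (ell + 1) / delta)
        counts = np.where(pi > 1e-6, np.ceil(pi * m), 0).astype(int)
        V, b = np.zeros((d, d)), np.zeros(d)
        for a, c in zip(A, counts):
            c = min(int(c), n - t)                       # respect the horizon
            if c == 0:
                continue
            rewards = a @ theta + rng.normal(size=c)     # 1-subgaussian noise
            V += c * np.outer(a, a)
            b += rewards.sum() * a
            regret += c * (best - a @ theta)
            t += c
        theta_hat = np.linalg.pinv(V) @ b                # least-squares estimate
        est = A @ theta_hat
        # Elimination rule: keep arms within 2*eps of the empirical leader.
        active = active[est >= est.max() - 2.0 * eps]
        ell += 1
    return regret

# Illustrative usage on random unit-norm arms.
rng = np.random.default_rng(1)
arms = rng.normal(size=(20, 5))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
theta = rng.normal(size=5)
print(phased_elimination(arms, theta, n=100_000, rng=rng))

Plotting the returned regret against $n$ for several horizons should exhibit the $\sqrt{n}$ growth of part (d), which is a quick sanity check on both the sketch and the analysis.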