
22.2 Bibliographic remarks


The algorithms achieving Eq. (22.1) for changing action sets are SupLinRel [Auer, 2002] and BaseLinUCB [Chu et al., 2011]. Both introduce phases to decouple the dependence of the design on the outcomes. Unfortunately, the analysis of these algorithms is long and technical, which prevented us from presenting the ideas here. These algorithms are also less practical than LinUCB (Chapter 19) or Thompson sampling (Chapter 36). Of course, this does not diminish the theoretical breakthrough. Phased elimination algorithms have appeared in many places, but the most similar to the algorithm presented here is the work on spectral bandits by Valko et al. [2014]. None of these works used the Kiefer–Wolfowitz theorem. The idea of using an optimal design for exploration is taken from the literature on adversarial linear bandits, where John’s ellipsoid has been used to define exploration policies [Bubeck et al., 2012]. For more details on adversarial linear bandits, read on to Part VI.

22.3 Exercises


22.1 In this exercise you will prove Theorem 22.1.
(a) Use Theorem 21.1 to show that the length of the $\ell$th phase is bounded by
$$
T_\ell \le \frac{2d}{\varepsilon_\ell^2} \log\left(\frac{k\ell(\ell+1)}{\delta}\right) + \frac{d(d+1)}{2}.
$$
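One way to see where the two terms come from (a sketch, assuming the phase structure of the algorithm in this chapter: in phase $\ell$ each arm $a$ in the support of the G-optimal design $\pi_\ell$ is played $\lceil \pi_\ell(a)\, m_\ell \rceil$ times with $m_\ell = \frac{2d}{\varepsilon_\ell^2} \log\left(\frac{k\ell(\ell+1)}{\delta}\right)$):
$$
T_\ell = \sum_{a \in \operatorname{Supp}(\pi_\ell)} \left\lceil \pi_\ell(a)\, m_\ell \right\rceil \le m_\ell \sum_{a} \pi_\ell(a) + \left|\operatorname{Supp}(\pi_\ell)\right| \le \frac{2d}{\varepsilon_\ell^2} \log\left(\frac{k\ell(\ell+1)}{\delta}\right) + \frac{d(d+1)}{2},
$$
where the support-size bound $|\operatorname{Supp}(\pi_\ell)| \le d(d+1)/2$ is exactly the one supplied by Theorem 21.1.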
(b) Let $a^* \in \operatorname{argmax}_{a \in \mathcal{A}} \langle a, \theta_* \rangle$ be the optimal arm and use Theorem 21.1 to show that
$$
\mathbb{P}\left(\text{there exists a phase } \ell \text{ such that } a^* \notin \mathcal{A}_\ell\right) \le \frac{\delta}{k}.
$$
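A sketch of the intended union bound, assuming (as the $\log(k\ell(\ell+1)/\delta)$ term in part (a) suggests) that the confidence guarantee of Theorem 21.1 is invoked in phase $\ell$ with failure probability $\delta/(k\ell(\ell+1))$, and that on its success the elimination rule cannot discard $a^*$:
$$
\mathbb{P}\left(\exists\, \ell : a^* \notin \mathcal{A}_\ell\right) \le \sum_{\ell=1}^{\infty} \frac{\delta}{k\,\ell(\ell+1)} = \frac{\delta}{k} \sum_{\ell=1}^{\infty} \left(\frac{1}{\ell} - \frac{1}{\ell+1}\right) = \frac{\delta}{k}.
$$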
(c) For action $a$ define $\ell_a = \min\{\ell : \Delta_a < 2\varepsilon_\ell\}$ to be the first phase where the suboptimality gap of arm $a$ is smaller than $2\varepsilon_\ell$. Show that
$$
\mathbb{P}\left(a \in \mathcal{A}_{\ell_a}\right) \le \frac{\delta}{k}.
$$
(d) Show that with probability at least $1 - \delta$ the regret is bounded by
$$
R_n \le C \sqrt{dn \log\left(\frac{k \log(n)}{\delta}\right)},
$$
where $C > 0$ is a universal constant.
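A sketch of the summation behind this bound, assuming $\varepsilon_\ell = 2^{-\ell}$ and that on the high-probability event from parts (b) and (c) every arm still played in phase $\ell$ has suboptimality gap $O(\varepsilon_\ell)$: the regret incurred in phase $\ell$ is at most $O(\varepsilon_\ell T_\ell) = O\left(\frac{d}{\varepsilon_\ell} \log\frac{k\ell(\ell+1)}{\delta}\right)$ by part (a), and summing this geometric series over phases $\ell \le L$ leaves the last term dominant:
$$
R_n \lesssim \frac{d}{\varepsilon_L} \log\left(\frac{kL(L+1)}{\delta}\right).
$$
Since phase $L$ must fit within the horizon, part (a) with $T_L \le n$ forces $1/\varepsilon_L \lesssim \sqrt{n/(d \log(\cdot))}$, and substituting this, with $L \le \log_2(n)$ accounting for the $\log(n)$ inside the logarithm, recovers the stated $\sqrt{dn \log(k\log(n)/\delta)}$ rate.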
(e) Show that this implies Theorem 22.1 for the given choice of $\delta$.
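Parts (a)–(e) analyse phased elimination with G-optimal exploration. The following is a minimal simulation sketch of that scheme, not a reference implementation from the book: it assumes Gaussian noise, approximates the G-optimal design by a Frank–Wolfe (Fedorov–Wynn) iteration on the D-optimal objective (equivalent by the Kiefer–Wolfowitz theorem), and all function names and parameter choices are illustrative.

import numpy as np

def g_optimal_design(arms, iters=500):
    """Approximate the G-optimal design over the rows of `arms` ((k, d) array)
    via Fedorov-Wynn / Frank-Wolfe steps on the D-optimal objective, which by
    the Kiefer-Wolfowitz theorem has the same maximisers."""
    k, d = arms.shape
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        V = arms.T @ (arms * pi[:, None])                # V(pi) = sum_a pi(a) a a^T
        g = np.einsum('ij,jk,ik->i', arms, np.linalg.pinv(V), arms)
        j = int(np.argmax(g))                            # most uncertain arm
        if g[j] <= d + 1e-9:                             # Kiefer-Wolfowitz optimality
            break
        gamma = (g[j] / d - 1.0) / (g[j] - 1.0)          # exact line-search step
        pi *= 1.0 - gamma
        pi[j] += gamma
    return pi

def phased_elimination(arms, theta, n, delta=0.1, rng=None):
    """Simulate phased elimination with G-optimal exploration on a fixed
    finite arm set; returns the cumulative pseudo-regret over n rounds."""
    if rng is None:
        rng = np.random.default_rng(0)
    k, d = arms.shape
    active = np.arange(k)
    best = float(np.max(arms @ theta))
    t, ell, regret = 0, 1, 0.0
    while t < n:
        eps = 2.0 ** (-ell)
        A = arms[active]
        pi = g_optimal_design(A)
        # Pull counts as in part (a): ceil(pi(a) * m) with
        # m = (2 d / eps^2) * log(k * ell * (ell + 1) / delta).
        m = 2.0 * d / eps**2 * np.log(k * ell * (ell + 1) / delta)
        counts = np.where(pi > 1e-6, np.ceil(pi * m), 0).astype(int)
        V, b = np.zeros((d, d)), np.zeros(d)
        for a, c in zip(A, counts):
            c = min(int(c), n - t)                       # respect the horizon
            if c == 0:
                continue
            rewards = a @ theta + rng.normal(size=c)     # 1-subgaussian noise
            V += c * np.outer(a, a)
            b += rewards.sum() * a
            regret += c * (best - a @ theta)
            t += c
        theta_hat = np.linalg.pinv(V) @ b                # least-squares estimate
        est = A @ theta_hat
        # Elimination rule: keep arms within 2*eps of the empirical leader.
        active = active[est >= est.max() - 2.0 * eps]
        ell += 1
    return regret

# Illustrative usage on random unit-norm arms.
rng = np.random.default_rng(1)
arms = rng.normal(size=(20, 5))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
theta = rng.normal(size=5)
print(phased_elimination(arms, theta, n=100_000, rng=rng))

Plotting the returned regret against $n$ for several horizons should exhibit the $\sqrt{n}$ growth of part (d), which is a quick sanity check on both the sketch and the analysis.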