Bandit Algorithms

This material will be published by Cambridge University Press as Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version is free to view and download for personal use only. Not for re-distribution, sale, or use in derivative works.©Tor Lattimore and Csaba Szepesvari 2017. The latest version is available athttp://banditalgs.com. Feedback on any aspect is very welcome: [email protected]

21 Optimal Design for Least Squares Estimators

In the preceding chapter we showed how to construct confidence intervals for least squares estimators when the design is chosen sequentially. We now study the problem of choosing actions for which the resulting confidence sets are small. This plays an important role in the analysis of stochastic linear bandits with finitely many arms (Chapter 22) and adversarial linear bandits (Part VI).

21.1 The Kiefer–Wolfowitz theorem

Letη 1 ,...,ηnbe a sequence of independent 1-subgaussian random variables and a 1 ,...,an∈Rdbe a fixed sequence withspan(a 1 ,...,an) =RdandX 1 ,...,Xn be given byXt=〈at,θ∗〉+ηtfor someθ∗∈Rd. The least squares estimator of θ∗isθˆ=V−^1

∑n t=1atXtwithV=

∑n t=1ata

>t.

The least squares estimator used here is not regularized. This eases the calculations and the lack of regularization will not harm us in future applications.

Eq. (20.2) from Chapter 20 shows that for anya∈Rdandδ∈(0,1),

P

(

〈θˆ−θ∗,a〉≥

√

2 ‖a‖^2 V− 1 log

(

1

δ

))

≤δ. (21.1)

For our purposes, botha 1 ,...,anandawill be actions from some (possibly infinite) setA⊂Rdand the question of interest is finding the shortest sequence of exploratory actionsa 1 ,...,ansuch that the confidence bound in the previous display is smaller than some threshold for alla∈ A. To solve this exactly is likely an intractable exercise in integer programming. Finding an accurate approximation turns out to be efficient for a broad class of action sets, however. Letπ:A→[0,1] be a distribution onAso that

∑

a∈Aπ(a) = 1 andV(π)∈R

d×d andg(π)∈Rbe given by

V(π) =

∑

a∈A

π(a)aa>, g(π) = max a∈A

‖a‖^2 V(π)− 1. (21.2)

Bandit Algorithms

21 Optimal Design for Least Squares Estimators

21.1 The Kiefer–Wolfowitz theorem

P

(

√

(

1

))

∑

∑

Get our desktop app

Company

Features

Documentation

Resources