Bandit Algorithms


25 Asymptotic Lower Bounds for Stochastic Linear Bandits


The lower bounds in the previous chapter were derived by analyzing the worst
case for specific action sets and/or constraints on the unknown parameter. In
this chapter we focus on the asymptotics and aim to understand the influence of
the action set on the regret. We start with a lower bound and argue that it
can be achieved. We finish by arguing that optimistic algorithms (and
Thompson sampling) can perform arbitrarily worse than what is achievable
by non-optimistic algorithms.

25.1 An asymptotic lower bound for fixed action sets


We assume that $\mathcal{A} \subset \mathbb{R}^d$ is finite with $|\mathcal{A}| = k$ and that the reward is
$X_t = \langle A_t, \theta \rangle + \eta_t$, where $\theta \in \mathbb{R}^d$ and $(\eta_t)_{t=1}^{\infty}$ is a sequence of independent standard
Gaussian random variables. Of course, the regret of a policy in this setting is

$$R_n(\mathcal{A}, \theta) = \mathbb{E}_\theta\left[\sum_{t=1}^n \Delta_{A_t}\right], \qquad \Delta_a = \max_{a' \in \mathcal{A}} \langle a' - a, \theta \rangle,$$

where the dependence on the policy is omitted for readability and $\mathbb{E}_\theta[\cdot]$ is the
expectation with respect to the measure on outcomes induced by the interaction
of the policy and the linear bandit determined by $\theta$. Like the asymptotic lower
bounds in the classical finite-armed case (Chapter 16), the results of this chapter
are proven only for consistent policies. Recall that a policy is consistent for a
class of bandits $\mathcal{E}$ if its regret is subpolynomial on every bandit in that class. Here
this means that

$$R_n(\mathcal{A}, \theta) = o(n^p) \quad \text{for all } p > 0 \text{ and } \theta \in \mathbb{R}^d. \tag{25.1}$$

For example, any policy with $R_n(\mathcal{A}, \theta) = O(\log n)$ for every $\theta$ is consistent,
since $\log n = o(n^p)$ for all $p > 0$.
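To make the setup concrete, here is a minimal simulation sketch in Python; the action set, parameter, and uniform policy below are illustrative assumptions, not from the text. It draws rewards $X_t = \langle A_t, \theta \rangle + \eta_t$ with standard Gaussian noise and accumulates the expected regret through the gaps $\Delta_a$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance (not from the text): k = 3 actions spanning R^2.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.9, 0.4]])       # rows are the actions a in the set A
theta = np.array([1.0, 0.2])     # the unknown parameter

means = A @ theta                # <a, theta> for each action
gaps = means.max() - means       # suboptimality gaps Delta_a

n = 10_000
regret = 0.0
for t in range(n):
    a = rng.integers(len(A))                 # placeholder policy: uniform exploration
    X_t = means[a] + rng.standard_normal()   # reward X_t = <A_t, theta> + eta_t
    regret += gaps[a]                        # R_n accumulates Delta_{A_t} in expectation

print(f"gaps = {gaps}, regret of the uniform policy after n = {n} rounds: {regret:.1f}")
```

Note that this uniform policy incurs regret linear in $n$ and hence is not consistent; the theorem below applies only to policies satisfying (25.1).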

The main objective of the chapter is to prove the following theorem on the
behavior of any consistent policy and discuss the implications.

Theorem 25.1. Assume that $\mathcal{A} \subset \mathbb{R}^d$ is finite and spans $\mathbb{R}^d$, and suppose a
policy is consistent (satisfies Eq. (25.1)). Let $\theta \in \mathbb{R}^d$ be any parameter such
that there is a unique optimal action, and let $\bar{G}_n = \mathbb{E}_\theta\left[\sum_{t=1}^n A_t A_t^\top\right]$. Then
$\liminf_{n \to \infty} \lambda_{\min}(\bar{G}_n)/\log(n) > 0$ and, for every suboptimal action $a \in \mathcal{A}$,

$$\limsup_{n \to \infty} \log(n)\, \|a\|^2_{\bar{G}_n^{-1}} \leq \frac{\Delta_a^2}{2}.$$
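To illustrate the quantity appearing in the theorem, the following sketch continues the hypothetical instance above (the pull counts are made-up for illustration, not derived from any policy). It forms $\bar{G}_n$ for a fixed expected allocation and evaluates $\log(n)\,\|a\|^2_{\bar{G}_n^{-1}}$ for each action, the quantity the theorem requires to be asymptotically at most $\Delta_a^2/2$ under a consistent policy:

```python
import numpy as np

# Hypothetical instance and expected pull counts (illustrative only).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.9, 0.4]])
counts = np.array([9900, 60, 40])   # expected number of pulls of each action
n = counts.sum()

# Expected design matrix: G_bar = sum_a counts[a] * a a^T = E[sum_t A_t A_t^T]
G_bar = (A.T * counts) @ A
G_inv = np.linalg.inv(G_bar)        # A spans R^2, so G_bar is invertible

for i, a in enumerate(A):
    norm_sq = a @ G_inv @ a         # ||a||^2 in the G_bar^{-1} norm
    print(f"action {i}: log(n) * ||a||^2_(G^-1) = {np.log(n) * norm_sq:.4f}")
```

Allocations that pull a suboptimal action too rarely make $\|a\|^2_{\bar{G}_n^{-1}}$ large and violate the bound; in this sense the theorem forces a consistent policy to keep every direction of $\mathbb{R}^d$ sufficiently well explored.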