Bandit Algorithms

This material will be published by Cambridge University Press as Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version is free to view and download for personal use only. Not for re-distribution, sale, or use in derivative works.©Tor Lattimore and Csaba Szepesvari 2017. The latest version is available athttp://banditalgs.com. Feedback on any aspect is very welcome: [email protected]

20 Confidence Bounds for Least Squares Estimators

In the last chapter we derived a regret bound for a version of the upper confidence bound algorithm that depended on a particular kind of confidence set. The purpose of this chapter is to justify these choices. Suppose a bandit algorithm has chosen actionsA 1 ,...,At∈Rdand received the rewardsX 1 ,...,XtwithXs=〈As,θ∗〉+ηswhereηsis zero-mean noise. Recall from the previous chapter that the penalized least-squares estimate ofθ∗ is the minimizer of

Lt(θ) =

∑t

s=1

(Xs−〈As,θ〉)^2 +λ‖θ‖^22 ,

whereλ≥0 is the penalty factor. This is minimized by

θˆt=Vt(λ)−^1

∑t

s=1

XsAs withVt(λ) =λI+

∑t

s=1

AsA>s. (20.1)

It is convenient for the remainder to abbreviateVt=Vt(0). Designing confidence sets forθ∗ when A 1 ,...,At have been chosen by a bandit algorithm is a surprisingly delicate matter. The difficulty stems from the fact that the actions are neither fixed nor independent, but are intricately correlated via the rewards. We spend the first section of this chapter building intuition by making some simplifying assumptions. Eager readers may skip directly to Section 20.1. For the rest of this section we assume that:

1 No regularization:λ= 0 andVtis invertible. 2 Independent subgaussian noise:(ηs)sare independent and 1-subgaussian. 3 Fixed design:A 1 ,...,Atare deterministically chosen without the knowledge of X 1 ,...,Xt.

None of these assumptions is plausible in the bandit setting, but the simplification eases the analysis and provides insight.

The assumption thatλ= 0 means that in this sectionθˆtis just the ordinary least squares estimator ofθ. The requirement thatVtbe nonsingular means that (As)ts=1must spanRdand sotmust be at leastd.

Bandit Algorithms

20 Confidence Bounds for Least Squares Estimators

Get our desktop app

Company

Features

Documentation

Resources