Bandit Algorithms

(Jeff_L) #1
This material will be published by Cambridge University Press as Bandit
Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version
is free to view and download for personal use only. Not for re-distribution, sale,
or use in derivative works.©Tor Lattimore and Csaba Szepesvari 2017.
The latest version is available athttp://banditalgs.com. Feedback on any
aspect is very welcome: [email protected]

20 Confidence Bounds for Least Squares Estimators


In the last chapter we derived a regret bound for a version of the upper confidence
bound algorithm that depended on a particular kind of confidence set. The
purpose of this chapter is to justify these choices.
Suppose a bandit algorithm has chosen actionsA 1 ,...,At∈Rdand received
the rewardsX 1 ,...,XtwithXs=〈As,θ∗〉+ηswhereηsis zero-mean noise.
Recall from the previous chapter that the penalized least-squares estimate ofθ∗
is the minimizer of

Lt(θ) =

∑t

s=1

(Xs−〈As,θ〉)^2 +λ‖θ‖^22 ,

whereλ≥0 is the penalty factor. This is minimized by

θˆt=Vt(λ)−^1

∑t

s=1

XsAs withVt(λ) =λI+

∑t

s=1

AsA>s. (20.1)

It is convenient for the remainder to abbreviateVt=Vt(0). Designing confidence
sets forθ∗ when A 1 ,...,At have been chosen by a bandit algorithm is a
surprisingly delicate matter. The difficulty stems from the fact that the actions
are neither fixed nor independent, but are intricately correlated via the rewards.
We spend the first section of this chapter building intuition by making some
simplifying assumptions. Eager readers may skip directly to Section 20.1. For the
rest of this section we assume that:

1 No regularization:λ= 0 andVtis invertible.
2 Independent subgaussian noise:(ηs)sare independent and 1-subgaussian.
3 Fixed design:A 1 ,...,Atare deterministically chosen without the knowledge of
X 1 ,...,Xt.

None of these assumptions is plausible in the bandit setting, but the simplification
eases the analysis and provides insight.

The assumption thatλ= 0 means that in this sectionθˆtis just the ordinary
least squares estimator ofθ. The requirement thatVtbe nonsingular means
that (As)ts=1must spanRdand sotmust be at leastd.
Free download pdf