Bandit Algorithms

19.2 Stochastic linear bandits 230

(A 1 ,X 1 ,...,At− 1 ,Xt− 1 ) that contains the unknown parameter vectorθ∗with
high probability. Leaving the details of how the confidence set is constructed
aside for a moment and assuming that the confidence set indeed containsθ∗, for
any given actiona∈Rd, let
UCBt(a) = max
θ∈Ct

〈a,θ〉 (19.2)

be an upper bound on the mean payoff〈a,θ∗〉ofa. The UCB algorithm that uses
the confidence setCtat timetthen selects
At= argmaxa∈AtUCBt(a). (19.3)
UCB applied to linear bandits is known by various names, including LinRel
(LinearReinforcementLearning), LinUCB and OFUL (Optimism in theFace
ofUncertainty forLinear bandits). We will not be very dogmatic of this name
and call algorithms with the above construct instances of LinUCB.
The main question is how to choose the confidence setCt⊂Rd. As usual, there
are conflicting desirable properties:

(a)Ctshould containθ∗with high probability.
(b)Ctshould be as small as possible.

At first sight it is not at all obvious whatCtshould look like. After all, it is a
subset ofRd, not just an interval like the confidence intervals about the empirical
estimate of the mean reward for a single action that we saw in the previous
chapters. While we specify the analytic form of a possible construction forCthere,
there are some details in choosing some of the parameters in this construction. As
there are both delicate and important, we dedicate the next chapter to discussing
them.
Following the idea for UCB, we need an analogue for the empirical estimate
of the unknown quantity, which in this case isθ∗. There are several principles
one might use for deriving such an estimate. For now we use theregularized
least-squares estimator, which is

θˆt= argminθ∈Rd

(t ∑

s=1

(Xs−〈As,θ〉)^2 +λ‖θ‖^22

)

, (19.4)

whereλ≥0 is called thepenalty factor. Choosingλ >0 helps because it
ensures that the loss function has a unique minimizer even whenA 1 ,...,Atdo
not spanRd, which simplifies the math. The solution to Eq. (19.4) is obtained
easily by differentiation and is

θˆt=Vt−^1

∑t

s=1

AsXs, (19.5)

where (Vt)tared×dmatrices given by

V 0 =λI and Vt=V 0 +

∑t

s=1

AsA>s.

Bandit Algorithms

)

, (19.4)

Get our desktop app

Company

Features

Documentation

Resources