23.2 Elimination on the hypercube
So learning the optimal action amounts to learning the sign of $\theta_i$ for each dimension. A disadvantage of this structure is that in the worst case the sign of each $\theta_i$ must be learned independently, which in Chapter 24 we show leads to a worst-case regret of $R_n = \Omega(d\sqrt{n})$. On the positive side, the separability means that $\theta_i$ can be estimated in each dimension independently while paying absolutely no price for this experimentation when $\theta_i = 0$. It turns out that this allows us to design a policy for which $R_n = O(\|\theta\|_0 \sqrt{n})$, even without knowing the value of $\|\theta\|_0$.
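To recall why only the signs matter, note that the mean reward decomposes coordinatewise over the hypercube, so the best achievable reward in a round is
\[
\max_{a \in \mathcal A} \langle a, \theta \rangle = \sum_{i=1}^d |\theta_i|\,,
\]
attained by any action with $a_i = \operatorname{sign}(\theta_i)$ in every coordinate where $\theta_i \neq 0$; in coordinates where $\theta_i = 0$ the choice of $a_i$ has no effect on the mean reward.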
Let $\mathcal G_t = \sigma(A_1, X_1, \ldots, A_t, X_t)$ be the $\sigma$-algebra containing the information observed up to the end of round $t$ (this differs from $\mathcal F_t$, which also includes information about the next action chosen). Now suppose that $(A_{ti})_{i=1}^d$ are chosen to be conditionally independent given $\mathcal G_{t-1}$ and further assume for some specific $i \in [d]$ that $A_{ti}$ is sampled from a Rademacher distribution so that $\mathbb P(A_{ti} = 1 \mid \mathcal G_{t-1}) = \mathbb P(A_{ti} = -1 \mid \mathcal G_{t-1}) = 1/2$. Then
\begin{align*}
\mathbb E[A_{ti} X_t \mid \mathcal G_{t-1}]
&= \mathbb E\left[A_{ti}\left(\sum_{j=1}^d A_{tj}\theta_j + \eta_t\right) \,\Bigg|\, \mathcal G_{t-1}\right] \\
&= \theta_i\,\mathbb E[A_{ti}^2 \mid \mathcal G_{t-1}] + \sum_{j \neq i} \theta_j\,\mathbb E[A_{tj}A_{ti} \mid \mathcal G_{t-1}] + \mathbb E[A_{ti}\eta_t \mid \mathcal G_{t-1}] \\
&= \theta_i\,,
\end{align*}
where the first equality follows from the definition $X_t = \langle A_t, \theta\rangle + \eta_t$, the second from linearity of expectation and the third from the conditional independence of $(A_{ti})_i$ and the facts that $\mathbb E[A_{ti} \mid \mathcal G_{t-1}] = 0$, $\mathbb E[A_{ti}^2 \mid \mathcal G_{t-1}] = 1$ and $\mathbb E[A_{ti}\eta_t \mid \mathcal G_{t-1}] = 0$. This looks quite promising, but we should also check the variance. Using the assumptions that $\mathbb E[\eta_t] = 0$, $\mathbb E[\eta_t^2] \le 1$ and $|\langle a, \theta\rangle| \le 1$ for all actions $a$, we have
\begin{equation}
\mathbb V[A_{ti} X_t \mid \mathcal G_{t-1}] = \mathbb E[A_{ti}^2 X_t^2 \mid \mathcal G_{t-1}] - \theta_i^2 = \mathbb E[(\langle A_t, \theta\rangle + \eta_t)^2 \mid \mathcal G_{t-1}] - \theta_i^2 \le 2\,.
\tag{23.2}
\end{equation}
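Both the identity $\mathbb E[A_{ti}X_t \mid \mathcal G_{t-1}] = \theta_i$ and the variance bound in Eq.\ (23.2) are easy to check numerically. The short simulation below is only a sanity check under assumed choices (a particular $\theta$ with $\|\theta\|_1 \le 1$, standard Gaussian noise and an arbitrary sample size); it compares the empirical mean and variance of $A_{ti}X_t$ with $\theta_i$ and with the bound of $2$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.4, 0.0, -0.3])   # example parameter with ||theta||_1 <= 1
d, n = len(theta), 100_000
i = 0                                # coordinate whose estimate we examine

A = rng.choice([-1.0, 1.0], size=(n, d))  # Rademacher actions, independent across coordinates
eta = rng.normal(0.0, 1.0, size=n)        # noise with mean 0 and variance <= 1
X = A @ theta + eta                       # rewards X_t = <A_t, theta> + eta_t
Z = A[:, i] * X                           # increments A_ti * X_t of the estimator

print("empirical mean of A_ti X_t:", Z.mean(), "  theta_i:", theta[i])
print("empirical variance:        ", Z.var(),  "  bound:  ", 2.0)
\end{verbatim}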
And now we have cause for celebration. The value of $\theta_i$ can be estimated by choosing $A_{ti}$ to be a Rademacher random variable independent of the choices in other dimensions. All the policy does is treat all dimensions independently. For a particular dimension (say $i$) it explores by choosing $A_{ti} \in \{-1, 1\}$ uniformly at random until its estimate is sufficiently accurate to commit to either $A_{ti} = 1$ or $A_{ti} = -1$ for all future rounds. How long this takes depends on $|\theta_i|$, but note that if $|\theta_i|$ is small, then the price of exploring is also limited. The policy that results from this idea is called Selective Explore-Then-Commit (Algorithm 13, SETC).
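To make the idea concrete, the following Python sketch implements the per-dimension explore-then-commit principle just described. It is only a sketch, not a reproduction of Algorithm 13: the function name, the Gaussian noise, the confidence level $\delta = 1/n$ and the Hoeffding-style confidence width built from the variance bound (23.2) are illustrative choices; the precise stopping rule and constants of SETC are those given in the algorithm itself.
\begin{verbatim}
import numpy as np

def setc_sketch(theta, n, delta=None, rng=None):
    """Illustrative per-dimension explore-then-commit on the hypercube.

    Each coordinate is explored with Rademacher samples until a confidence
    interval around its running estimate excludes zero, after which the
    coordinate is fixed to the sign of the estimate."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(theta)
    delta = 1.0 / n if delta is None else delta   # illustrative confidence level
    est = np.zeros(d)         # running estimates of theta_i
    counts = np.zeros(d)      # exploration rounds used per coordinate
    committed = np.zeros(d)   # 0 while exploring, otherwise -1 or +1
    total_reward = 0.0
    for _ in range(n):
        a = np.where(committed != 0, committed, rng.choice([-1.0, 1.0], size=d))
        x = a @ theta + rng.normal()              # reward <A_t, theta> + noise
        total_reward += x
        for i in range(d):
            if committed[i] == 0:
                counts[i] += 1
                est[i] += (a[i] * x - est[i]) / counts[i]
                # Hoeffding-style width using the variance proxy 2 from (23.2)
                width = np.sqrt(4.0 * np.log(1.0 / delta) / counts[i])
                if abs(est[i]) > width:
                    committed[i] = np.sign(est[i])
    return total_reward, committed

# Example run: sparse theta, so most coordinates cost little exploration.
reward, signs = setc_sketch(np.array([0.5, 0.0, -0.25, 0.0]), n=10_000)
\end{verbatim}
A coordinate with $\theta_i = 0$ never commits in this sketch, but since the choice of $A_{ti}$ in such a coordinate does not affect the mean reward, the continued exploration there costs nothing in expectation, mirroring the $\|\theta\|_0$ dependence in the bound below.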
Theorem 23.2. There exist universal constants $C, C' > 0$ such that the regret of SETC satisfies
\[
R_n \le 3\|\theta\|_1 + C \sum_{i : \theta_i \neq 0} \frac{\log(n)}{|\theta_i|}
\qquad \text{and} \qquad
R_n \le 3\|\theta\|_1 + C' \|\theta\|_0 \sqrt{n \log(n)}\,.
\]
By appealing to the central limit theorem and the variance calculation in