23.2 Elimination on the hypercube

So learning the optimal action amounts to learning the sign of $\theta_i$ for each dimension. A disadvantage of this structure is that in the worst case the sign of each $\theta_i$ must be learned independently, which in Chapter 24 we show leads to a worst-case regret of $R_n = \Omega(d\sqrt{n})$. On the positive side, the separability means that $\theta_i$ can be estimated in each dimension independently while paying absolutely no price for this experimentation when $\theta_i = 0$. It turns out that this allows us to design a policy for which $R_n = O(\|\theta\|_0 \sqrt{n})$, even without knowing the value of $\|\theta\|_0$.
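To make the setup concrete, here is a minimal simulation sketch of this reward model. The parameter vector, the Gaussian noise (one particular 1-subgaussian choice) and the function names are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5
# Hypothetical sparse parameter: two nonzero coordinates, so ||theta||_0 = 2.
theta = np.array([0.4, 0.0, -0.3, 0.0, 0.0])

def pull(action, rng):
    """Reward for an action in {-1, +1}^d: X_t = <A_t, theta> + eta_t,
    with standard Gaussian noise as one concrete 1-subgaussian choice."""
    return float(action @ theta) + rng.normal()

# The optimal action matches the sign of theta in every dimension; coordinates
# with theta_i = 0 contribute nothing, whichever sign is played there.
a_star = np.where(theta >= 0, 1.0, -1.0)
print("optimal mean reward ||theta||_1 =", a_star @ theta)
print("one noisy reward:", pull(a_star, rng))
```
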
Let $\mathcal{G}_t = \sigma(A_1, X_1, \ldots, A_t, X_t)$ be the $\sigma$-algebra containing the information available up to the end of round $t$ (this differs from $\mathcal{F}_t$, which also includes information about the action chosen). In particular, $\mathcal{G}_{t-1}$ holds everything observed before the action in round $t$ is played. Now suppose that $(A_{ti})_{i=1}^d$ are chosen to be conditionally independent given $\mathcal{G}_{t-1}$, and further assume for some specific $i \in [d]$ that $A_{ti}$ is sampled from a Rademacher distribution, so that $\mathbb{P}(A_{ti} = 1 \mid \mathcal{G}_{t-1}) = \mathbb{P}(A_{ti} = -1 \mid \mathcal{G}_{t-1}) = 1/2$. Then


\[
\begin{aligned}
\mathbb{E}[A_{ti} X_t \mid \mathcal{G}_{t-1}]
&= \mathbb{E}\Bigg[A_{ti}\Bigg(\sum_{j=1}^d A_{tj}\theta_j + \eta_t\Bigg) \,\Bigg|\, \mathcal{G}_{t-1}\Bigg] \\
&= \theta_i\,\mathbb{E}[A_{ti}^2 \mid \mathcal{G}_{t-1}] + \sum_{j \neq i} \theta_j\,\mathbb{E}[A_{tj} A_{ti} \mid \mathcal{G}_{t-1}] + \mathbb{E}[A_{ti}\eta_t \mid \mathcal{G}_{t-1}] \\
&= \theta_i\,,
\end{aligned}
\]

where the first equality follows from the definition $X_t = \langle A_t, \theta\rangle + \eta_t$, the second from linearity of expectation, and the third from the conditional independence of $(A_{ti})_i$ together with the facts that $\mathbb{E}[A_{ti} \mid \mathcal{G}_{t-1}] = 0$ and $\mathbb{E}[A_{ti}^2 \mid \mathcal{G}_{t-1}] = 1$. This looks quite promising, but we should also check the variance. Using the assumptions that $\mathbb{E}[\eta_t] = 0$, $\mathbb{E}[\eta_t^2] \le 1$ and $\langle a, \theta\rangle \le 1$ for all actions $a$, we have


\[
\mathbb{V}[A_{ti} X_t \mid \mathcal{G}_{t-1}]
= \mathbb{E}[A_{ti}^2 X_t^2 \mid \mathcal{G}_{t-1}] - \theta_i^2
= \mathbb{E}[(\langle A_t, \theta\rangle + \eta_t)^2 \mid \mathcal{G}_{t-1}] - \theta_i^2
\le 2\,. \tag{23.2}
\]
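As a quick sanity check of these calculations, the following Monte Carlo sketch (using the same hypothetical $\theta$ and Gaussian noise as above, which are assumptions for illustration) estimates the mean and variance of $A_{ti} X_t$ for a single coordinate.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples, i = 5, 200_000, 0
theta = np.array([0.4, 0.0, -0.3, 0.0, 0.0])      # hypothetical parameter

A = rng.choice([-1.0, 1.0], size=(n_samples, d))  # independent Rademacher coordinates
X = A @ theta + rng.normal(size=n_samples)        # X_t = <A_t, theta> + eta_t
Z = A[:, i] * X                                   # the per-round estimator A_ti * X_t

print("empirical mean:", Z.mean())       # should be close to theta_i = 0.4
print("empirical variance:", Z.var())    # should be at most 2, as in (23.2)
```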

And now we have cause for celebration. The value of $\theta_i$ can be estimated by choosing $A_{ti}$ to be a Rademacher random variable independent of the choices in other dimensions. All the policy does is treat all dimensions independently. For a particular dimension (say $i$) it explores by choosing $A_{ti} \in \{-1, 1\}$ uniformly at random until its estimate is sufficiently accurate to commit to either $A_{ti} = 1$ or $A_{ti} = -1$ for all future rounds. How long this takes depends on $|\theta_i|$, but note that if $|\theta_i|$ is small, then the price of exploring is also limited. The policy that results from this idea is called Selective Explore-Then-Commit (Algorithm 13, SETC).
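The following is a simplified sketch of this idea rather than the exact Algorithm 13: the confidence width $2\sqrt{\log(n)/T_i}$ (loosely motivated by the variance bound of 2 in (23.2)) and the Gaussian noise are illustrative assumptions.

```python
import numpy as np

def setc(theta, n, rng):
    """Simplified Selective Explore-Then-Commit sketch: each dimension plays
    Rademacher +/-1 until the confidence interval around its estimate of
    theta_i excludes zero, then commits to that sign for all future rounds."""
    d = len(theta)
    committed = np.zeros(d)          # 0 = still exploring, otherwise the committed sign
    sums = np.zeros(d)               # running sums of A_ti * X_t per dimension
    counts = np.zeros(d)             # number of exploration rounds per dimension
    best = np.sum(np.abs(theta))     # mean reward of the optimal action, ||theta||_1
    regret = 0.0
    for _ in range(n):
        exploring = committed == 0
        a = np.where(exploring, rng.choice([-1.0, 1.0], size=d), committed)
        x = float(a @ theta) + rng.normal()    # Gaussian noise for illustration
        regret += best - float(a @ theta)      # pseudo-regret accumulated this round
        sums[exploring] += a[exploring] * x
        counts[exploring] += 1
        est = np.divide(sums, counts, out=np.zeros(d), where=counts > 0)
        # Illustrative confidence width; commit once the interval excludes zero.
        width = 2.0 * np.sqrt(np.log(n) / np.maximum(counts, 1))
        decide = exploring & (np.abs(est) > width)
        committed[decide] = np.sign(est[decide])
    return committed, regret
```

Dimensions with $\theta_i = 0$ typically never commit, but since either sign is optimal there, the continued exploration costs nothing.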


Theorem 23.2. There exist universal constants $C, C' > 0$ such that the regret of SETC satisfies
\[
R_n \le 3\|\theta\|_1 + C \sum_{i : \theta_i \neq 0} \frac{\log(n)}{|\theta_i|}
\qquad \text{and} \qquad
R_n \le 3\|\theta\|_1 + C'\|\theta\|_0 \sqrt{n \log(n)}\,.
\]
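Continuing the sketch above (and assuming the `setc` function and hypothetical $\theta$ defined there), one can informally compare the pseudo-regret of a single run against the $\|\theta\|_0\sqrt{n\log(n)}$ scale from the theorem; this is a numerical illustration, not a verification of the bound.

```python
import numpy as np

# Assumes the setc() sketch defined above.
rng = np.random.default_rng(2)
theta = np.array([0.4, 0.0, -0.3, 0.0, 0.0])   # ||theta||_0 = 2
n = 10_000

committed, regret = setc(theta, n, rng)
print("committed signs:", committed)           # ideally matches sign(theta) on nonzero coordinates
print("pseudo-regret:", regret)
print("||theta||_0 * sqrt(n log n) scale:", 2 * np.sqrt(n * np.log(n)))
```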

By appealing to the central limit theorem and the variance calculation in