23.2 Elimination on the hypercube
So learning the optimal action amounts to learning the sign of $\theta_i$ for each dimension. A disadvantage of this structure is that in the worst case the sign of each $\theta_i$ must be learned independently, which in Chapter 24 we show leads to a worst-case regret of $R_n = \Omega(d\sqrt{n})$. On the positive side, the separability means that $\theta_i$ can be estimated in each dimension independently while paying absolutely no price for this experimentation when $\theta_i = 0$. It turns out that this allows us to design a policy for which $R_n = O(\|\theta\|_0 \sqrt{n})$, even without knowing the value of $\|\theta\|_0$.
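To recall why only the signs matter, note that the mean reward decomposes coordinatewise over the hypercube, so the best achievable reward in a round is
\[
\max_{a \in \mathcal A} \langle a, \theta \rangle = \sum_{i=1}^d |\theta_i|\,,
\]
attained by any action with $a_i = \operatorname{sign}(\theta_i)$ in every coordinate where $\theta_i \neq 0$; in coordinates where $\theta_i = 0$ the choice of $a_i$ has no effect on the mean reward.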
Let $\mathcal G_t = \sigma(A_1, X_1, \ldots, A_t, X_t)$ be the $\sigma$-algebra containing the information observed up to the end of round $t$ (this differs from $\mathcal F_t$, which also includes information about the next action chosen). Now suppose that $(A_{ti})_{i=1}^d$ are chosen to be conditionally independent given $\mathcal G_{t-1}$ and further assume for some specific $i \in [d]$ that $A_{ti}$ is sampled from a Rademacher distribution so that $\mathbb P(A_{ti} = 1 \mid \mathcal G_{t-1}) = \mathbb P(A_{ti} = -1 \mid \mathcal G_{t-1}) = 1/2$. Then
\begin{align*}
\mathbb E[A_{ti} X_t \mid \mathcal G_{t-1}]
&= \mathbb E\left[A_{ti}\left(\sum_{j=1}^d A_{tj}\theta_j + \eta_t\right) \,\Bigg|\, \mathcal G_{t-1}\right] \\
&= \theta_i\,\mathbb E[A_{ti}^2 \mid \mathcal G_{t-1}] + \sum_{j \neq i} \theta_j\,\mathbb E[A_{tj}A_{ti} \mid \mathcal G_{t-1}] + \mathbb E[A_{ti}\eta_t \mid \mathcal G_{t-1}] \\
&= \theta_i\,,
\end{align*}
where the first equality follows from the definition $X_t = \langle A_t, \theta\rangle + \eta_t$, the second from linearity of expectation and the third from the conditional independence of $(A_{ti})_i$ and the facts that $\mathbb E[A_{ti} \mid \mathcal G_{t-1}] = 0$, $\mathbb E[A_{ti}^2 \mid \mathcal G_{t-1}] = 1$ and $\mathbb E[A_{ti}\eta_t \mid \mathcal G_{t-1}] = 0$. This looks quite promising, but we should also check the variance. Using the assumptions that $\mathbb E[\eta_t] = 0$, $\mathbb E[\eta_t^2] \le 1$ and $|\langle a, \theta\rangle| \le 1$ for all actions $a$, we have
\begin{equation}
\mathbb V[A_{ti} X_t \mid \mathcal G_{t-1}] = \mathbb E[A_{ti}^2 X_t^2 \mid \mathcal G_{t-1}] - \theta_i^2 = \mathbb E[(\langle A_t, \theta\rangle + \eta_t)^2 \mid \mathcal G_{t-1}] - \theta_i^2 \le 2\,.
\tag{23.2}
\end{equation}
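Both the identity $\mathbb E[A_{ti}X_t \mid \mathcal G_{t-1}] = \theta_i$ and the variance bound in Eq.\ (23.2) are easy to check numerically. The short simulation below is only a sanity check under assumed choices (a particular $\theta$ with $\|\theta\|_1 \le 1$, standard Gaussian noise and an arbitrary sample size); it compares the empirical mean and variance of $A_{ti}X_t$ with $\theta_i$ and with the bound of $2$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.4, 0.0, -0.3])   # example parameter with ||theta||_1 <= 1
d, n = len(theta), 100_000
i = 0                                # coordinate whose estimate we examine

A = rng.choice([-1.0, 1.0], size=(n, d))  # Rademacher actions, independent across coordinates
eta = rng.normal(0.0, 1.0, size=n)        # noise with mean 0 and variance <= 1
X = A @ theta + eta                       # rewards X_t = <A_t, theta> + eta_t
Z = A[:, i] * X                           # increments A_ti * X_t of the estimator

print("empirical mean of A_ti X_t:", Z.mean(), "  theta_i:", theta[i])
print("empirical variance:        ", Z.var(),  "  bound:  ", 2.0)
\end{verbatim}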
And now we have cause for celebration. The value of $\theta_i$ can be estimated by choosing $A_{ti}$ to be a Rademacher random variable independent of the choices in other dimensions. All the policy does is treat all dimensions independently. For a particular dimension (say $i$) it explores by choosing $A_{ti} \in \{-1, 1\}$ uniformly at random until its estimate is sufficiently accurate to commit to either $A_{ti} = 1$ or $A_{ti} = -1$ for all future rounds. How long this takes depends on $|\theta_i|$, but note that if $|\theta_i|$ is small, then the price of exploring is also limited. The policy that results from this idea is called Selective Explore-Then-Commit (Algorithm 13, SETC).
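To make the idea concrete, the following Python sketch implements the per-dimension explore-then-commit principle just described. It is only a sketch, not a reproduction of Algorithm 13: the function name, the Gaussian noise, the confidence level $\delta = 1/n$ and the Hoeffding-style confidence width built from the variance bound (23.2) are illustrative choices; the precise stopping rule and constants of SETC are those given in the algorithm itself.
\begin{verbatim}
import numpy as np

def setc_sketch(theta, n, delta=None, rng=None):
    """Illustrative per-dimension explore-then-commit on the hypercube.

    Each coordinate is explored with Rademacher samples until a confidence
    interval around its running estimate excludes zero, after which the
    coordinate is fixed to the sign of the estimate."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(theta)
    delta = 1.0 / n if delta is None else delta   # illustrative confidence level
    est = np.zeros(d)         # running estimates of theta_i
    counts = np.zeros(d)      # exploration rounds used per coordinate
    committed = np.zeros(d)   # 0 while exploring, otherwise -1 or +1
    total_reward = 0.0
    for _ in range(n):
        a = np.where(committed != 0, committed, rng.choice([-1.0, 1.0], size=d))
        x = a @ theta + rng.normal()              # reward <A_t, theta> + noise
        total_reward += x
        for i in range(d):
            if committed[i] == 0:
                counts[i] += 1
                est[i] += (a[i] * x - est[i]) / counts[i]
                # Hoeffding-style width using the variance proxy 2 from (23.2)
                width = np.sqrt(4.0 * np.log(1.0 / delta) / counts[i])
                if abs(est[i]) > width:
                    committed[i] = np.sign(est[i])
    return total_reward, committed

# Example run: sparse theta, so most coordinates cost little exploration.
reward, signs = setc_sketch(np.array([0.5, 0.0, -0.25, 0.0]), n=10_000)
\end{verbatim}
A coordinate with $\theta_i = 0$ never commits in this sketch, but since the choice of $A_{ti}$ in such a coordinate does not affect the mean reward, the continued exploration there costs nothing in expectation, mirroring the $\|\theta\|_0$ dependence in the bound below.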
Theorem 23.2. There exist universal constants $C, C' > 0$ such that the regret of SETC satisfies
\[
R_n \le 3\|\theta\|_1 + C \sum_{i : \theta_i \neq 0} \frac{\log(n)}{|\theta_i|}
\qquad \text{and} \qquad
R_n \le 3\|\theta\|_1 + C' \|\theta\|_0 \sqrt{n \log(n)}\,.
\]
By appealing to the central limit theorem and the variance calculation in