Bandit Algorithms


of $\theta$ is at least
\[
\begin{aligned}
R_n(\mathcal{A}, \theta)
&= \mathbb{E}_\theta\left[\sum_{t=1}^n \sum_{i=1}^d \bigl(\operatorname{sign}(\theta_i) - A_{ti}\bigr)\theta_i\right] \\
&\ge \sqrt{\frac{1}{n}}\, \sum_{i=1}^d \mathbb{E}_\theta\left[\sum_{t=1}^n \mathbb{I}\bigl\{\operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\bigr\}\right] \\
&\ge \frac{\sqrt{n}}{2} \sum_{i=1}^d \mathbb{P}_\theta\left(\sum_{t=1}^n \mathbb{I}\bigl\{\operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\bigr\} \ge \frac{n}{2}\right) \\
&= \frac{\sqrt{n}}{2} \sum_{i=1}^d p_{\theta i}
\;\ge\; \frac{\exp(-2)}{8}\, d\sqrt{n},
\end{aligned}
\]
where the first line follows since the optimal action satisfies $a^*_i = \operatorname{sign}(\theta_i)$ for $i \in [d]$, the first inequality follows from a simple case-based analysis (spelled out below) showing that $(\operatorname{sign}(\theta_i) - A_{ti})\theta_i \ge |\theta_i|\,\mathbb{I}\{\operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\}$, the second inequality is Markov's inequality (see Lemma 5.1), and the last inequality follows from the choice of $\theta$.
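
To make the case-based analysis explicit, note that $\operatorname{sign}(\theta_i)\theta_i = |\theta_i|$ and $|A_{ti}| \le 1$ for actions in the hypercube, so
\[
\bigl(\operatorname{sign}(\theta_i) - A_{ti}\bigr)\theta_i = |\theta_i| - A_{ti}\theta_i \ge
\begin{cases}
0 & \text{if } \operatorname{sign}(A_{ti}) = \operatorname{sign}(\theta_i)\,, \\
|\theta_i| & \text{if } \operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\,,
\end{cases}
\]
since $A_{ti}\theta_i \le |A_{ti}||\theta_i| \le |\theta_i|$ in the first case and $A_{ti}\theta_i \le 0$ in the second. Both cases are summarised by the lower bound $|\theta_i|\,\mathbb{I}\{\operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\}$.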

Except for logarithmic factors, this shows that the algorithm of Chapter 19 is near-optimal for this action set. The same proof works when $\mathcal{A} = \{-1, 1\}^d$ is restricted to the corners of the hypercube, which is a finite-armed linear bandit. In Chapter 22 we gave a policy with regret $R_n = O(\sqrt{nd\log(nk)})$, where $k = |\mathcal{A}|$. There is no contradiction because the action set in the above proof has $k = |\mathcal{A}| = 2^d$ elements, so that $\sqrt{nd\log(nk)} = \sqrt{nd(\log n + d\log 2)} = \Omega(d\sqrt{n})$.
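
The construction is easy to simulate. The following minimal sketch (not from the book; the explore-then-commit baseline and all names are illustrative assumptions) draws $\theta$ uniformly from $\{\pm 1/\sqrt{n}\}^d$, generates rewards $\langle A_t, \theta\rangle$ plus standard Gaussian noise, and compares the resulting regret against the $\exp(-2)\, d\sqrt{n}/8$ lower bound.

import numpy as np

# Illustrative simulation of the hypercube lower-bound construction:
# theta_i = +/- 1/sqrt(n), rewards are <A_t, theta> plus standard Gaussian
# noise, and the learner is a simple explore-then-commit baseline.
rng = np.random.default_rng(0)
n, d = 10_000, 10
theta = rng.choice([-1.0, 1.0], size=d) / np.sqrt(n)
a_star = np.sign(theta)  # optimal action: a*_i = sign(theta_i)

m = n // 10  # number of exploration rounds (the split is arbitrary here)
A_explore = rng.choice([-1.0, 1.0], size=(m, d))  # random hypercube corners
X = A_explore @ theta + rng.standard_normal(m)    # noisy linear rewards
theta_hat, *_ = np.linalg.lstsq(A_explore, X, rcond=None)  # least squares

# Regret = sum_t <a* - A_t, theta>: exploration phase plus commit phase.
regret = ((a_star - A_explore) @ theta).sum()
regret += (n - m) * float((a_star - np.sign(theta_hat)) @ theta)

lower_bound = np.exp(-2) * d * np.sqrt(n) / 8
print(f"regret ~ {regret:.1f} vs. lower bound ~ {lower_bound:.1f}")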

24.2 Unit ball


Lower bounding the minimax regret when the action set is the unit ball presents an additional challenge relative to the hypercube. The product structure of the hypercube means that the actions of the learner in one dimension do not constrain their choices in other dimensions. For the unit ball this is not true, and this complicates the analysis. Nevertheless, a small modification of the technique allows us to prove a similar bound.
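
To see the coupling concretely: every action in the unit ball satisfies $\sum_{i=1}^d A_{ti}^2 \le 1$, so over $n$ rounds the total
\[
\sum_{i=1}^d \sum_{t=1}^n A_{ti}^2 \le n
\]
is a single exploration budget shared by all $d$ coordinates, whereas on the hypercube each coordinate separately enjoys $\sum_{t=1}^n A_{ti}^2 \le n$.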

Theorem 24.2. Assume $d \le 2n$ and let $\mathcal{A} = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$. Then there exists a parameter vector $\theta \in \mathbb{R}^d$ with $\|\theta\|_2^2 = d^2/(48n)$ such that $R_n(\mathcal{A}, \theta) \ge d\sqrt{n}/(16\sqrt{3})$.
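
As a quick consistency check, the value of $\Delta$ chosen at the start of the proof below gives exactly the norm in the theorem statement:
\[
\|\theta\|_2^2 = d\,\Delta^2 = d \cdot \frac{1}{48} \cdot \frac{d}{n} = \frac{d^2}{48n}.
\]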


Proof Let $\Delta = \frac{1}{4\sqrt{3}}\sqrt{\frac{d}{n}}$ and $\theta \in \{\pm\Delta\}^d$, and for $i \in [d]$ define $\tau_i = n \wedge \min\{t :$