Bandit Algorithms


of $\theta$ is at least
\[
\begin{aligned}
R_n(\mathcal{A}, \theta)
&= \mathbb{E}_\theta\left[\sum_{t=1}^n \sum_{i=1}^d \bigl(\operatorname{sign}(\theta_i) - A_{ti}\bigr)\theta_i\right] \\
&\ge \sqrt{\frac{1}{n}}\, \sum_{i=1}^d \mathbb{E}_\theta\left[\sum_{t=1}^n \mathbb{I}\bigl\{\operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\bigr\}\right] \\
&\ge \frac{\sqrt{n}}{2} \sum_{i=1}^d \mathbb{P}_\theta\left(\sum_{t=1}^n \mathbb{I}\bigl\{\operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\bigr\} \ge \frac{n}{2}\right) \\
&= \frac{\sqrt{n}}{2} \sum_{i=1}^d p_{\theta i}
\;\ge\; \frac{\exp(-2)}{8}\, d\sqrt{n},
\end{aligned}
\]
where the first line follows since the optimal action satisfies $a^*_i = \operatorname{sign}(\theta_i)$ for $i \in [d]$, the first inequality follows from a simple case-based analysis (spelled out below) showing that $(\operatorname{sign}(\theta_i) - A_{ti})\theta_i \ge |\theta_i|\,\mathbb{I}\{\operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\}$, the second inequality is Markov's inequality (see Lemma 5.1), and the last inequality follows from the choice of $\theta$.
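
To make the case-based analysis explicit, note that $\operatorname{sign}(\theta_i)\theta_i = |\theta_i|$ and $|A_{ti}| \le 1$ for actions in the hypercube, so
\[
\bigl(\operatorname{sign}(\theta_i) - A_{ti}\bigr)\theta_i = |\theta_i| - A_{ti}\theta_i \ge
\begin{cases}
0 & \text{if } \operatorname{sign}(A_{ti}) = \operatorname{sign}(\theta_i)\,, \\
|\theta_i| & \text{if } \operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\,,
\end{cases}
\]
since $A_{ti}\theta_i \le |A_{ti}||\theta_i| \le |\theta_i|$ in the first case and $A_{ti}\theta_i \le 0$ in the second. Both cases are summarised by the lower bound $|\theta_i|\,\mathbb{I}\{\operatorname{sign}(A_{ti}) \ne \operatorname{sign}(\theta_i)\}$.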

Except for logarithmic factors, this shows that the algorithm of Chapter 19 is near-optimal for this action set. The same proof works when $\mathcal{A} = \{-1, 1\}^d$ is restricted to the corners of the hypercube, which is a finite-armed linear bandit. In Chapter 22 we gave a policy with regret $R_n = O(\sqrt{nd\log(nk)})$, where $k = |\mathcal{A}|$. There is no contradiction because the action set in the above proof has $k = |\mathcal{A}| = 2^d$ elements, so that $\sqrt{nd\log(nk)} = \sqrt{nd(\log n + d\log 2)} = \Omega(d\sqrt{n})$.
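
The construction is easy to simulate. The following minimal sketch (not from the book; the explore-then-commit baseline and all names are illustrative assumptions) draws $\theta$ uniformly from $\{\pm 1/\sqrt{n}\}^d$, generates rewards $\langle A_t, \theta\rangle$ plus standard Gaussian noise, and compares the resulting regret against the $\exp(-2)\, d\sqrt{n}/8$ lower bound.

import numpy as np

# Illustrative simulation of the hypercube lower-bound construction:
# theta_i = +/- 1/sqrt(n), rewards are <A_t, theta> plus standard Gaussian
# noise, and the learner is a simple explore-then-commit baseline.
rng = np.random.default_rng(0)
n, d = 10_000, 10
theta = rng.choice([-1.0, 1.0], size=d) / np.sqrt(n)
a_star = np.sign(theta)  # optimal action: a*_i = sign(theta_i)

m = n // 10  # number of exploration rounds (the split is arbitrary here)
A_explore = rng.choice([-1.0, 1.0], size=(m, d))  # random hypercube corners
X = A_explore @ theta + rng.standard_normal(m)    # noisy linear rewards
theta_hat, *_ = np.linalg.lstsq(A_explore, X, rcond=None)  # least squares

# Regret = sum_t <a* - A_t, theta>: exploration phase plus commit phase.
regret = ((a_star - A_explore) @ theta).sum()
regret += (n - m) * float((a_star - np.sign(theta_hat)) @ theta)

lower_bound = np.exp(-2) * d * np.sqrt(n) / 8
print(f"regret ~ {regret:.1f} vs. lower bound ~ {lower_bound:.1f}")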

24.2 Unit ball


Lower bounding the minimax regret when the action set is the unit ball presents an additional challenge relative to the hypercube. The product structure of the hypercube means that the actions of the learner in one dimension do not constrain their choices in other dimensions. For the unit ball this is not true, and this complicates the analysis. Nevertheless, a small modification of the technique allows us to prove a similar bound.
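
To see the coupling concretely: every action in the unit ball satisfies $\sum_{i=1}^d A_{ti}^2 \le 1$, so over $n$ rounds the total
\[
\sum_{i=1}^d \sum_{t=1}^n A_{ti}^2 \le n
\]
is a single exploration budget shared by all $d$ coordinates, whereas on the hypercube each coordinate separately enjoys $\sum_{t=1}^n A_{ti}^2 \le n$.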

Theorem 24.2. Assume $d \le 2n$ and let $\mathcal{A} = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$. Then there exists a parameter vector $\theta \in \mathbb{R}^d$ with $\|\theta\|_2^2 = d^2/(48n)$ such that $R_n(\mathcal{A}, \theta) \ge d\sqrt{n}/(16\sqrt{3})$.
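
As a quick consistency check, the value of $\Delta$ chosen at the start of the proof below gives exactly the norm in the theorem statement:
\[
\|\theta\|_2^2 = d\,\Delta^2 = d \cdot \frac{1}{48} \cdot \frac{d}{n} = \frac{d^2}{48n}.
\]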


Proof Let $\Delta = \frac{1}{4\sqrt{3}}\sqrt{\frac{d}{n}}$ and $\theta \in \{\pm\Delta\}^d$, and for $i \in [d]$ define $\tau_i = n \wedge \min\{t :$