
Now by assumption, for any $\nu \in \mathcal{E}_k$ we have
$$
R_n(\pi,\nu) \le \int_0^\infty \mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge x\right) dx \le B\sqrt{n(k-1)} \int_0^\infty \exp\left(-x^{1/p}\right) dx \le B\sqrt{n(k-1)}\,.
$$
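To spell out the middle inequality: assuming the hypothesis under contradiction is a tail bound of the form $\mathbb{P}(\bar{R}_n(\pi,\nu) \ge B\sqrt{n(k-1)}\,\log^p(1/\delta)) \le \delta$ for all $\delta \in (0,1)$ (this form is inferred from the display above, since the assumption is stated earlier in the proof), reparameterise $\delta = \exp(-u^{1/p})$ so that $\log^p(1/\delta) = u$ and the bound reads $\mathbb{P}(\bar{R}_n(\pi,\nu) \ge B\sqrt{n(k-1)}\,u) \le \exp(-u^{1/p})$ for all $u > 0$. Substituting $x = B\sqrt{n(k-1)}\,u$ in the tail integral gives
$$
\int_0^\infty \mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge x\right) dx = B\sqrt{n(k-1)} \int_0^\infty \mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge B\sqrt{n(k-1)}\,u\right) du \le B\sqrt{n(k-1)} \int_0^\infty \exp\left(-u^{1/p}\right) du\,,
$$
and the final inequality follows because $\int_0^\infty \exp(-u^{1/p})\,du = \Gamma(1+p) \le 1$ for $p \in (0,1]$.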

Therefore, by Theorem 17.1, there exists a bandit $\nu \in \mathcal{E}_k$ such that
$$
\mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge B\sqrt{n(k-1)}\,\log^p\left(\frac{1}{\delta}\right)\right) \ge \mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge \frac{1}{4}\min\left\{n,\ \frac{1}{B}\sqrt{n(k-1)}\,\log\left(\frac{1}{4\delta}\right)\right\}\right) \ge \delta\,,
$$
which contradicts our assumption and completes the proof.
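The first of the two inequalities deserves a comment: it holds whenever the thresholds compare as
$$
B\sqrt{n(k-1)}\,\log^p\left(\frac{1}{\delta}\right) \;\le\; \frac{1}{4}\min\left\{n,\ \frac{1}{B}\sqrt{n(k-1)}\,\log\left(\frac{1}{4\delta}\right)\right\}\,,
$$
and, assuming $p < 1$ as in the statement of the result being proved, such a choice of parameters is possible: first take $\delta$ small enough that $4B^2\log^p(1/\delta) \le \log(1/(4\delta))$, which is possible because $\log^p(1/\delta)$ grows more slowly than $\log(1/(4\delta))$, and then take $n$ large enough that the minimum is attained by its second argument.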

We suspect there exists a policy $\pi$ and universal constant $B > 0$ such that for all $\nu \in \mathcal{E}_k$,
$$
\mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge B\sqrt{kn\log\left(\frac{1}{\delta}\right)}\right) \le \delta\,.
$$

17.2 Adversarial bandits


We now explain how to translate the ideas in the previous section to the adversarial model. Let $\pi = (\pi_t)_{t=1}^n$ be a fixed policy and recall that for $x \in [0,1]^{n \times k}$ the random regret is
$$
\hat{R}_n = \max_{i \in [k]} \sum_{t=1}^n \left(x_{ti} - x_{tA_t}\right)\,.
$$
Let $F_x$ be the cumulative distribution function of the law of $\hat{R}_n$ when policy $\pi$ interacts with the adversarial bandit $x \in [0,1]^{n \times k}$.
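Since $\hat{R}_n$ and $F_x$ are defined by explicit formulas, it may help to see them computed. Below is a minimal sketch (not from the book): it evaluates the random regret for a reward matrix $x$ and a sequence of plays, and estimates $F_x$ by repeated simulation. The function names and the uniformly random policy are illustrative assumptions only.

```python
import numpy as np

def random_regret(x, actions):
    """hat-R_n = max_i sum_t (x[t, i] - x[t, A_t]) for a reward matrix x of
    shape (n, k) and an integer action sequence of length n."""
    n, k = x.shape
    best_fixed = x.sum(axis=0).max()            # best single arm in hindsight
    collected = x[np.arange(n), actions].sum()  # reward the policy collected
    return best_fixed - collected

def empirical_cdf_of_regret(x, policy, num_runs=1000, seed=None):
    """Estimate F_x(u) = P(hat-R_n <= u) by running a (randomised) policy
    repeatedly against the same reward matrix x."""
    rng = np.random.default_rng(seed)
    n, k = x.shape
    samples = np.array([
        random_regret(x, np.array([policy(t, rng) for t in range(n)]))
        for _ in range(num_runs)
    ])
    return lambda u: np.mean(samples <= u)

# Example: a uniformly random policy on a fixed reward matrix.
rng = np.random.default_rng(0)
x = rng.uniform(size=(100, 5))                  # n = 100 rounds, k = 5 arms
F = empirical_cdf_of_regret(x, lambda t, r: r.integers(5), num_runs=500)
print(F(30.0))                                  # estimated P(hat-R_n <= 30)
```

A real policy would of course use the observed rewards `x[t, A_t]` from earlier rounds; the stateless random policy above is only there to make the definitions concrete.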

Theorem 17.4. Let $c, C > 0$ be sufficiently small/large universal constants and $k \ge 2$, $n \ge 1$ and $\delta \in (0,1)$ be such that $n \ge Ck\log(1/(2\delta))$. Then there exists a reward sequence $x \in [0,1]^{n \times k}$ such that
$$
1 - F_x\left(c\sqrt{nk\log\left(\frac{1}{2\delta}\right)}\right) \ge \delta\,.
$$
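Unpacking the notation: since $F_x(u) = \mathbb{P}(\hat{R}_n \le u)$, the conclusion of the theorem is equivalent to
$$
\mathbb{P}\left(\hat{R}_n > c\sqrt{nk\log\left(\frac{1}{2\delta}\right)}\right) \ge \delta\,,
$$
which matches the form of the conjectured upper bound from the previous section, up to the universal constants and the replacement of $\bar{R}_n$ by $\hat{R}_n$.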

The proof is a bit messy, but not completely without interest. For the sake of brevity we explain only the high-level ideas and refer you elsewhere for the gory details. There are two difficulties in translating the arguments in the previous section to the adversarial model. First, in the adversarial model we need the