
Now by assumption, for any $\nu \in \mathcal{E}_k$ we have
$$
R_n(\pi,\nu) \le \int_0^\infty \mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge x\right) dx \le B\sqrt{n(k-1)} \int_0^\infty \exp\left(-x^{1/p}\right) dx \le B\sqrt{n(k-1)}\,.
$$
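To spell out the middle inequality: assuming the hypothesis under contradiction is a tail bound of the form $\mathbb{P}(\bar{R}_n(\pi,\nu) \ge B\sqrt{n(k-1)}\,\log^p(1/\delta)) \le \delta$ for all $\delta \in (0,1)$ (this form is inferred from the display above, since the assumption is stated earlier in the proof), reparameterise $\delta = \exp(-u^{1/p})$ so that $\log^p(1/\delta) = u$ and the bound reads $\mathbb{P}(\bar{R}_n(\pi,\nu) \ge B\sqrt{n(k-1)}\,u) \le \exp(-u^{1/p})$ for all $u > 0$. Substituting $x = B\sqrt{n(k-1)}\,u$ in the tail integral gives
$$
\int_0^\infty \mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge x\right) dx = B\sqrt{n(k-1)} \int_0^\infty \mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge B\sqrt{n(k-1)}\,u\right) du \le B\sqrt{n(k-1)} \int_0^\infty \exp\left(-u^{1/p}\right) du\,,
$$
and the final inequality follows because $\int_0^\infty \exp(-u^{1/p})\,du = \Gamma(1+p) \le 1$ for $p \in (0,1]$.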

Therefore, by Theorem 17.1, there exists a bandit $\nu \in \mathcal{E}_k$ such that
$$
\mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge B\sqrt{n(k-1)}\,\log^p\left(\frac{1}{\delta}\right)\right) \ge \mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge \frac{1}{4}\min\left\{n,\ \frac{1}{B}\sqrt{n(k-1)}\,\log\left(\frac{1}{4\delta}\right)\right\}\right) \ge \delta\,,
$$
which contradicts our assumption and completes the proof.
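The first of the two inequalities deserves a comment: it holds whenever the thresholds compare as
$$
B\sqrt{n(k-1)}\,\log^p\left(\frac{1}{\delta}\right) \;\le\; \frac{1}{4}\min\left\{n,\ \frac{1}{B}\sqrt{n(k-1)}\,\log\left(\frac{1}{4\delta}\right)\right\}\,,
$$
and, assuming $p < 1$ as in the statement of the result being proved, such a choice of parameters is possible: first take $\delta$ small enough that $4B^2\log^p(1/\delta) \le \log(1/(4\delta))$, which is possible because $\log^p(1/\delta)$ grows more slowly than $\log(1/(4\delta))$, and then take $n$ large enough that the minimum is attained by its second argument.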

We suspect there exists a policy $\pi$ and universal constant $B > 0$ such that for all $\nu \in \mathcal{E}_k$,
$$
\mathbb{P}\left(\bar{R}_n(\pi,\nu) \ge B\sqrt{kn\log\left(\frac{1}{\delta}\right)}\right) \le \delta\,.
$$

17.2 Adversarial bandits


We now explain how to translate the ideas in the previous section to the adversarial model. Let $\pi = (\pi_t)_{t=1}^n$ be a fixed policy and recall that for $x \in [0,1]^{n \times k}$ the random regret is
$$
\hat{R}_n = \max_{i \in [k]} \sum_{t=1}^n \left(x_{ti} - x_{tA_t}\right)\,.
$$
Let $F_x$ be the cumulative distribution function of the law of $\hat{R}_n$ when policy $\pi$ interacts with the adversarial bandit $x \in [0,1]^{n \times k}$.
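Since $\hat{R}_n$ and $F_x$ are defined by explicit formulas, it may help to see them computed. Below is a minimal sketch (not from the book): it evaluates the random regret for a reward matrix $x$ and a sequence of plays, and estimates $F_x$ by repeated simulation. The function names and the uniformly random policy are illustrative assumptions only.

```python
import numpy as np

def random_regret(x, actions):
    """hat-R_n = max_i sum_t (x[t, i] - x[t, A_t]) for a reward matrix x of
    shape (n, k) and an integer action sequence of length n."""
    n, k = x.shape
    best_fixed = x.sum(axis=0).max()            # best single arm in hindsight
    collected = x[np.arange(n), actions].sum()  # reward the policy collected
    return best_fixed - collected

def empirical_cdf_of_regret(x, policy, num_runs=1000, seed=None):
    """Estimate F_x(u) = P(hat-R_n <= u) by running a (randomised) policy
    repeatedly against the same reward matrix x."""
    rng = np.random.default_rng(seed)
    n, k = x.shape
    samples = np.array([
        random_regret(x, np.array([policy(t, rng) for t in range(n)]))
        for _ in range(num_runs)
    ])
    return lambda u: np.mean(samples <= u)

# Example: a uniformly random policy on a fixed reward matrix.
rng = np.random.default_rng(0)
x = rng.uniform(size=(100, 5))                  # n = 100 rounds, k = 5 arms
F = empirical_cdf_of_regret(x, lambda t, r: r.integers(5), num_runs=500)
print(F(30.0))                                  # estimated P(hat-R_n <= 30)
```

A real policy would of course use the observed rewards `x[t, A_t]` from earlier rounds; the stateless random policy above is only there to make the definitions concrete.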

Theorem 17.4. Let $c, C > 0$ be sufficiently small/large universal constants and $k \ge 2$, $n \ge 1$ and $\delta \in (0,1)$ be such that $n \ge Ck\log(1/(2\delta))$. Then there exists a reward sequence $x \in [0,1]^{n \times k}$ such that
$$
1 - F_x\left(c\sqrt{nk\log\left(\frac{1}{2\delta}\right)}\right) \ge \delta\,.
$$
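Unpacking the notation: since $F_x(u) = \mathbb{P}(\hat{R}_n \le u)$, the conclusion of the theorem is equivalent to
$$
\mathbb{P}\left(\hat{R}_n > c\sqrt{nk\log\left(\frac{1}{2\delta}\right)}\right) \ge \delta\,,
$$
which matches the form of the conjectured upper bound from the previous section, up to the universal constants and the replacement of $\bar{R}_n$ by $\hat{R}_n$.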

The proof is a bit messy, but not completely without interest. For the sake of brevity we explain only the high-level ideas and refer you elsewhere for the gory details. There are two difficulties in translating the arguments in the previous section to the adversarial model. First, in the adversarial model we need the