Bandit Algorithms

17.2 Adversarial bandits

rewards to be bounded in $[0,1]$. The second difficulty is that we now analyse the adversarial regret rather than the random pseudo-regret. Given a measure $Q$, let $X \in [0,1]^{n\times k}$ and $(A_t)_{t=1}^n$ be a collection of random variables on a probability space $(\Omega, \mathcal F, \mathbb P_Q)$ such that:

(a) $\mathbb P_Q(X \in B) = Q(B)$ for all $B \in \mathfrak B([0,1]^{n\times k})$.
(b) $\mathbb P_Q(A_t \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}) = \pi_t(A_t \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$ almost surely, where $X_s = X_{sA_s}$.


Then the regret is a random variable $\hat R_n \colon \Omega \to \mathbb R$ defined by
\[
\hat R_n = \max_{i \in [k]} \sum_{t=1}^n \left( X_{ti} - X_{tA_t} \right).
\]
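As a concrete illustration of (a), (b) and the regret definition, here is a minimal simulation sketch. The uniform product measure for $Q$ and the uniform-random policy are placeholder assumptions for illustration only, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5

# (a) Sample the reward table X ~ Q. As a placeholder, Q is taken to be
# the uniform product measure on [0,1]^{n x k}.
X = rng.uniform(0.0, 1.0, size=(n, k))

# (b) Run a policy. A real policy may use the observed history
# (A_1, X_1, ..., A_{t-1}, X_{t-1}); this placeholder ignores it.
A = np.empty(n, dtype=int)
for t in range(n):
    A[t] = rng.integers(k)        # A_t ~ pi_t(. | history)
observed = X[np.arange(n), A]     # the learner observes X_t = X_{t, A_t}

# Adversarial regret: best single arm in hindsight minus collected reward.
regret = X.sum(axis=0).max() - observed.sum()
print(f"realised regret: {regret:.2f}")
```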

Suppose we sample $X \in [0,1]^{n\times k}$ from a distribution $Q$ on $([0,1]^{n\times k}, \mathfrak B([0,1]^{n\times k}))$.

Claim 17.5. Suppose that $X \sim Q$ where $Q$ is a measure on $[0,1]^{n\times k}$ with the Borel $\sigma$-algebra and that $\mathbb E_Q[1 - F_X(u)] \geq \delta$. Then there exists an $x \in [0,1]^{n\times k}$ such that $1 - F_x(u) \geq \delta$.
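The claim is a first-moment (averaging) argument. As a sketch, assuming from the surrounding context that $F_x$ is the distribution function of the regret when the reward table is $x$, and that it is measurable in $x$:
\[
\delta \leq \mathbb E_Q[1 - F_X(u)] = \int_{[0,1]^{n\times k}} \left(1 - F_x(u)\right) dQ(x),
\]
so if $1 - F_x(u) < \delta$ held for every $x$, the integral would be strictly less than $\delta$, a contradiction. Hence some $x$ satisfies $1 - F_x(u) \geq \delta$.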


The next step is to choose $Q$ and argue that $\mathbb E_Q[1 - F_X(u)] \geq \delta$ for sufficiently large $u$. To do this we need a truncated normal distribution. Define the clipping function
\[
\operatorname{clip}_{[0,1]}(x) =
\begin{cases}
1 & \text{if } x > 1 \\
0 & \text{if } x < 0 \\
x & \text{otherwise.}
\end{cases}
\]

Let $\sigma$ and $\Delta$ be positive constants to be chosen later and $(\eta_t)_{t=1}^n$ a sequence of independent random variables with $\eta_t \sim \mathcal N(1/2, \sigma^2)$. For each $i \in [k]$ let $Q_i$ be the distribution of $X \in [0,1]^{n\times k}$ where
\[
X_{tj} =
\begin{cases}
\operatorname{clip}_{[0,1]}(\eta_t + \Delta) & \text{if } j = 1 \\
\operatorname{clip}_{[0,1]}(\eta_t + 2\Delta) & \text{if } j = i \text{ and } i \neq 1 \\
\operatorname{clip}_{[0,1]}(\eta_t) & \text{otherwise.}
\end{cases}
\]
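A hedged sketch of this construction in code, with arbitrary placeholder parameter values (note the shift from the book's 1-based arms to 0-based indices):

```python
import numpy as np

def sample_reward_table(n, k, i, delta, sigma, rng):
    """Sample X ~ Q_i: one shared eta_t ~ N(1/2, sigma^2) per round,
    shifted by delta on arm 1 (index 0) and by 2*delta on arm i
    (when i is not arm 1), then clipped to [0, 1]."""
    eta = rng.normal(0.5, sigma, size=n)   # eta_t, shared across arms
    X = np.tile(eta[:, None], (1, k))      # X_{tj} = eta_t for every j
    X[:, 0] += delta                       # arm 1 gets eta_t + delta
    if i != 0:
        X[:, i] += 2 * delta               # arm i gets eta_t + 2*delta
    return np.clip(X, 0.0, 1.0)            # clip_{[0,1]}

rng = np.random.default_rng(0)
X = sample_reward_table(n=1000, k=5, i=2, delta=0.05, sigma=0.25, rng=rng)
```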

Notice that under any $Q_i$, for fixed $t$ the random variables $X_{t1}, \ldots, X_{tk}$ are not independent, but for fixed $j$ the random variables $X_{1j}, \ldots, X_{nj}$ are independent and identically distributed. Let $\mathbb P_i$ be the law of $A_1, X_1, \ldots, A_n, X_n$ when policy $\pi$ interacts with the adversarial bandit sampled as $X \sim Q_i$.

Claim 17.6. If $\sigma > 0$ and
\[
\Delta = \sigma \sqrt{\frac{k-1}{2n} \log\left(\frac{1}{8\delta}\right)},
\]
then there exists an arm $i$ such that
\[
\mathbb P_{Q_i}(T_i(n) < n/2) \geq 2\delta.
\]
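A hedged reconstruction of how the relative-entropy calculation plausibly goes (my reading, chosen to match the constant $8\delta$; not quoted from the text). By the pigeonhole principle there is an arm $i \neq 1$ with $\mathbb E_{\mathbb P_1}[T_i(n)] \leq n/(k-1)$. The laws $\mathbb P_1$ and $\mathbb P_i$ differ only in the rewards of arm $i$, whose unclipped means differ by $2\Delta$ at equal variance $\sigma^2$; since clipping cannot increase relative entropy,
\[
D(\mathbb P_1, \mathbb P_i) \leq \mathbb E_{\mathbb P_1}[T_i(n)] \cdot \frac{(2\Delta)^2}{2\sigma^2} \leq \frac{n}{k-1} \cdot \frac{2\Delta^2}{\sigma^2} = \log\left(\frac{1}{8\delta}\right).
\]
The Bretagnolle–Huber inequality then gives
\[
\mathbb P_1(T_i(n) \geq n/2) + \mathbb P_i(T_i(n) < n/2) \geq \frac{1}{2}\exp\left(-D(\mathbb P_1, \mathbb P_i)\right) \geq 4\delta,
\]
so at least one of the two probabilities is $2\delta$ or more, and a short case analysis using $T_1(n) + T_i(n) \leq n$ converts this into the statement of the claim.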

The proof of this claim follows along the same lines as the theorems in the previous section; all that changes is the calculation of the relative entropy, as in the sketch above. The last step is to relate $T_i(n)$ to the random regret. In the stochastic model this was