Bandit Algorithms

17.2 Adversarial bandits

rewards to be bounded in $[0,1]$. The second difficulty is that we now analyse the adversarial regret rather than the random pseudo-regret. Given a measure $Q$, let $X \in [0,1]^{n\times k}$ and $(A_t)_{t=1}^n$ be a collection of random variables on a probability space $(\Omega, \mathcal F, \mathbb P_Q)$ such that:

(a) $\mathbb P_Q(X \in B) = Q(B)$ for all $B \in \mathfrak B([0,1]^{n\times k})$.
(b) $\mathbb P_Q(A_t \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}) = \pi_t(A_t \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$ almost surely, where $X_s = X_{sA_s}$.


Then the regret is a random variable $\hat R_n \colon \Omega \to \mathbb R$ defined by
\[
\hat R_n = \max_{i \in [k]} \sum_{t=1}^n \left( X_{ti} - X_{tA_t} \right).
\]
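As a concrete illustration of (a), (b) and the regret definition, here is a minimal simulation sketch. The uniform product measure for $Q$ and the uniform-random policy are placeholder assumptions for illustration only, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5

# (a) Sample the reward table X ~ Q. As a placeholder, Q is taken to be
# the uniform product measure on [0,1]^{n x k}.
X = rng.uniform(0.0, 1.0, size=(n, k))

# (b) Run a policy. A real policy may use the observed history
# (A_1, X_1, ..., A_{t-1}, X_{t-1}); this placeholder ignores it.
A = np.empty(n, dtype=int)
for t in range(n):
    A[t] = rng.integers(k)        # A_t ~ pi_t(. | history)
observed = X[np.arange(n), A]     # the learner observes X_t = X_{t, A_t}

# Adversarial regret: best single arm in hindsight minus collected reward.
regret = X.sum(axis=0).max() - observed.sum()
print(f"realised regret: {regret:.2f}")
```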

Suppose we sample $X \in [0,1]^{n\times k}$ from a distribution $Q$ on $([0,1]^{n\times k}, \mathfrak B([0,1]^{n\times k}))$.

Claim 17.5. Suppose that $X \sim Q$ where $Q$ is a measure on $[0,1]^{n\times k}$ with the Borel $\sigma$-algebra and that $\mathbb E_Q[1 - F_X(u)] \geq \delta$. Then there exists an $x \in [0,1]^{n\times k}$ such that $1 - F_x(u) \geq \delta$.
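The claim is a first-moment (averaging) argument. As a sketch, assuming from the surrounding context that $F_x$ is the distribution function of the regret when the reward table is $x$, and that it is measurable in $x$:
\[
\delta \leq \mathbb E_Q[1 - F_X(u)] = \int_{[0,1]^{n\times k}} \left(1 - F_x(u)\right) dQ(x),
\]
so if $1 - F_x(u) < \delta$ held for every $x$, the integral would be strictly less than $\delta$, a contradiction. Hence some $x$ satisfies $1 - F_x(u) \geq \delta$.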


The next step is to choose $Q$ and argue that $\mathbb E_Q[1 - F_X(u)] \geq \delta$ for sufficiently large $u$. To do this we need a truncated normal distribution. Define the clipping function
\[
\operatorname{clip}_{[0,1]}(x) =
\begin{cases}
1 & \text{if } x > 1 \\
0 & \text{if } x < 0 \\
x & \text{otherwise.}
\end{cases}
\]

Let $\sigma$ and $\Delta$ be positive constants to be chosen later and $(\eta_t)_{t=1}^n$ a sequence of independent random variables with $\eta_t \sim \mathcal N(1/2, \sigma^2)$. For each $i \in [k]$ let $Q_i$ be the distribution of $X \in [0,1]^{n\times k}$ where
\[
X_{tj} =
\begin{cases}
\operatorname{clip}_{[0,1]}(\eta_t + \Delta) & \text{if } j = 1 \\
\operatorname{clip}_{[0,1]}(\eta_t + 2\Delta) & \text{if } j = i \text{ and } i \neq 1 \\
\operatorname{clip}_{[0,1]}(\eta_t) & \text{otherwise.}
\end{cases}
\]
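A hedged sketch of this construction in code, with arbitrary placeholder parameter values (note the shift from the book's 1-based arms to 0-based indices):

```python
import numpy as np

def sample_reward_table(n, k, i, delta, sigma, rng):
    """Sample X ~ Q_i: one shared eta_t ~ N(1/2, sigma^2) per round,
    shifted by delta on arm 1 (index 0) and by 2*delta on arm i
    (when i is not arm 1), then clipped to [0, 1]."""
    eta = rng.normal(0.5, sigma, size=n)   # eta_t, shared across arms
    X = np.tile(eta[:, None], (1, k))      # X_{tj} = eta_t for every j
    X[:, 0] += delta                       # arm 1 gets eta_t + delta
    if i != 0:
        X[:, i] += 2 * delta               # arm i gets eta_t + 2*delta
    return np.clip(X, 0.0, 1.0)            # clip_{[0,1]}

rng = np.random.default_rng(0)
X = sample_reward_table(n=1000, k=5, i=2, delta=0.05, sigma=0.25, rng=rng)
```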

Notice that under any $Q_i$, for fixed $t$ the random variables $X_{t1}, \ldots, X_{tk}$ are not independent, but for fixed $j$ the random variables $X_{1j}, \ldots, X_{nj}$ are independent and identically distributed. Let $\mathbb P_i$ be the law of $A_1, X_1, \ldots, A_n, X_n$ when policy $\pi$ interacts with the adversarial bandit sampled as $X \sim Q_i$.

Claim 17.6. If $\sigma > 0$ and
\[
\Delta = \sigma \sqrt{\frac{k-1}{2n} \log\left(\frac{1}{8\delta}\right)},
\]
then there exists an arm $i$ such that
\[
\mathbb P_{Q_i}(T_i(n) < n/2) \geq 2\delta.
\]
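A hedged reconstruction of how the relative-entropy calculation plausibly goes (my reading, chosen to match the constant $8\delta$; not quoted from the text). By the pigeonhole principle there is an arm $i \neq 1$ with $\mathbb E_{\mathbb P_1}[T_i(n)] \leq n/(k-1)$. The laws $\mathbb P_1$ and $\mathbb P_i$ differ only in the rewards of arm $i$, whose unclipped means differ by $2\Delta$ at equal variance $\sigma^2$; since clipping cannot increase relative entropy,
\[
D(\mathbb P_1, \mathbb P_i) \leq \mathbb E_{\mathbb P_1}[T_i(n)] \cdot \frac{(2\Delta)^2}{2\sigma^2} \leq \frac{n}{k-1} \cdot \frac{2\Delta^2}{\sigma^2} = \log\left(\frac{1}{8\delta}\right).
\]
The Bretagnolle–Huber inequality then gives
\[
\mathbb P_1(T_i(n) \geq n/2) + \mathbb P_i(T_i(n) < n/2) \geq \frac{1}{2}\exp\left(-D(\mathbb P_1, \mathbb P_i)\right) \geq 4\delta,
\]
so at least one of the two probabilities is $2\delta$ or more, and a short case analysis using $T_1(n) + T_i(n) \leq n$ converts this into the statement of the claim.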

The proof of this claim follows along the same lines as the theorems in the previous section; all that changes is the calculation of the relative entropy, as in the sketch above. The last step is to relate $T_i(n)$ to the random regret. In the stochastic model this was