17.2 Adversarial bandits
rewards to be bounded in $[0,1]$. The second difficulty is that we now analyse the adversarial regret rather than the random pseudo-regret. Given a measure $Q$, let $X \in [0,1]^{n\times k}$ and $(A_t)_{t=1}^n$ be a collection of random variables on a probability space $(\Omega, \mathcal{F}, \mathbb{P}_Q)$ such that:

(a) $\mathbb{P}_Q(X \in B) = Q(B)$ for all $B \in \mathfrak{B}([0,1]^{n\times k})$.
(b) $\mathbb{P}_Q(A_t \mid A_1, X_1, \dots, A_{t-1}, X_{t-1}) = \pi_t(A_t \mid A_1, X_1, \dots, A_{t-1}, X_{t-1})$ almost surely, where $X_s = X_{sA_s}$.
Then the regret is a random variable $\hat{R}_n : \Omega \to \mathbb{R}$ defined by
\[
\hat{R}_n = \max_{i \in [k]} \sum_{t=1}^n \left( X_{ti} - X_{tA_t} \right).
\]
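The definition above can be made concrete with a short computation. The following sketch (an illustration assuming NumPy; the function name `random_regret` is my own) evaluates the realised adversarial regret for a given reward table and action sequence:

```python
import numpy as np

def random_regret(X, A):
    """Realised adversarial regret: max_i sum_t (X[t, i] - X[t, A[t]]).

    X : (n, k) array, X[t, i] is the reward of arm i in round t.
    A : (n,) integer array, A[t] is the arm the policy played in round t.
    """
    n = X.shape[0]
    best_arm_total = X.sum(axis=0).max()       # best single arm in hindsight
    collected = X[np.arange(n), A].sum()       # rewards the policy actually received
    return best_arm_total - collected
```

Note that, unlike the stochastic pseudo-regret, this quantity is a random variable and can even be negative, since the policy's collected reward is compared against the best *fixed* arm in hindsight.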
Suppose we sample $X \in [0,1]^{n\times k}$ from a distribution $Q$ on $([0,1]^{n\times k}, \mathfrak{B}([0,1]^{n\times k}))$.
Claim 17.5. Suppose that $X \sim Q$ where $Q$ is a measure on $[0,1]^{n\times k}$ with the Borel $\sigma$-algebra and that $\mathbb{E}_Q[1 - F_X(u)] \ge \delta$. Then there exists an $x \in [0,1]^{n\times k}$ such that $1 - F_x(u) \ge \delta$.
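The claim is an averaging (first-moment) argument, which can be written in one line:

```latex
% If 1 - F_x(u) < \delta held for every x \in [0,1]^{n \times k}, then integrating
% over Q would give
\mathbb{E}_Q\bigl[1 - F_X(u)\bigr]
  = \int_{[0,1]^{n\times k}} \bigl(1 - F_x(u)\bigr)\, dQ(x) < \delta ,
% contradicting the assumption. Hence some x satisfies 1 - F_x(u) \ge \delta.
```

In other words, an expectation cannot exceed every value of the integrand, so at least one realisation of $X$ must witness the bound.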
The next step is to choose $Q$ and argue that $\mathbb{E}_Q[1 - F_X(u)] \ge \delta$ for sufficiently large $u$. To do this we need a truncated normal distribution. Define the clipping function
\[
\operatorname{clip}_{[0,1]}(x) =
\begin{cases}
1 & \text{if } x > 1 \\
0 & \text{if } x < 0 \\
x & \text{otherwise.}
\end{cases}
\]
Let $\sigma$ and $\Delta$ be positive constants to be chosen later and $(\eta_t)_{t=1}^n$ a sequence of independent random variables with $\eta_t \sim \mathcal{N}(1/2, \sigma^2)$. For each $i \in [k]$ let $Q_i$ be the distribution of $X \in [0,1]^{n\times k}$ where
\[
X_{tj} =
\begin{cases}
\operatorname{clip}_{[0,1]}(\eta_t + \Delta) & \text{if } j = 1 \\
\operatorname{clip}_{[0,1]}(\eta_t + 2\Delta) & \text{if } j = i \text{ and } i \ne 1 \\
\operatorname{clip}_{[0,1]}(\eta_t) & \text{otherwise.}
\end{cases}
\]
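A minimal sketch of sampling from $Q_i$, assuming NumPy and using 0-based arm indices (the book's arm $1$ is index $0$ here; the function name `sample_Q_i` is my own):

```python
import numpy as np

def sample_Q_i(n, k, i, sigma, delta, rng=None):
    """Draw a reward table X ~ Q_i.

    sigma, delta : the constants sigma and Delta from the text.
    i            : the distinguished arm (0-based; i == 0 means arm 1).
    """
    rng = np.random.default_rng(rng)
    eta = rng.normal(0.5, sigma, size=n)   # eta_t ~ N(1/2, sigma^2), shared within a round
    X = np.tile(eta[:, None], (1, k))      # every arm starts from the common eta_t
    X[:, 0] = eta + delta                  # arm 1 (index 0) gets eta_t + Delta
    if i != 0:
        X[:, i] = eta + 2 * delta          # arm i gets eta_t + 2*Delta when i != 1
    return np.clip(X, 0.0, 1.0)            # clip_[0,1] applied entrywise
```

Because every arm's reward in round $t$ is built from the same noise $\eta_t$, the columns of a single row are strongly dependent, while each column is an i.i.d. sequence across rounds, exactly as noted below.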
Notice that under any $Q_i$, for fixed $t$ the random variables $X_{t1},\dots,X_{tk}$ are not independent, but for fixed $j$ the random variables $X_{1j},\dots,X_{nj}$ are independent and identically distributed. Let $\mathbb{P}_i$ be the law of $A_1, X_1, \dots, A_n, X_n$ when policy $\pi$ interacts with the adversarial bandit with reward table sampled from $X \sim Q_i$.
Claim 17.6. If $\sigma > 0$ and
\[
\Delta = \sigma \sqrt{\frac{k-1}{2n} \log\left(\frac{1}{8\delta}\right)},
\]
then there exists an arm $i$ such that
\[
\mathbb{P}_{Q_i}\left( T_i(n) < n/2 \right) \ge 2\delta .
\]
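To see why this choice of $\Delta$ is natural, here is a rough sketch of the relative-entropy calculation, ignoring the effect of clipping (which the full proof must control) and treating arm $i$'s rewards as exactly Gaussian:

```latex
% KL divergence between two Gaussians with the same variance:
D\bigl(\mathcal{N}(\mu_1,\sigma^2), \mathcal{N}(\mu_2,\sigma^2)\bigr)
  = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2} .
% P_1 and P_i differ only through arm i, whose means are \eta_t versus
% \eta_t + 2\Delta, so by the chain rule
D(\mathbb{P}_1, \mathbb{P}_i)
  = \mathbb{E}_1[T_i(n)] \cdot \frac{(2\Delta)^2}{2\sigma^2}
  = \frac{2\Delta^2}{\sigma^2}\, \mathbb{E}_1[T_i(n)] .
% Since \sum_{i} \mathbb{E}_1[T_i(n)] = n, some arm i \ne 1 has
% \mathbb{E}_1[T_i(n)] \le n/(k-1), and substituting the chosen \Delta gives
D(\mathbb{P}_1, \mathbb{P}_i)
  \le \frac{2\Delta^2 n}{(k-1)\sigma^2}
  = \log\frac{1}{8\delta} .
% The Bretagnolle--Huber inequality then yields
\mathbb{P}_1\bigl(T_i(n) \ge n/2\bigr) + \mathbb{P}_i\bigl(T_i(n) < n/2\bigr)
  \ge \tfrac{1}{2}\exp\bigl(-D(\mathbb{P}_1,\mathbb{P}_i)\bigr) = 4\delta .
```

The claim then follows in the full proof after bounding the first probability on the left-hand side; the steps above are only meant to show where the factor $8\delta$ inside the logarithm comes from.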
The proof of this claim follows along the same lines as the theorems in the previous section. All that changes is the calculation of the relative entropy. The last step is to relate $T_i(n)$ to the random regret. In the stochastic model this was