
Proof of Lemma 12.2  Fix $t \in [n]$ and let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$ denote the conditional expectation with respect to $\mathcal{F}_t$. By Lemma 12.5 and the assumption that $0 \le \alpha_{ti} \tilde{Y}_{ti} \le 2\lambda_{ti}$ we have
\[
\exp\left( \frac{\alpha_{ti} \tilde{Y}_{ti}}{1 + \lambda_{ti}} \right) \le 1 + \alpha_{ti} \tilde{Y}_{ti} \,.
\]

Taking the product of these inequalities over $i$,
\[
\mathbb{E}_{t-1}\left[ \exp\left( \sum_{i=1}^k \frac{\alpha_{ti} \tilde{Y}_{ti}}{1 + \lambda_{ti}} \right) \right]
\le \mathbb{E}_{t-1}\left[ \prod_{i=1}^k \left(1 + \alpha_{ti} \tilde{Y}_{ti}\right) \right]
\le 1 + \mathbb{E}_{t-1}\left[ \sum_{i=1}^k \alpha_{ti} \tilde{Y}_{ti} \right]
= 1 + \sum_{i=1}^k \alpha_{ti} y_{ti}
\le \exp\left( \sum_{i=1}^k \alpha_{ti} y_{ti} \right) \,,
\tag{12.10}
\]


where the second inequality follows from the assumption that for $S \subset [k]$ with $|S| > 1$, $\mathbb{E}_{t-1}\left[ \prod_{i \in S} \tilde{Y}_{ti} \right] \le 0$, the third one follows from the assumption that $\mathbb{E}_{t-1}[\tilde{Y}_{ti}] = y_{ti}$, while the last one follows from $1 + x \le \exp(x)$. Define
\[
Z_t = \exp\left( \sum_{i=1}^k \alpha_{ti} \left( \frac{\tilde{Y}_{ti}}{1 + \lambda_{ti}} - y_{ti} \right) \right)
\]
and let $M_t = Z_1 \cdots Z_t$ for $t \in [n]$ with $M_0 = 1$. By (12.10), $\mathbb{E}_{t-1}[Z_t] \le 1$. Therefore
\[
\mathbb{E}[M_t] = \mathbb{E}\left[\mathbb{E}_{t-1}[M_t]\right] = \mathbb{E}\left[M_{t-1} \mathbb{E}_{t-1}[Z_t]\right] \le \mathbb{E}[M_{t-1}] \le \cdots \le \mathbb{E}[M_0] = 1 \,.
\]
Setting $t = n$ and combining the above display with Markov's inequality leads to
\[
\mathbb{P}\left(\log(M_n) \ge \log(1/\delta)\right) = \mathbb{P}\left(M_n \delta \ge 1\right) \le \mathbb{E}[M_n]\, \delta \le \delta \,.
\]
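Since $\log(M_n) = \sum_{t=1}^n \sum_{i=1}^k \alpha_{ti}\bigl(\tilde{Y}_{ti}/(1+\lambda_{ti}) - y_{ti}\bigr)$ by the definition of $Z_t$, the last display states explicitly that with probability at least $1 - \delta$,
\[
\sum_{t=1}^n \sum_{i=1}^k \alpha_{ti}\left( \frac{\tilde{Y}_{ti}}{1 + \lambda_{ti}} - y_{ti} \right) < \log(1/\delta) \,.
\]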

12.3 Notes


1 An upper bound on the expected regret of Exp3-IX can be obtained by integrating the tail:
\[
R_n \le \mathbb{E}\left[(\hat{R}_n)_+\right] = \int_0^\infty \mathbb{P}\left( (\hat{R}_n)_+ \ge x \right) dx \le \int_0^\infty \mathbb{P}\left( \hat{R}_n \ge x \right) dx \,,
\]
where the first equality follows from Proposition 2.8. The result is completed using the high-probability bound in Theorem 12.1 and straightforward integration. We leave the details to the reader in Exercise 12.4; a sketch of the integration step is given after these notes.
2 The analysis presented here uses a fixed learning rate that depends on the horizon. Replacing $\eta$ and $\gamma$ with $\eta_t = \sqrt{\log(k)/(kt)}$ and $\gamma_t = \eta_t/2$ leads to an anytime algorithm with about the same regret [Kocák et al., 2014, Neu, 2015a]. A code sketch of this anytime variant is also given after these notes.
3 The modified importance-weighted estimators used by Exp3-IX have another advantage: they lead to improved regret in the special case that one of the arms has small losses. Specifically, it is possible to show that
\[
R_n = O\left( \sqrt{k \min_{i \in [k]} L_{in} \log(k)} \right),
\]
where $L_{in}$ denotes the total loss of arm $i$ after $n$ rounds.
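For the tail-integration argument in Note 1, the calculation runs along the following lines. Suppose, purely for illustration, that the high-probability bound takes the form $\mathbb{P}(\hat{R}_n \ge a + b\log(1/\delta)) \le \delta$ for all $\delta \in (0,1)$, where $a, b \ge 0$ do not depend on $\delta$ (the actual constants are those appearing in Theorem 12.1). Substituting $x = a + b\log(1/\delta)$, or equivalently $\delta = \exp(-(x-a)/b)$, and splitting the integral at $a$ gives
\[
\int_0^\infty \mathbb{P}\left( \hat{R}_n \ge x \right) dx \le a + \int_a^\infty \exp\left( -\frac{x-a}{b} \right) dx = a + b \,,
\]
so the expected regret is bounded by the value of the high-probability bound at $\delta = 1/e$.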
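For the anytime variant in Note 2, the following is a minimal sketch of what an implementation might look like, assuming losses in $[0,1]$, exponential weights over cumulative loss estimates, and the implicit-exploration estimator of this chapter; the rates $\eta_t = \sqrt{\log(k)/(kt)}$ and $\gamma_t = \eta_t/2$ are those of the note, while the function name and the loss-feedback interface are illustrative only.

import math
import random

def exp3_ix_anytime(k, n, observe_loss, rng=None):
    # Anytime Exp3-IX sketch: eta_t = sqrt(log(k)/(k*t)), gamma_t = eta_t/2.
    # observe_loss(t, arm) should return the loss of the chosen arm in round t, in [0, 1].
    rng = rng or random.Random()
    loss_est = [0.0] * k  # cumulative importance-weighted loss estimates
    arms = []
    for t in range(1, n + 1):
        eta = math.sqrt(math.log(k) / (k * t))
        gamma = eta / 2.0
        # Exponential-weights distribution; shift by the minimum for numerical stability.
        m = min(loss_est)
        w = [math.exp(-eta * (loss_est[i] - m)) for i in range(k)]
        total = sum(w)
        p = [wi / total for wi in w]
        # Sample an arm and observe only its loss.
        a = rng.choices(range(k), weights=p)[0]
        y = observe_loss(t, a)
        # Implicit-exploration (IX) estimator: the extra gamma in the denominator
        # biases the estimate downward but keeps its variance under control.
        loss_est[a] += y / (p[a] + gamma)
        arms.append(a)
    return arms

The only change relative to running Exp3 with the same learning-rate schedule is the additive $\gamma_t$ in the denominator of the loss estimate.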
