Bandit Algorithms


of the maximum function we have

\[
R_n(\pi,\nu) = \max_{i\in[k]} \mathbb{E}\left[\sum_{t=1}^{n} (X_{ti} - X_{tA_t})\right] \le \mathbb{E}\left[\max_{i\in[k]} \sum_{t=1}^{n} (X_{ti} - X_{tA_t})\right]
= \mathbb{E}[R_n(\pi,X)] \le R_n^*(\pi)\,, \qquad (11.2)
\]
where the regret in the first line is the stochastic regret and in the last it is the
adversarial regret. Therefore the worst-case stochastic regret is upper bounded
by the worst-case adversarial regret. Going the other way, the above inequality
also implies the worst-case regret for adversarial problems is lower bounded by
the worst-case regret on stochastic problems with rewards bounded in [0,1]. In
Chapter 15 we prove the worst-case regret for stochastic bandits is at least
$c\sqrt{nk}$, where $c > 0$ is a universal constant. And so for the same universal
constant the minimax regret for adversarial bandits satisfies
\[
R_n^* = \inf_{\pi} \sup_{x\in[0,1]^{n\times k}} R_n(\pi,x) \ge c\sqrt{nk}\,.
\]
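
As a quick sanity check of the inequality chain in Eq. (11.2), the following is a minimal Monte Carlo sketch. It assumes Bernoulli rewards with means $(0.3, 0.5, 0.7)$, a policy that plays arms uniformly at random and $n = 100$ rounds, all illustrative choices not taken from the text. The first printed number estimates $\max_i \mathbb{E}[\sum_t (X_{ti} - X_{tA_t})]$ and should not exceed the second, which estimates $\mathbb{E}[\max_i \sum_t (X_{ti} - X_{tA_t})]$.

import numpy as np

rng = np.random.default_rng(0)
n, k, runs = 100, 3, 2000
mu = np.array([0.3, 0.5, 0.7])                   # per-arm means (illustrative)

stoch_sums = np.zeros((runs, k))                 # per-run values of sum_t (X_ti - X_{t A_t})
adv_max = np.zeros(runs)                         # per-run values of max_i sum_t (X_ti - X_{t A_t})
for r in range(runs):
    X = (rng.random((n, k)) < mu).astype(float)  # full reward table X_{ti}
    A = rng.integers(0, k, size=n)               # uniformly random policy (illustrative)
    sums = (X - X[np.arange(n), A][:, None]).sum(axis=0)
    stoch_sums[r] = sums
    adv_max[r] = sums.max()

print(stoch_sums.mean(axis=0).max())             # estimates the stochastic regret in (11.2)
print(adv_max.mean())                            # estimates E[R_n(pi, X)]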

11.2 Importance-weighted estimators


A key ingredient of all adversarial bandit algorithms is a mechanism for estimating
the reward of unplayed arms. Recall that $P_t$ is the conditional distribution of the
action played in round $t$ and so $P_{ti}$ is the conditional probability
\[
P_{ti} = \mathbb{P}(A_t = i \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})\,.
\]
In what follows we assume that $P_{ti} > 0$ almost surely, which is true for all policies
considered in this chapter. The importance-weighted estimator of $x_{ti}$ is

\[
\hat{X}_{ti} = \frac{\mathbb{I}\{A_t = i\}\, X_t}{P_{ti}}\,. \qquad (11.3)
\]
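
As a small illustration of Eq. (11.3), here is a sketch that computes the importance-weighted estimate for a single round. The probability vector $P_t$, the sampled action and the observed reward are placeholder values chosen for the example, not from the text.

import numpy as np

def importance_weighted_estimate(P_t, A_t, X_t):
    # Eq. (11.3): hat X_{ti} = I{A_t = i} X_t / P_{ti}, so unplayed arms get estimate zero.
    X_hat = np.zeros(len(P_t))
    X_hat[A_t] = X_t / P_t[A_t]
    return X_hat

P_t = np.array([0.2, 0.5, 0.3])                          # conditional action distribution (assumed)
A_t = np.random.default_rng(1).choice(len(P_t), p=P_t)   # sample A_t from P_t
X_t = 0.8                                                # observed reward of the played arm (assumed)
print(importance_weighted_estimate(P_t, A_t, X_t))

Only the played arm contributes a nonzero estimate, and the division by $P_{tA_t}$ compensates for how rarely that arm is played.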


Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}]$ denote the conditional expectation given
the history up to time $t$. The conditional mean of $\hat{X}_{ti}$ satisfies
\[
\mathbb{E}_t[\hat{X}_{ti}] = x_{ti}\,, \qquad (11.4)
\]
which means that $\hat{X}_{ti}$ is an unbiased estimate of $x_{ti}$ conditioned on the history
observed after $t-1$ rounds. To see why Eq. (11.4) holds, let $A_{ti} = \mathbb{I}\{A_t = i\}$ so
that $X_t A_{ti} = x_{ti} A_{ti}$ and
\[
\hat{X}_{ti} = \frac{A_{ti}}{P_{ti}}\, x_{ti}\,.
\]

Now $\mathbb{E}_t[A_{ti}] = P_{ti}$ and since $P_{ti}$ is $\sigma(A_1, X_1, \ldots, A_{t-1}, X_{t-1})$-measurable,
\[
\mathbb{E}_t[\hat{X}_{ti}] = \mathbb{E}_t\left[\frac{A_{ti}}{P_{ti}}\, x_{ti}\right] = \frac{x_{ti}}{P_{ti}}\, \mathbb{E}_t[A_{ti}] = \frac{x_{ti}}{P_{ti}}\, P_{ti} = x_{ti}\,.
\]
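
To complement the calculation above, the following Monte Carlo sketch checks Eq. (11.4) numerically: averaging $\hat{X}_{ti}$ over many independent draws of $A_t \sim P_t$ recovers $x_{ti}$ for every arm. The reward vector and action distribution are again illustrative placeholders, not values from the text.

import numpy as np

rng = np.random.default_rng(2)
P_t = np.array([0.2, 0.5, 0.3])        # conditional action distribution (assumed)
x_t = np.array([0.1, 0.6, 0.9])        # rewards x_{ti} for this round (assumed)
runs = 100_000

estimates = np.zeros((runs, len(P_t)))
for r in range(runs):
    A_t = rng.choice(len(P_t), p=P_t)
    estimates[r, A_t] = x_t[A_t] / P_t[A_t]   # Eq. (11.3) with X_t = x_{t A_t}

print(estimates.mean(axis=0))          # approximately (0.1, 0.6, 0.9), as Eq. (11.4) predicts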