Bandit Algorithms


of the maximum function we have

\[
R_n(\pi,\nu) = \max_{i\in[k]} \mathbb{E}\left[\sum_{t=1}^{n} (X_{ti} - X_{tA_t})\right] \le \mathbb{E}\left[\max_{i\in[k]} \sum_{t=1}^{n} (X_{ti} - X_{tA_t})\right]
= \mathbb{E}[R_n(\pi,X)] \le R_n^*(\pi)\,, \qquad (11.2)
\]
where the regret in the first line is the stochastic regret and in the last it is the
adversarial regret. Therefore the worst-case stochastic regret is upper bounded
by the worst-case adversarial regret. Going the other way, the above inequality
also implies the worst-case regret for adversarial problems is lower bounded by
the worst-case regret on stochastic problems with rewards bounded in [0,1]. In
Chapter 15 we prove the worst-case regret for stochastic bandits is at least
$c\sqrt{nk}$, where $c > 0$ is a universal constant. And so for the same universal
constant the minimax regret for adversarial bandits satisfies
\[
R_n^* = \inf_{\pi} \sup_{x\in[0,1]^{n\times k}} R_n(\pi,x) \ge c\sqrt{nk}\,.
\]
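
As a quick sanity check of the inequality chain in Eq. (11.2), the following is a minimal Monte Carlo sketch. It assumes Bernoulli rewards with means $(0.3, 0.5, 0.7)$, a policy that plays arms uniformly at random and $n = 100$ rounds, all illustrative choices not taken from the text. The first printed number estimates $\max_i \mathbb{E}[\sum_t (X_{ti} - X_{tA_t})]$ and should not exceed the second, which estimates $\mathbb{E}[\max_i \sum_t (X_{ti} - X_{tA_t})]$.

import numpy as np

rng = np.random.default_rng(0)
n, k, runs = 100, 3, 2000
mu = np.array([0.3, 0.5, 0.7])                   # per-arm means (illustrative)

stoch_sums = np.zeros((runs, k))                 # per-run values of sum_t (X_ti - X_{t A_t})
adv_max = np.zeros(runs)                         # per-run values of max_i sum_t (X_ti - X_{t A_t})
for r in range(runs):
    X = (rng.random((n, k)) < mu).astype(float)  # full reward table X_{ti}
    A = rng.integers(0, k, size=n)               # uniformly random policy (illustrative)
    sums = (X - X[np.arange(n), A][:, None]).sum(axis=0)
    stoch_sums[r] = sums
    adv_max[r] = sums.max()

print(stoch_sums.mean(axis=0).max())             # estimates the stochastic regret in (11.2)
print(adv_max.mean())                            # estimates E[R_n(pi, X)]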

11.2 Importance-weighted estimators


A key ingredient of all adversarial bandit algorithms is a mechanism for estimating
the reward of unplayed arms. Recall that $P_t$ is the conditional distribution of the
action played in round $t$ and so $P_{ti}$ is the conditional probability
\[
P_{ti} = \mathbb{P}(A_t = i \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})\,.
\]
In what follows we assume that $P_{ti} > 0$ almost surely, which is true for all policies
considered in this chapter. The importance-weighted estimator of $x_{ti}$ is

\[
\hat{X}_{ti} = \frac{\mathbb{I}\{A_t = i\}\, X_t}{P_{ti}}\,. \qquad (11.3)
\]
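
As a small illustration of Eq. (11.3), here is a sketch that computes the importance-weighted estimate for a single round. The probability vector $P_t$, the sampled action and the observed reward are placeholder values chosen for the example, not from the text.

import numpy as np

def importance_weighted_estimate(P_t, A_t, X_t):
    # Eq. (11.3): hat X_{ti} = I{A_t = i} X_t / P_{ti}, so unplayed arms get estimate zero.
    X_hat = np.zeros(len(P_t))
    X_hat[A_t] = X_t / P_t[A_t]
    return X_hat

P_t = np.array([0.2, 0.5, 0.3])                          # conditional action distribution (assumed)
A_t = np.random.default_rng(1).choice(len(P_t), p=P_t)   # sample A_t from P_t
X_t = 0.8                                                # observed reward of the played arm (assumed)
print(importance_weighted_estimate(P_t, A_t, X_t))

Only the played arm contributes a nonzero estimate, and the division by $P_{tA_t}$ compensates for how rarely that arm is played.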


Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}]$ denote the conditional expectation given
the history up to time $t$. The conditional mean of $\hat{X}_{ti}$ satisfies
\[
\mathbb{E}_t[\hat{X}_{ti}] = x_{ti}\,, \qquad (11.4)
\]
which means that $\hat{X}_{ti}$ is an unbiased estimate of $x_{ti}$ conditioned on the history
observed after $t-1$ rounds. To see why Eq. (11.4) holds, let $A_{ti} = \mathbb{I}\{A_t = i\}$ so
that $X_t A_{ti} = x_{ti} A_{ti}$ and
\[
\hat{X}_{ti} = \frac{A_{ti}}{P_{ti}}\, x_{ti}\,.
\]

Now $\mathbb{E}_t[A_{ti}] = P_{ti}$ and since $P_{ti}$ is $\sigma(A_1, X_1, \ldots, A_{t-1}, X_{t-1})$-measurable,
\[
\mathbb{E}_t[\hat{X}_{ti}] = \mathbb{E}_t\left[\frac{A_{ti}}{P_{ti}}\, x_{ti}\right] = \frac{x_{ti}}{P_{ti}}\, \mathbb{E}_t[A_{ti}] = \frac{x_{ti}}{P_{ti}}\, P_{ti} = x_{ti}\,.
\]
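
To complement the calculation above, the following Monte Carlo sketch checks Eq. (11.4) numerically: averaging $\hat{X}_{ti}$ over many independent draws of $A_t \sim P_t$ recovers $x_{ti}$ for every arm. The reward vector and action distribution are again illustrative placeholders, not values from the text.

import numpy as np

rng = np.random.default_rng(2)
P_t = np.array([0.2, 0.5, 0.3])        # conditional action distribution (assumed)
x_t = np.array([0.1, 0.6, 0.9])        # rewards x_{ti} for this round (assumed)
runs = 100_000

estimates = np.zeros((runs, len(P_t)))
for r in range(runs):
    A_t = rng.choice(len(P_t), p=P_t)
    estimates[r, A_t] = x_t[A_t] / P_t[A_t]   # Eq. (11.3) with X_t = x_{t A_t}

print(estimates.mean(axis=0))          # approximately (0.1, 0.6, 0.9), as Eq. (11.4) predicts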