of the maximum function we have
\[
R_n(\pi,\nu)
= \max_{i \in [k]} E\left[\sum_{t=1}^n \left(X_{ti} - X_{t A_t}\right)\right]
\le E\left[\max_{i \in [k]} \sum_{t=1}^n \left(X_{ti} - X_{t A_t}\right)\right]
= E\left[R_n(\pi, X)\right] \le R_n^*(\pi)\,, \tag{11.2}
\]
where the regret in the first line is the stochastic regret and in the last it is the
adversarial regret. Therefore the worst-case stochastic regret is upper bounded
by the worst-case adversarial regret. Going the other way, the above inequality
also implies the worst-case regret for adversarial problems is lower bounded by
the worst-case regret on stochastic problems with rewards bounded in [0,1]. In
Chapter 15 we prove the worst-case regret for stochastic bandits is at least
$c\sqrt{nk}$, where $c > 0$ is a universal constant. And so for the same universal
constant the minimax regret for adversarial bandits satisfies
\[
R_n^* = \inf_\pi \sup_{x \in [0,1]^{n \times k}} R_n(\pi, x) \ge c\sqrt{nk}\,.
\]
11.2 Importance-weighted estimators
A key ingredient of all adversarial bandit algorithms is a mechanism for estimating
the reward of unplayed arms. Recall that $P_t$ is the conditional distribution of the
action played in round $t$, and so $P_{ti}$ is the conditional probability
\[
P_{ti} = P(A_t = i \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})\,.
\]
In what follows we assume that $P_{ti} > 0$ almost surely, which is true for all policies
considered in this chapter. The importance-weighted estimator of $x_{ti}$ is
\[
\hat{X}_{ti} = \frac{\mathbb{I}\{A_t = i\}\, X_t}{P_{ti}}\,. \tag{11.3}
\]
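To make the definition concrete, the following is a minimal Python sketch (an illustration, not from the text) of how one round of importance weighting might be computed, assuming the policy exposes its sampling distribution $P_t$ as a probability vector `p_t` and the adversary's rewards for the round are stored in `x_t`:

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_weighted_estimates(p_t, x_t):
    """One round of importance weighting, following Eq. (11.3).

    p_t : length-k probability vector, the policy's conditional distribution P_t
          (all entries assumed positive).
    x_t : length-k vector of the adversary's rewards for this round; only the
          entry of the played arm is actually observed.
    """
    k = len(p_t)
    a_t = rng.choice(k, p=p_t)        # sample A_t ~ P_t
    x_obs = x_t[a_t]                  # observed reward X_t = x_{t, A_t}
    x_hat = np.zeros(k)               # hat{X}_{ti} = 0 for all unplayed arms i
    x_hat[a_t] = x_obs / p_t[a_t]     # hat{X}_{t, A_t} = X_t / P_{t, A_t}
    return a_t, x_obs, x_hat
```

The estimate is zero for every unplayed arm, while the played arm's observed reward is inflated by the factor $1/P_{t A_t}$; this is what makes the estimator unbiased, at the cost of high variance whenever $P_{ti}$ is small.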
Let $E_t[\cdot] = E[\,\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}]$ denote the conditional expectation given
the history up to time $t$. The conditional mean of $\hat{X}_{ti}$ satisfies
\[
E_t[\hat{X}_{ti}] = x_{ti}\,, \tag{11.4}
\]
which means that $\hat{X}_{ti}$ is an unbiased estimate of $x_{ti}$ conditioned on the history
observed after $t-1$ rounds. To see why Eq. (11.4) holds, let $A_{ti} = \mathbb{I}\{A_t = i\}$ so
that $X_t A_{ti} = x_{ti} A_{ti}$ and
\[
\hat{X}_{ti} = \frac{A_{ti}}{P_{ti}}\, x_{ti}\,.
\]
Now $E_t[A_{ti}] = P_{ti}$ and since $P_{ti}$ is $\sigma(A_1, X_1, \ldots, A_{t-1}, X_{t-1})$-measurable,
\[
E_t[\hat{X}_{ti}] = E_t\left[\frac{A_{ti}}{P_{ti}}\, x_{ti}\right]
= \frac{x_{ti}}{P_{ti}}\, E_t[A_{ti}]
= \frac{x_{ti}}{P_{ti}}\, P_{ti} = x_{ti}\,.
\]
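The same calculation can be checked empirically. The snippet below (again an illustration, not from the text, with made-up values for $P_t$ and $x_t$) averages the importance-weighted estimates over many independent draws of $A_t$ and compares the result with the true reward vector:

```python
import numpy as np

rng = np.random.default_rng(1)
p_t = np.array([0.5, 0.3, 0.2])   # hypothetical sampling distribution P_t
x_t = np.array([0.9, 0.4, 0.7])   # hypothetical adversarial rewards x_{t,1}, ..., x_{t,k}

n_samples = 200_000
x_hat_sum = np.zeros(len(p_t))
for _ in range(n_samples):
    a = rng.choice(len(p_t), p=p_t)     # A_t ~ P_t
    x_hat_sum[a] += x_t[a] / p_t[a]     # only the played arm gets a non-zero estimate

print(x_hat_sum / n_samples)  # close to x_t, as predicted by Eq. (11.4)
```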