11.4 Regret analysis
[Figure 11.2: The approximation errors $\exp(x)-(1+x)$, $\exp(x)-(1+x+x^2)$ and $\exp(x)-(1+x+x^2/2)$ for $x \in [-1/2, 1/2]$.]
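As a quick numerical companion to the figure, the three remainder curves can be tabulated directly. This is a sketch; the grid resolution and tolerance are arbitrary choices, not from the text.

```python
import math

# Tabulate the three remainder curves from Figure 11.2 on [-1/2, 1/2].
xs = [-0.5 + i / 1000 for i in range(1001)]
r1 = [math.exp(x) - (1 + x) for x in xs]             # exp(x) - (1 + x)
r2 = [math.exp(x) - (1 + x + x**2) for x in xs]      # exp(x) - (1 + x + x^2)
r3 = [math.exp(x) - (1 + x + x**2 / 2) for x in xs]  # exp(x) - (1 + x + x^2/2)

# exp(x) >= 1 + x everywhere, so the first remainder is nonnegative.
assert min(r1) >= 0
# exp(x) <= 1 + x + x^2/2 for x <= 0: the inequality the proof relies on,
# since there it is applied with x = eta * (Xhat_tj - 1) <= 0.
assert all(r3[i] <= 1e-12 for i, x in enumerate(xs) if x <= 0)
```

The second curve, $\exp(x)-(1+x+x^2)$, is plotted in the figure for comparison but is not used in the proof.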
Theorem 11.2. Let $x \in [0,1]^{n \times k}$ be an adversarial bandit and $\pi$ be the policy of Exp3 with learning rate $\eta = \sqrt{2\log(k)/(nk)}$. Then
\[
R_n(\pi, x) \le \sqrt{2nk\log(k)}.
\]
Proof. By construction $\hat X_{tj} \le 1$. Therefore, using $\exp(x) \le 1 + x + x^2/2$ for $x \le 0$ (see Figure 11.2) with $x = \eta(\hat X_{tj} - 1)$,
\[
\exp\bigl(\eta \hat X_{tj}\bigr)
= \exp(\eta)\exp\bigl(\eta(\hat X_{tj} - 1)\bigr)
\le \exp(\eta)\left\{1 + \eta(\hat X_{tj} - 1) + \frac{\eta^2}{2}(\hat X_{tj} - 1)^2\right\}.
\]
Using the fact that $\sum_j P_{tj} = 1$ and the inequality $1 + x \le \exp(x)$, we get
\[
\frac{W_t}{W_{t-1}}
= \sum_{j=1}^k P_{tj}\exp\bigl(\eta \hat X_{tj}\bigr)
\le \exp\left(\eta \sum_{j=1}^k P_{tj}\hat X_{tj} + \frac{\eta^2}{2}\sum_{j=1}^k P_{tj}(\hat X_{tj} - 1)^2\right),
\]
where the equality is from Eq. (11.11). We see that here we need to bound $\sum_j P_{tj}(\hat X_{tj} - 1)^2$. Let $\hat Y_{tj} = 1 - \hat X_{tj}$. Then
\[
P_{tj}(\hat X_{tj} - 1)^2 = P_{tj}\hat Y_{tj}\hat Y_{tj} = \mathbb{I}\{A_t = j\}\,y_{tj}\,\hat Y_{tj} \le \hat Y_{tj},
\]
where the last inequality used $\hat Y_{tj} \ge 0$ and $y_{tj} \le 1$. Thus,
\[
\sum_{j=1}^k P_{tj}(\hat X_{tj} - 1)^2 \le \sum_{j=1}^k \hat Y_{tj}.
\]
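This variance bound is easy to probe numerically. The sketch below (random distributions, losses and sample count are arbitrary choices, not from the text) draws $A_t \sim P_t$, forms $\hat Y_{tj} = \mathbb{I}\{A_t = j\}\,y_{tj}/P_{tj}$ and checks $\sum_j P_{tj}\hat Y_{tj}^2 \le \sum_j \hat Y_{tj}$:

```python
import random

# Check  sum_j P_tj * (Xhat_tj - 1)^2 <= sum_j Yhat_tj  on random instances,
# where Yhat_tj = I{A_t = j} * y_tj / P_tj with losses y_tj in [0, 1].
random.seed(1)
k = 5
for _ in range(1000):
    w = [random.random() + 0.01 for _ in range(k)]  # random positive weights
    z = sum(w)
    p = [wj / z for wj in w]                         # sampling distribution P_t
    y = [random.random() for _ in range(k)]          # losses y_tj in [0, 1]
    a = random.choices(range(k), weights=p)[0]       # A_t ~ P_t
    y_hat = [y[j] / p[j] if j == a else 0.0 for j in range(k)]
    lhs = sum(p[j] * y_hat[j] ** 2 for j in range(k))  # sum_j P_tj (Xhat-1)^2
    rhs = sum(y_hat)                                   # sum_j Yhat_tj
    assert lhs <= rhs + 1e-12
```

Only the chosen arm contributes, so the check reduces to $y_{ta}^2/P_{ta} \le y_{ta}/P_{ta}$, which holds precisely because $y_{ta} \le 1$, mirroring the argument above.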
With the same calculations as before, we get
\[
\hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \sum_{j=1}^k \hat Y_{tj}. \tag{11.15}
\]
The result is completed by taking expectations of both sides, using
\[
\mathbb{E}\sum_{t,j}\hat Y_{tj} = \mathbb{E}\sum_{t,j}\mathbb{E}_t\hat Y_{tj} = \mathbb{E}\sum_{t,j} y_{tj} \le nk,
\]
and substituting the learning rate.
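For completeness, the final substitution can be written out as a short worked check, using only quantities defined above: with $\eta = \sqrt{2\log(k)/(nk)}$ the two terms of the bound are balanced,

```latex
\frac{\log(k)}{\eta} + \frac{\eta}{2}\, nk
= \log(k)\sqrt{\frac{nk}{2\log(k)}} + \frac{nk}{2}\sqrt{\frac{2\log(k)}{nk}}
= \sqrt{\frac{nk\log(k)}{2}} + \sqrt{\frac{nk\log(k)}{2}}
= \sqrt{2nk\log(k)},
```

which is the stated regret bound; this choice of $\eta$ in fact minimizes $\log(k)/\eta + \eta nk/2$ over $\eta > 0$.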