11.4 Regret analysis 151− 0. 5 0 0. 5− 0. 100. 1xexp(x)−(1 +x)
exp(x)−(1 +x+x^2 )
exp(x)−(1 +x+x^2 /2)Figure 11.2Approximations for exp(x) on [− 1 / 2 , 1 /2].Theorem11.2.Letx∈[0,1]n×kbe an adversarial bandit andπbe the policy of
Exp3 with learning rateη=
√
2 log(k)/(nk). Then
Rn(π,x)≤√
2 nklog(k).Proof By constructionXˆtj≤1. Thereforeexp(
ηXˆtj)
= exp(η) exp(
η(Xˆtj−1))
≤exp(η){
1 +η(Xˆtj−1) +η2
2(Xˆtj−1)^2}
.
Using the fact that∑
jPtj= 1 and the inequality 1 +x≤exp(x) we getWt
Wt− 1 =∑kj=1Ptjexp(ηXˆtj)≤exp
η∑kj=1PtjXˆtj+η^2
2∑kj=1Ptj(Xˆtj−1)^2
,
where the equality is from Eq. (11.11). We see that here we need to bound∑
jPtj(Xˆtj−1)^2. LetYˆtj= 1−Xˆtj. Then
Ptj(Xˆtj−1)^2 =PtjYˆtjYˆtj=I{At=j}ytjYˆtj≤Yˆtj,where the last inequality usedYˆtj≥0 andytj≤1. Thus,
∑kj=1Ptj(Xˆtj−1)^2 ≤∑kj=1Yˆtj.With the same calculations as before, we get
Sˆni−Sˆn≤log(k)
η+η
2∑nt=1∑kj=1Yˆtj. (11.15)The result is completed by taking expectations of both sides, usingE
∑
t,jYˆtj=
E∑
t,jEtYˆtj=E∑
t,jytj≤nkand substituting the learning rate.