11.4 Regret analysis
1: Input: $n$, $k$, $\eta$
2: Set $\hat S_{0i} = 0$ for all $i$
3: for $t = 1, \dots, n$ do
4:   Calculate the sampling distribution $P_t$:
$$P_{ti} = \frac{\exp\bigl(\eta \hat S_{t-1,i}\bigr)}{\sum_{j=1}^k \exp\bigl(\eta \hat S_{t-1,j}\bigr)}$$
5:   Sample $A_t \sim P_t$ and observe reward $X_t$
6:   Calculate $\hat S_{ti}$:
$$\hat S_{ti} = \hat S_{t-1,i} + 1 - \frac{\mathbb{I}\{A_t = i\}(1 - X_t)}{P_{ti}}$$
7: end for

Algorithm 9: Exp3.
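Algorithm 9 can be sketched directly in code. The following is a minimal NumPy implementation of the pseudocode above; the `reward_fn` interface and the max-subtraction in the softmax (a standard numerical-stability trick) are my own additions, not part of the text.

```python
import numpy as np

def exp3(n, k, eta, reward_fn, rng=None):
    """Exp3 as in Algorithm 9, using the loss-based estimator.

    reward_fn(t, arm) -> reward in [0, 1]  (hypothetical interface).
    Returns the estimates S_hat after n rounds and the total observed reward.
    """
    rng = np.random.default_rng() if rng is None else rng
    S_hat = np.zeros(k)                      # line 2: S_hat_{0,i} = 0
    total_reward = 0.0
    for t in range(1, n + 1):                # line 3
        # line 4: sampling distribution P_t, a softmax of eta * S_hat
        z = eta * S_hat
        z -= z.max()                         # stabilise the exponentials
        P = np.exp(z) / np.exp(z).sum()
        # line 5: sample A_t ~ P_t and observe the reward X_t
        A = rng.choice(k, p=P)
        X = reward_fn(t, A)
        total_reward += X
        # line 6: S_hat_{t,i} = S_hat_{t-1,i} + 1 - I{A_t = i}(1 - X_t)/P_{t,i}
        S_hat += 1.0
        S_hat[A] -= (1.0 - X) / P[A]
    return S_hat, total_reward
```

Note that only the played arm's estimate is corrected by the importance weight; all other estimates optimistically gain the maximal reward of 1, which is exactly the loss-based form on line 6.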
Proof For any arm $i$ define
$$R_{ni} = \sum_{t=1}^n x_{ti} - \mathbb{E}\left[\sum_{t=1}^n X_t\right],$$
which is the expected regret relative to using action $i$ in all the rounds. The
result will follow by bounding $R_{ni}$ for all $i$, including the optimal arm. For the
remainder of the proof, let $i$ be some fixed arm. By the unbiasedness property of
the importance-weighted estimator $\hat X_{ti}$,
$$\mathbb{E}[\hat S_{ni}] = \sum_{t=1}^n x_{ti} \quad \text{and also} \quad \mathbb{E}_t[X_t] = \sum_{i=1}^k P_{ti} x_{ti} = \sum_{i=1}^k P_{ti} \mathbb{E}_t[\hat X_{ti}]. \tag{11.8}$$
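The unbiasedness used in Eq. (11.8) is easy to check numerically: drawing $A_t \sim P_t$ many times and averaging $\hat X_{ti} = 1 - \mathbb{I}\{A_t = i\}(1 - x_{ti})/P_{ti}$ recovers $x_{ti}$ for every arm. The particular values of $P$ and $x$ below are illustrative, not from the text.

```python
import numpy as np

# Monte Carlo check that E_t[X_hat_{t,i}] = x_{t,i} for the estimator
# X_hat_{t,i} = 1 - I{A_t = i}(1 - x_{t,i}) / P_{t,i}.
rng = np.random.default_rng(1)
P = np.array([0.5, 0.3, 0.2])      # sampling distribution P_t (illustrative)
x = np.array([0.9, 0.4, 0.1])      # true (unobserved) rewards x_{t,i}

m = 200_000
A = rng.choice(3, size=m, p=P)     # repeated draws A_t ~ P_t
# Evaluate the estimator for every arm i on every draw (shape (3, m)).
hit = (A[None, :] == np.arange(3)[:, None])
X_hat = 1.0 - hit * (1.0 - x[:, None]) / P[:, None]
est = X_hat.mean(axis=1)           # empirical E_t[X_hat_{t,i}], close to x
```

The estimator is unbiased for every arm simultaneously, not just the one played, which is what lets Eq. (11.8) hold for a fixed $i$ chosen in advance.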
The tower rule says that for any random variable $X$, $\mathbb{E}[\mathbb{E}_t[X]] = \mathbb{E}[X]$, which
together with the linearity of expectation and Eq. (11.8) means that
$$R_{ni} = \mathbb{E}\bigl[\hat S_{ni}\bigr] - \mathbb{E}\left[\sum_{t=1}^n \sum_{i=1}^k P_{ti} \hat X_{ti}\right] = \mathbb{E}\bigl[\hat S_{ni} - \hat S_n\bigr], \tag{11.9}$$
where the last equality serves as the definition of $\hat S_n = \sum_t \sum_i P_{ti} \hat X_{ti}$. To bound
the right-hand side of Eq. (11.9) let
$$W_t = \sum_{j=1}^k \exp\bigl(\eta \hat S_{tj}\bigr).$$
By convention an empty sum is zero, which means that $\hat S_{0j} = 0$ and $W_0 = k$.
Then
$$\exp\bigl(\eta \hat S_{ni}\bigr) \le \sum_{j=1}^k \exp\bigl(\eta \hat S_{nj}\bigr) = W_n = W_0 \cdot \frac{W_1}{W_0} \cdots \frac{W_n}{W_{n-1}} = k \prod_{t=1}^n \frac{W_t}{W_{t-1}}. \tag{11.10}$$
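The telescoping step in Eq. (11.10) is pure algebra and holds for any trajectory of the estimates, which a quick numeric check confirms. The random $\hat S$ trajectory below is arbitrary, chosen only for illustration.

```python
import numpy as np

# Sanity check of the telescoping identity in Eq. (11.10):
# W_n = W_0 * (W_1/W_0) * ... * (W_n/W_{n-1}) = k * prod_t W_t / W_{t-1},
# where W_t = sum_j exp(eta * S_hat_{t,j}) and W_0 = k since S_hat_{0,j} = 0.
rng = np.random.default_rng(2)
k, n, eta = 4, 10, 0.3
S_hat = np.cumsum(rng.uniform(-1, 1, size=(n, k)), axis=0)  # arbitrary trajectory

W = [float(k)]                                  # W_0 = k
for t in range(n):
    W.append(np.exp(eta * S_hat[t]).sum())      # W_t for t = 1, ..., n

lhs = W[-1]
rhs = k * np.prod([W[t] / W[t - 1] for t in range(1, n + 1)])
# lhs and rhs agree up to floating-point rounding
```

Each factor $W_t/W_{t-1}$ is what the rest of the proof bounds round by round, which is why writing $W_n$ as this product is useful.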