11.4 Regret analysis
1: Input: $n$, $k$, $\eta$
2: Set $\hat S_{0i} = 0$ for all $i$
3: for $t = 1, \dots, n$ do
4:   Calculate the sampling distribution $P_t$:
$$P_{ti} = \frac{\exp\bigl(\eta \hat S_{t-1,i}\bigr)}{\sum_{j=1}^k \exp\bigl(\eta \hat S_{t-1,j}\bigr)}$$
5:   Sample $A_t \sim P_t$ and observe reward $X_t$
6:   Calculate $\hat S_{ti}$:
$$\hat S_{ti} = \hat S_{t-1,i} + 1 - \frac{\mathbb{I}\{A_t = i\}(1 - X_t)}{P_{ti}}$$
7: end for

Algorithm 9: Exp3.
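Algorithm 9 can be sketched directly in code. The following is a minimal NumPy implementation of the pseudocode above; the `reward_fn` interface and the max-subtraction in the softmax (a standard numerical-stability trick) are my own additions, not part of the text.

```python
import numpy as np

def exp3(n, k, eta, reward_fn, rng=None):
    """Exp3 as in Algorithm 9, using the loss-based estimator.

    reward_fn(t, arm) -> reward in [0, 1]  (hypothetical interface).
    Returns the estimates S_hat after n rounds and the total observed reward.
    """
    rng = np.random.default_rng() if rng is None else rng
    S_hat = np.zeros(k)                      # line 2: S_hat_{0,i} = 0
    total_reward = 0.0
    for t in range(1, n + 1):                # line 3
        # line 4: sampling distribution P_t, a softmax of eta * S_hat
        z = eta * S_hat
        z -= z.max()                         # stabilise the exponentials
        P = np.exp(z) / np.exp(z).sum()
        # line 5: sample A_t ~ P_t and observe the reward X_t
        A = rng.choice(k, p=P)
        X = reward_fn(t, A)
        total_reward += X
        # line 6: S_hat_{t,i} = S_hat_{t-1,i} + 1 - I{A_t = i}(1 - X_t)/P_{t,i}
        S_hat += 1.0
        S_hat[A] -= (1.0 - X) / P[A]
    return S_hat, total_reward
```

Note that only the played arm's estimate is corrected by the importance weight; all other estimates optimistically gain the maximal reward of 1, which is exactly the loss-based form on line 6.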
Proof For any arm $i$ define
$$R_{ni} = \sum_{t=1}^n x_{ti} - \mathbb{E}\left[\sum_{t=1}^n X_t\right],$$
which is the expected regret relative to using action $i$ in all the rounds. The
result will follow by bounding $R_{ni}$ for all $i$, including the optimal arm. For the
remainder of the proof, let $i$ be some fixed arm. By the unbiasedness property of
the importance-weighted estimator $\hat X_{ti}$,
$$\mathbb{E}[\hat S_{ni}] = \sum_{t=1}^n x_{ti} \quad \text{and also} \quad \mathbb{E}_t[X_t] = \sum_{i=1}^k P_{ti} x_{ti} = \sum_{i=1}^k P_{ti} \mathbb{E}_t[\hat X_{ti}]. \tag{11.8}$$
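The unbiasedness used in Eq. (11.8) is easy to check numerically: drawing $A_t \sim P_t$ many times and averaging $\hat X_{ti} = 1 - \mathbb{I}\{A_t = i\}(1 - x_{ti})/P_{ti}$ recovers $x_{ti}$ for every arm. The particular values of $P$ and $x$ below are illustrative, not from the text.

```python
import numpy as np

# Monte Carlo check that E_t[X_hat_{t,i}] = x_{t,i} for the estimator
# X_hat_{t,i} = 1 - I{A_t = i}(1 - x_{t,i}) / P_{t,i}.
rng = np.random.default_rng(1)
P = np.array([0.5, 0.3, 0.2])      # sampling distribution P_t (illustrative)
x = np.array([0.9, 0.4, 0.1])      # true (unobserved) rewards x_{t,i}

m = 200_000
A = rng.choice(3, size=m, p=P)     # repeated draws A_t ~ P_t
# Evaluate the estimator for every arm i on every draw (shape (3, m)).
hit = (A[None, :] == np.arange(3)[:, None])
X_hat = 1.0 - hit * (1.0 - x[:, None]) / P[:, None]
est = X_hat.mean(axis=1)           # empirical E_t[X_hat_{t,i}], close to x
```

The estimator is unbiased for every arm simultaneously, not just the one played, which is what lets Eq. (11.8) hold for a fixed $i$ chosen in advance.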
The tower rule says that for any random variable $X$, $\mathbb{E}[\mathbb{E}_t[X]] = \mathbb{E}[X]$, which
together with the linearity of expectation and Eq. (11.8) means that
$$R_{ni} = \mathbb{E}\bigl[\hat S_{ni}\bigr] - \mathbb{E}\left[\sum_{t=1}^n \sum_{i=1}^k P_{ti} \hat X_{ti}\right] = \mathbb{E}\bigl[\hat S_{ni} - \hat S_n\bigr], \tag{11.9}$$
where the last equality serves as the definition of $\hat S_n = \sum_t \sum_i P_{ti} \hat X_{ti}$. To bound
the right-hand side of Eq. (11.9) let
$$W_t = \sum_{j=1}^k \exp\bigl(\eta \hat S_{tj}\bigr).$$
By convention an empty sum is zero, which means that $\hat S_{0j} = 0$ and $W_0 = k$.
Then
$$\exp\bigl(\eta \hat S_{ni}\bigr) \le \sum_{j=1}^k \exp\bigl(\eta \hat S_{nj}\bigr) = W_n = W_0 \cdot \frac{W_1}{W_0} \cdots \frac{W_n}{W_{n-1}} = k \prod_{t=1}^n \frac{W_t}{W_{t-1}}. \tag{11.10}$$
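The telescoping step in Eq. (11.10) is pure algebra and holds for any trajectory of the estimates, which a quick numeric check confirms. The random $\hat S$ trajectory below is arbitrary, chosen only for illustration.

```python
import numpy as np

# Sanity check of the telescoping identity in Eq. (11.10):
# W_n = W_0 * (W_1/W_0) * ... * (W_n/W_{n-1}) = k * prod_t W_t / W_{t-1},
# where W_t = sum_j exp(eta * S_hat_{t,j}) and W_0 = k since S_hat_{0,j} = 0.
rng = np.random.default_rng(2)
k, n, eta = 4, 10, 0.3
S_hat = np.cumsum(rng.uniform(-1, 1, size=(n, k)), axis=0)  # arbitrary trajectory

W = [float(k)]                                  # W_0 = k
for t in range(n):
    W.append(np.exp(eta * S_hat[t]).sum())      # W_t for t = 1, ..., n

lhs = W[-1]
rhs = k * np.prod([W[t] / W[t - 1] for t in range(1, n + 1)])
# lhs and rhs agree up to floating-point rounding
```

Each factor $W_t/W_{t-1}$ is what the rest of the proof bounds round by round, which is why writing $W_n$ as this product is useful.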