11.4 Regret analysis

1: Input: $n$, $k$, $\eta$
2: Set $\hat S_{0,i} = 0$ for all $i$
3: for $t = 1, \ldots, n$ do
4:     Calculate the sampling distribution $P_t$:
       $$P_{ti} = \frac{\exp\big(\eta \hat S_{t-1,i}\big)}{\sum_{j=1}^{k} \exp\big(\eta \hat S_{t-1,j}\big)}$$
5:     Sample $A_t \sim P_t$ and observe reward $X_t$
6:     Calculate $\hat S_{ti}$:
       $$\hat S_{ti} = \hat S_{t-1,i} + 1 - \frac{\mathbb{I}\{A_t = i\}\,(1 - X_t)}{P_{ti}}$$
7: end for

Algorithm 9: Exp3.
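The pseudocode maps almost line for line onto a short program. The following is a minimal Python sketch, not code from the book: the function name `exp3`, the environment callback `reward_fn` (assumed to return a reward $X_t \in [0, 1]$), and the max-subtraction inside the softmax (a standard numerical-stability trick) are all choices of this illustration.

```python
import numpy as np

def exp3(n, k, eta, reward_fn, rng=None):
    """Sketch of Algorithm 9 (Exp3).

    reward_fn(t, arm) stands in for the environment and must return
    the reward X_t in [0, 1] for the chosen arm in round t.
    """
    rng = np.random.default_rng() if rng is None else rng
    S_hat = np.zeros(k)  # line 2: S^_{0,i} = 0 for all i
    for t in range(n):
        # Line 4: P_t is the softmax of eta * S^_{t-1}; subtracting
        # the max before exponentiating avoids overflow.
        z = eta * S_hat
        p = np.exp(z - z.max())
        p /= p.sum()
        # Line 5: sample A_t ~ P_t and observe the reward X_t.
        a = rng.choice(k, p=p)
        x = reward_fn(t, a)
        # Line 6: S^_{ti} = S^_{t-1,i} + 1 - I{A_t=i}(1 - X_t)/P_{ti}.
        S_hat += 1.0
        S_hat[a] -= (1.0 - x) / p[a]
    return S_hat

# Example usage with hypothetical Bernoulli arms.
means = np.array([0.3, 0.5, 0.7])
rng = np.random.default_rng(0)
print(exp3(n=10_000, k=3, eta=0.01,
           reward_fn=lambda t, a: rng.binomial(1, means[a]), rng=rng))
```

Sampling from the softmax of the running estimates $\hat S_{t-1,i}$ is what balances exploration and exploitation: arms with larger estimated cumulative reward are played more often, but every arm always retains positive probability.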

Proof For any arm $i$ define
$$R_{ni} = \sum_{t=1}^{n} x_{ti} - \mathbb{E}\left[\sum_{t=1}^{n} X_t\right],$$


which is the expected regret relative to using action $i$ in all the rounds. The result will follow by bounding $R_{ni}$ for all $i$, including the optimal arm. For the remainder of the proof, let $i$ be some fixed arm. By the unbiasedness property of the importance-weighted estimator $\hat X_{ti} = 1 - \mathbb{I}\{A_t = i\}(1 - X_t)/P_{ti}$, whose running sum is the $\hat S_{ti}$ updated in line 6 of Algorithm 9,


$$\mathbb{E}\big[\hat S_{ni}\big] = \sum_{t=1}^{n} x_{ti} \quad \text{and also} \quad \mathbb{E}_t[X_t] = \sum_{i=1}^{k} P_{ti}\, x_{ti} = \sum_{i=1}^{k} P_{ti}\, \mathbb{E}_t\big[\hat X_{ti}\big]. \tag{11.8}$$
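The first identity in Eq. (11.8) follows by a one-line computation. Conditionally on the history up to the start of round $t$, the distribution $P_t$ is fixed and $\mathbb{E}_t[\mathbb{I}\{A_t = i\}] = P_{ti}$; moreover, on the event $\{A_t = i\}$ the observed reward $X_t$ equals $x_{ti}$, so

$$\mathbb{E}_t\big[\hat X_{ti}\big] = 1 - \frac{\mathbb{E}_t[\mathbb{I}\{A_t = i\}]\,(1 - x_{ti})}{P_{ti}} = 1 - (1 - x_{ti}) = x_{ti},$$

and summing over $t = 1, \ldots, n$ and applying the tower rule gives $\mathbb{E}[\hat S_{ni}] = \sum_{t=1}^{n} x_{ti}$.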

The tower rule says that for any random variable $X$, $\mathbb{E}[\mathbb{E}_t[X]] = \mathbb{E}[X]$, which together with the linearity of expectation and Eq. (11.8) means that


$$R_{ni} = \mathbb{E}\big[\hat S_{ni}\big] - \mathbb{E}\left[\sum_{t=1}^{n} \sum_{i=1}^{k} P_{ti}\, \hat X_{ti}\right] = \mathbb{E}\big[\hat S_{ni} - \hat S_n\big], \tag{11.9}$$


where the last equality serves as the definition of $\hat S_n = \sum_t \sum_i P_{ti} \hat X_{ti}$. To bound the right-hand side of Eq. (11.9), let

$$W_t = \sum_{j=1}^{k} \exp\big(\eta \hat S_{tj}\big).$$


By convention an empty sum is zero, which means that $\hat S_{0j} = 0$ and $W_0 = k$. Then


$$\exp\big(\eta \hat S_{ni}\big) \le \sum_{j=1}^{k} \exp\big(\eta \hat S_{nj}\big) = W_n = W_0 \cdot \frac{W_1}{W_0} \cdots \frac{W_n}{W_{n-1}} = k \prod_{t=1}^{n} \frac{W_t}{W_{t-1}}. \tag{11.10}$$
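As a quick numerical sanity check of Eq. (11.8) and Eq. (11.9), the following Monte Carlo sketch (an illustration under an assumed fixed reward table `x`, not code from the book) simulates Exp3 and compares the empirical means of $\hat S_{ni}$ and $\hat S_n$ with $\sum_t x_{ti}$ and the empirical mean of $\sum_t X_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, eta, runs = 200, 3, 0.1, 1000
x = rng.uniform(size=(n, k))  # adversary's fixed rewards in [0, 1]

S_ni = np.zeros((runs, k))  # samples of S^_{n,i}
S_n = np.zeros(runs)        # samples of S^_n = sum_t sum_i P_ti * X^_ti
total_X = np.zeros(runs)    # samples of sum_t X_t

for r in range(runs):
    S_hat = np.zeros(k)
    for t in range(n):
        z = eta * S_hat
        p = np.exp(z - z.max())
        p /= p.sum()
        a = rng.choice(k, p=p)
        x_obs = x[t, a]
        # Importance-weighted estimates X^_{ti} for all arms i.
        x_hat = 1.0 - (np.arange(k) == a) * (1.0 - x_obs) / p
        S_n[r] += p @ x_hat
        S_hat += x_hat
        total_X[r] += x_obs
    S_ni[r] = S_hat

# Both gaps are zero in expectation; the printed values reflect
# Monte Carlo error only.
print(np.abs(S_ni.mean(axis=0) - x.sum(axis=0)).max())  # E[S^_ni] vs sum_t x_ti
print(abs(S_n.mean() - total_X.mean()))                 # E[S^_n] vs E[sum_t X_t]
```

Both printed gaps shrink at the usual $1/\sqrt{\text{runs}}$ Monte Carlo rate, consistent with the unbiasedness used in the proof.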