Bandit Algorithms

37.5 Policy for easy games

where $\hat Z_{tja}$ is an unbiased estimator of $y_{ta} - y_{tj}$ and $\beta_{tja}$ is a bias term:

$$
\hat Z_{tja} = \frac{\tilde P_{tj}\, v_{aj}(A_t, \Phi_t)}{P_{t A_t}}
\qquad \text{and} \qquad
\beta_{tja} = \eta V^2 \sum_{b \in N_{aj}} \frac{\tilde P_{tj}^2}{P_{tb}} \,.
\tag{37.12}
$$
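As a point of reference, here is a minimal Python sketch of how one round's estimator and bias term in (37.12) might be evaluated. The containers `P_tilde`, `P`, `v` and `N`, and the function name itself, are illustrative assumptions, not code from the book.

```python
def loss_difference_estimate(j, a, A_t, Phi_t, P_tilde, P, v, N, eta, V):
    """Illustrative computation of (37.12) for one pair (j, a) at round t.

    P_tilde : length-k sequence, the fixed-point distribution of Q_t
    P       : length-k sequence, the distribution A_t was sampled from
    v       : v[a][j] is the estimating function, called as v[a][j](A_t, Phi_t)
    N       : N[a][j] is the neighbourhood set N_{aj} (iterable of actions)
    """
    # Importance-weighted estimate of the loss difference y_{ta} - y_{tj}
    Z_hat = P_tilde[j] * v[a][j](A_t, Phi_t) / P[A_t]
    # Bias term beta_{tja} from (37.12)
    beta = eta * V**2 * sum(P_tilde[j]**2 / P[b] for b in N[a][j])
    return Z_hat, beta
```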


The four steps described so far are summarized in Algorithm 25 below.

1: Input: $L$, $\Phi$, $\eta$, $\gamma$
2: for $t \in 1, \ldots, n$ do
3:     For $a, j \in [k]$ let
$$
Q_{tja} = \frac{\mathbb{1}_{\mathcal{A}}(j)\, \mathbb{1}_{N_j \cap \mathcal{A}}(a) \exp\left(-\eta \sum_{s=1}^{t-1} \tilde Z_{sja}\right)}{\sum_{b \in N_j \cap \mathcal{A}} \exp\left(-\eta \sum_{s=1}^{t-1} \tilde Z_{sjb}\right)} + \frac{\mathbb{1}_{\mathcal{D}}(j)\, \mathbb{1}_{\mathcal{A}}(a)}{|\mathcal{A}|}
$$
4:     Find distribution $\tilde P_t$ such that $\tilde P_t^\top = \tilde P_t^\top Q_t$
5:     Compute $P_t = (1 - \gamma)\, \textsc{Redistribute}(\tilde P_t) + \gamma \mathbf{1}/k$ and sample $A_t \sim P_t$
6:     Compute loss-difference estimators for each $j \in \mathcal{A}$ and $a \in N_j \cap \mathcal{A}$:
$$
\hat Z_{tja} = \frac{\tilde P_{tj}\, v_{aj}(A_t, \Phi_t)}{P_{t A_t}} \,, \qquad
\beta_{tja} = \eta V^2 \sum_{b \in N_{aj}} \frac{\tilde P_{tj}^2}{P_{tb}}
\tag{37.13}
$$
$$
\tilde Z_{tja} = \hat Z_{tja} - \beta_{tja}
$$
7: end for
8: function Redistribute($p$)
9:     $q \leftarrow p$
10:    for $e \in \mathcal{D}$ do
11:        Find $a, b$ with $e \in N_{ab}$ and $\alpha \in [0, 1]$ such that $\ell_e = \alpha \ell_a + (1 - \alpha) \ell_b$
12:        $c_a \leftarrow \frac{\alpha q_b}{\alpha q_b + (1 - \alpha) q_a}$ and $c_b \leftarrow 1 - c_a$ and $\rho \leftarrow \frac{1}{2k} \min\left\{\frac{p_a}{q_a c_a}, \frac{p_b}{q_b c_b}\right\}$
13:        $q_e \leftarrow \rho c_a q_a + \rho c_b q_b$ and $q_a \leftarrow (1 - \rho c_a) q_a$ and $q_b \leftarrow (1 - \rho c_b) q_b$
14:    end for
15:    return $q$
16: end function
Algorithm 25: NeighborhoodWatch2.
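Below is a minimal Python sketch of the two non-standard steps of Algorithm 25: the fixed-point computation $\tilde P_t^\top = \tilde P_t^\top Q_t$ on line 4 and the Redistribute routine on lines 8-16. The function names, the `decompositions` lookup supplying $(a, b, \alpha)$ for each duplicate action $e$, and the eigenvector-based fixed-point computation are assumptions made for illustration, not the book's own code.

```python
import numpy as np

def fixed_point_distribution(Q):
    """Line 4: find P_tilde with P_tilde^T = P_tilde^T Q, i.e. a left
    eigenvector of Q for eigenvalue 1, normalised to a probability vector."""
    evals, evecs = np.linalg.eig(Q.T)
    p = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    p = np.abs(p)
    return p / p.sum()

def redistribute(p, D, decompositions, k):
    """Lines 8-16: shift a small amount of mass from the neighbours of each
    duplicate action onto the duplicate. `decompositions[e] = (a, b, alpha)`
    is assumed to give a, b with e in N_ab and alpha with
    l_e = alpha * l_a + (1 - alpha) * l_b."""
    q = np.array(p, dtype=float)
    for e in D:
        a, b, alpha = decompositions[e]
        c_a = alpha * q[b] / (alpha * q[b] + (1 - alpha) * q[a])
        c_b = 1.0 - c_a
        rho = min(p[a] / (q[a] * c_a), p[b] / (q[b] * c_b)) / (2 * k)
        q[e] = rho * c_a * q[a] + rho * c_b * q[b]   # uses the current q_a, q_b
        q[a] = (1 - rho * c_a) * q[a]
        q[b] = (1 - rho * c_b) * q[b]
    return q
```

Under these assumptions, line 5 would then read `P = (1 - gamma) * redistribute(P_tilde, D, decompositions, k) + gamma / k`, and $A_t$ is sampled from `P`.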

The next theorem bounds the regret of NeighborhoodWatch2 with high
probability for locally observable games.

Theorem 37.15. Let $\hat R_n$ be the random regret

$$
\hat R_n = \max_{b \in [k]} \sum_{t=1}^n \langle \ell_{A_t} - \ell_b, u_t \rangle \,.
$$
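As a small illustration (not from the book), the random regret above could be evaluated from the realised action and outcome sequences, writing $u_t$ as the standard basis vector of the observed outcome $i_t$ so that $\langle \ell_a, u_t \rangle = L_{a i_t}$; the function and argument names below are assumptions.

```python
import numpy as np

def random_regret(L, actions, outcomes):
    """R_hat_n = max_b sum_t <l_{A_t} - l_b, u_t>, with L the k x d loss matrix
    and `actions`, `outcomes` the realised sequences (A_t) and (i_t)."""
    L = np.asarray(L, dtype=float)
    incurred = L[actions, outcomes].sum()          # sum_t <l_{A_t}, u_t>
    best_fixed = L[:, outcomes].sum(axis=1).min()  # min_b sum_t <l_b, u_t>
    return incurred - best_fixed
```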