where $\hat Z_{tja}$ is an unbiased estimator of $y_{ta} - y_{tj}$ and $\beta_{tja}$ is a bias term:
$$\hat Z_{tja} = \frac{\tilde P_{tj}\, v_{aj}(A_t, \Phi_t)}{P_{tA_t}} \quad\text{and}\quad \beta_{tja} = \eta V^2 \sum_{b \in N_{aj}} \frac{\tilde P_{tj}^2}{P_{tb}}\,. \tag{37.12}$$
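For concreteness, the following is a minimal sketch of how $\hat Z_{tja}$, $\beta_{tja}$ and the resulting $\tilde Z_{tja}$ might be computed for a single pair $(j, a)$. The names `P_tilde`, `P`, `v` and `N_aj` are illustrative stand-ins for $\tilde P_t$, $P_t$, $v_{aj}$ and $N_{aj}$ and are not prescribed by the text.

```python
def loss_difference_estimate(j, a, A_t, Phi_t, P_tilde, P, v, N_aj, eta, V):
    """Sketch of (37.12): estimator of y_{ta} - y_{tj} and its bias term.

    P_tilde, P : length-k sequences holding the distributions tilde-P_t and P_t
    v          : v(a, j, A_t, Phi_t) evaluates the estimation function v_aj,
                 assumed bounded in absolute value by V
    N_aj       : iterable of actions in the neighbourhood N_aj
    """
    Z_hat = P_tilde[j] * v(a, j, A_t, Phi_t) / P[A_t]            # importance-weighted estimator
    beta = eta * V**2 * sum(P_tilde[j]**2 / P[b] for b in N_aj)  # bias term from (37.12)
    return Z_hat - beta                                          # tilde-Z_{tja} fed to the exponential weights
```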
The four steps described so far are summarized in Algorithm 25 below.

1: Input $\mathcal L$, $\Phi$, $\eta$, $\gamma$
2: for $t \in 1, \ldots, n$ do
3:   For $a, j \in [k]$ let
$$Q_{tja} = \mathbb I_{\mathcal A}(j)\,\frac{\mathbb I_{N_j \cap \mathcal A}(a) \exp\!\left(-\eta \sum_{s=1}^{t-1} \tilde Z_{sja}\right)}{\sum_{b \in N_j \cap \mathcal A} \exp\!\left(-\eta \sum_{s=1}^{t-1} \tilde Z_{sjb}\right)} + \frac{\mathbb I_{\mathcal D}(j)\,\mathbb I_{\mathcal A}(a)}{|\mathcal A|}$$
4:   Find a distribution $\tilde P_t$ such that $\tilde P_t^\top = \tilde P_t^\top Q_t$
5:   Compute $P_t = (1 - \gamma)\,\textsc{Redistribute}(\tilde P_t) + \frac{\gamma}{k}\mathbf 1$ and sample $A_t \sim P_t$
6:   Compute loss-difference estimators for each $j \in \mathcal A$ and $a \in N_j \cap \mathcal A$:
$$\hat Z_{tja} = \frac{\tilde P_{tj}\, v_{aj}(A_t, \Phi_t)}{P_{tA_t}}\,, \qquad \beta_{tja} = \eta V^2 \sum_{b \in N_{aj}} \frac{\tilde P_{tj}^2}{P_{tb}}\,, \tag{37.13}$$
$$\tilde Z_{tja} = \hat Z_{tja} - \beta_{tja}$$
7: end for
8: function Redistribute($p$)
9:   $q \leftarrow p$
10:  for $e \in \mathcal D$ do
11:    Find $a, b$ with $e \in N_{ab}$ and $\alpha \in [0, 1]$ such that $\ell_e = \alpha \ell_a + (1 - \alpha) \ell_b$
12:    $c_a \leftarrow \frac{\alpha q_b}{\alpha q_b + (1 - \alpha) q_a}$ and $c_b \leftarrow 1 - c_a$ and $\rho \leftarrow \frac{1}{2k}\min\left\{\frac{p_a}{q_a c_a}, \frac{p_b}{q_b c_b}\right\}$
13:    $q_e \leftarrow \rho c_a q_a + \rho c_b q_b$ and $q_a \leftarrow (1 - \rho c_a) q_a$ and $q_b \leftarrow (1 - \rho c_b) q_b$
14:  end for
15:  return $q$
16: end function

Algorithm 25: NeighborhoodWatch2.
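Since Redistribute is the least standard step, a short Python sketch of that routine alone may help. It follows lines 8–16 of the listing directly; the representation of the duplicate actions $\mathcal D$ and the hypothetical helper `find_convex_witness(e)`, which returns the pair $(a, b)$ with $e \in N_{ab}$ together with the coefficient $\alpha$ from line 11, are assumptions made only for this sketch.

```python
import numpy as np

def redistribute(p, D, find_convex_witness, k):
    """Sketch of Redistribute: shifts a small amount of probability mass from
    Pareto-optimal actions a, b onto each duplicate action e, following
    lines 9-15 of the listing above.

    p : length-k array, the fixed-point distribution tilde-P_t
    D : iterable of duplicate actions
    find_convex_witness(e) : hypothetical helper returning (a, b, alpha) with
        e in N_ab and ell_e = alpha * ell_a + (1 - alpha) * ell_b
    """
    q = np.array(p, dtype=float)
    for e in D:
        a, b, alpha = find_convex_witness(e)
        # Mixture coefficients and the capped fraction of mass to move (line 12).
        c_a = alpha * q[b] / (alpha * q[b] + (1 - alpha) * q[a])
        c_b = 1.0 - c_a
        rho = min(p[a] / (q[a] * c_a), p[b] / (q[b] * c_b)) / (2 * k)
        # Move mass from a and b onto the duplicate action e (line 13).
        q[e] = rho * c_a * q[a] + rho * c_b * q[b]
        q[a] = (1 - rho * c_a) * q[a]
        q[b] = (1 - rho * c_b) * q[b]
    return q
```

The output of this routine is then mixed with the uniform distribution as in line 5 of the listing to obtain $P_t$.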
The next theorem bounds the regret of NeighborhoodWatch2 with high probability for locally observable games.

Theorem 37.15. Let $\hat R_n$ be the random regret
$$\hat R_n = \max_{b \in [k]} \sum_{t=1}^n \langle \ell_{A_t} - \ell_b, u_t \rangle\,.$$