37.5 Policy for easy games 463
whereZˆtjais an unbiased estimator ofyta−ytjandβtjais a bias term:
Zˆtja=
P ̃tjvaj(At,Φt)
PtAt
and βtja=ηV^2
∑
b∈Naj
P ̃tj^2
Ptb
. (37.12)
The four steps described so far are summarized in Algorithm 25 below.
1:Input L, Φ,η,γ
2:fort∈ 1 ,...,ndo
3: Fora,j∈[k] let
Qtja=IA(j)
INj∩A(a) exp
(
−η
∑t− 1
s=1Z ̃sja
)
∑
b∈Nj∩Aexp
(
−η
∑t− 1
s=1Z ̃sja
)+ID(j)IA(a)
|A|
4: Find distributionP ̃tsuch thatP ̃t>=P ̃t>Qt
5: ComputePt= (1−γ)Redistribute(P ̃t) +γk 1 and sampleAt∼Pt
6: Compute loss-difference estimators for eachj∈Aanda∈Nj∩A.
Zˆtja=
P ̃tjvaj(At,Φt)
PtAt
βtja=ηV^2
∑
b∈Naj
P ̃tj^2
Ptb
(37.13)
Z ̃tja=Zˆtja−βtja
7:end for
8:functionRedistribute(p)
9: q←p
10: fore∈Ddo
11: Finda,bwithe∈Nabandα∈[0,1] such that`e=α`a+ (1−α)`b
12: ca←αqb+(1αq−bα)qaandcb← 1 −caandρ← 21 kmin
{p
a
qaca,
pb
qbcb
}
13: qe←ρcaqa+ρcbqbandqa←(1−ρca)qaandqb←(1−ρcb)qb
14: end for
15: returnq
16:end function
Algorithm 25:NeighborhoodWatch2.
The next theorem bounds the regret of NeighborhoodWatch2 with high
probability for locally observable games.
Theorem37.15.LetRˆnbe the random regret
Rˆn= max
b∈[k]
∑n
t=1
〈`At−`b,ut〉.