Step 3 (Redistribution)
Now $\tilde{P}_t$ is rebalanced to a new distribution $P_t$ for which duplicate and degenerate actions in $\mathcal{D}$ are played with sufficient probability. This is done iteratively, starting with $P_t = \tilde{P}_t$. Then for each $e \in \mathcal{D}$ the algorithm finds actions $a, b \in \mathcal{A}$ such that $\ell_e = \alpha \ell_a + (1 - \alpha) \ell_b$ for some $\alpha \in [0, 1]$, which is possible by Lemma 37.7.
Then $P_t$ is updated so that some of the probability assigned to actions $a$ and $b$ is transferred to action $e$. This is done in such a way that the expected loss of playing according to $P_t$ is approximately the same as under $\tilde{P}_t$. After mass has been assigned to all degenerate actions, the algorithm incorporates a small amount of fixed exploration. The complete procedure is given in Algorithm 25. The next lemma formalizes the properties of $P_t$ that will be critical in what follows. The proof is left to the reader (Exercise 37.9).
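To see why the redistribution approximately preserves the expected loss, note that removing mass $\delta \alpha$ from $a$ and $\delta (1 - \alpha)$ from $b$ while adding mass $\delta$ to $e$ leaves $\sum_a P_{ta} \ell_a$ exactly unchanged, since $\ell_e = \alpha \ell_a + (1 - \alpha) \ell_b$; only the final exploration mixing perturbs it. Before stating the lemma, here is a Python sketch of one pass of this idea. The decomposition oracle `decompose` and the transfer amount `delta` are illustrative stand-ins, not the precise quantities of Algorithm 25.

```python
import numpy as np

def redistribute(P_tilde, D, decompose, gamma, k):
    """Illustrative sketch of Step 3 (redistribution).

    P_tilde:   length-k probability vector to be rebalanced.
    D:         set of duplicate and degenerate actions.
    decompose: hypothetical oracle returning (a, b, alpha) with
               ell_e = alpha * ell_a + (1 - alpha) * ell_b (Lemma 37.7).
    gamma:     exploration parameter in [0, 1/2].
    """
    P = P_tilde.copy()
    for e in D:
        a, b, alpha = decompose(e)
        # Largest feasible transfer, halved so that a and b keep some mass.
        caps = []
        if alpha > 0:
            caps.append(P[a] / alpha)
        if alpha < 1:
            caps.append(P[b] / (1 - alpha))
        delta = min(caps) / 2
        # Moving delta*alpha from a and delta*(1-alpha) from b onto e keeps
        # the expected loss sum_a P[a] * ell_a unchanged, because
        # ell_e = alpha * ell_a + (1 - alpha) * ell_b.
        P[a] -= delta * alpha
        P[b] -= delta * (1 - alpha)
        P[e] += delta
    # Fixed exploration: guarantees P[a] >= gamma / k for every action,
    # matching Lemma 37.14(d) below.
    return (1 - gamma) * P + gamma / k
```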
Lemma 37.14. Assume $\gamma \in [0, 1/2]$, let $u \in \mathcal{P}_{d-1}$ and let $a, j \in \mathcal{A}$ be arbitrary neighbors. Then $P_t \in \mathcal{P}_{k-1}$ is a probability vector and the following hold:
(a) $P_{ta} \ge \tilde{P}_{ta} / 4$.
(b) $\bigl| \sum_{a=1}^k (P_{ta} - \tilde{P}_{ta}) \langle \ell_a, u \rangle \bigr| \le \gamma$.
(c) $P_{tb} \ge \tilde{P}_{tj} Q_{tja} / (4k)$ for any non-duplicate $b \in \mathcal{N}_{ja}$.
(d) $P_{ta} \ge \gamma / k$.
(e) $P_{te} \ge \tilde{P}_{tj} / (4k)$ for any $e \in [k]$ such that $\ell_e = \ell_j$.
Step 4 (Acting and estimating)
By the definition of local observability, for each pair of neighboring actions $a, b$ there exists a function $v_{ab} : [k] \times [F] \to \mathbb{R}$ satisfying Eq. (37.3) and with $v_{ab}(c, f) = 0$ whenever $c \notin \mathcal{N}_{ab}$. Even though $a$ is not a neighbor of itself, for notational convenience we define $v_{aa}(c, f) = 0$ for all $c, f$. While the policy will work for any admissible choice of $v_{ab}$, the analysis suggests minimizing
$$V = \max_{a, b} \|v_{ab}\|_\infty,$$
with the maximum over all pairs of neighbors.
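As a concrete illustration, an admissible $v_{ab}$ can be computed numerically. The sketch below assumes that Eq. (37.3), which is not reproduced in this excerpt, takes the form $\ell_a - \ell_b = \sum_{c \in \mathcal{N}_{ab}} S_c^\top v_{ab}(c, \cdot)$ with $S_c \in \{0, 1\}^{F \times d}$ the signal matrix of action $c$. It returns the minimum-norm least-squares solution rather than the minimizer of $\|v_{ab}\|_\infty$, which would require solving a small linear program.

```python
import numpy as np

def estimation_function(ell_a, ell_b, signal_matrices, N_ab):
    """Sketch: compute one admissible v_ab for a neighboring pair (a, b).

    ell_a, ell_b:    length-d loss vectors of actions a and b.
    signal_matrices: dict mapping each action c to its F x d signal matrix S_c.
    N_ab:            the neighborhood of (a, b); v_ab(c, .) = 0 outside it.
    """
    cells = sorted(N_ab)
    F = next(iter(signal_matrices.values())).shape[0]
    # Unknowns are the stacked values v_ab(c, f) for c in N_ab and f in [F];
    # the constraint reads A @ coef = ell_a - ell_b.
    A = np.hstack([signal_matrices[c].T for c in cells])  # shape (d, |N_ab| * F)
    coef, *_ = np.linalg.lstsq(A, ell_a - ell_b, rcond=None)
    # An exact solution exists by local observability; lstsq returns the one
    # of minimum Euclidean norm.
    return {c: coef[i * F:(i + 1) * F] for i, c in enumerate(cells)}
```

Given such a $v_{ab}$ for every neighboring pair, $V$ is then the largest absolute entry over all of them.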
In Exercise 37.12 you will show that if $|\mathcal{N}_{ab}| = 2$, then $v_{ab}$ can be chosen so that $\|v_{ab}\|_\infty \le 1 + F$, and that in the worst case this bound is tight. As discussed in the exercise, this result no longer holds for larger $\mathcal{N}_{ab}$.
In round $t$, the action $A_t$ is chosen at random from $P_t$. The loss difference estimators are then computed by
$$\tilde{Z}_{tja} = \hat{Z}_{tja} - \beta_{tja},$$