Bandit Algorithms

37.5 Policy for easy games

where $\hat Z_{tja}$ is an unbiased estimator of $y_{ta} - y_{tj}$ and $\beta_{tja}$ is a bias term:

$$
\hat Z_{tja} = \frac{\tilde P_{tj}\, v_{aj}(A_t, \Phi_t)}{P_{t A_t}}
\qquad \text{and} \qquad
\beta_{tja} = \eta V^2 \sum_{b \in N_{aj}} \frac{\tilde P_{tj}^2}{P_{tb}} \,.
\tag{37.12}
$$
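As a point of reference, here is a minimal Python sketch of how one round's estimator and bias term in (37.12) might be evaluated. The containers `P_tilde`, `P`, `v` and `N`, and the function name itself, are illustrative assumptions, not code from the book.

```python
def loss_difference_estimate(j, a, A_t, Phi_t, P_tilde, P, v, N, eta, V):
    """Illustrative computation of (37.12) for one pair (j, a) at round t.

    P_tilde : length-k sequence, the fixed-point distribution of Q_t
    P       : length-k sequence, the distribution A_t was sampled from
    v       : v[a][j] is the estimating function, called as v[a][j](A_t, Phi_t)
    N       : N[a][j] is the neighbourhood set N_{aj} (iterable of actions)
    """
    # Importance-weighted estimate of the loss difference y_{ta} - y_{tj}
    Z_hat = P_tilde[j] * v[a][j](A_t, Phi_t) / P[A_t]
    # Bias term beta_{tja} from (37.12)
    beta = eta * V**2 * sum(P_tilde[j]**2 / P[b] for b in N[a][j])
    return Z_hat, beta
```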


The four steps described so far are summarized in Algorithm 25 below.

1: Input: $L$, $\Phi$, $\eta$, $\gamma$
2: for $t \in 1, \ldots, n$ do
3:     For $a, j \in [k]$ let
$$
Q_{tja} = \frac{\mathbb{1}_{\mathcal{A}}(j)\, \mathbb{1}_{N_j \cap \mathcal{A}}(a) \exp\left(-\eta \sum_{s=1}^{t-1} \tilde Z_{sja}\right)}{\sum_{b \in N_j \cap \mathcal{A}} \exp\left(-\eta \sum_{s=1}^{t-1} \tilde Z_{sjb}\right)} + \frac{\mathbb{1}_{\mathcal{D}}(j)\, \mathbb{1}_{\mathcal{A}}(a)}{|\mathcal{A}|}
$$
4:     Find distribution $\tilde P_t$ such that $\tilde P_t^\top = \tilde P_t^\top Q_t$
5:     Compute $P_t = (1 - \gamma)\, \textsc{Redistribute}(\tilde P_t) + \gamma \mathbf{1}/k$ and sample $A_t \sim P_t$
6:     Compute loss-difference estimators for each $j \in \mathcal{A}$ and $a \in N_j \cap \mathcal{A}$:
$$
\hat Z_{tja} = \frac{\tilde P_{tj}\, v_{aj}(A_t, \Phi_t)}{P_{t A_t}} \,, \qquad
\beta_{tja} = \eta V^2 \sum_{b \in N_{aj}} \frac{\tilde P_{tj}^2}{P_{tb}}
\tag{37.13}
$$
$$
\tilde Z_{tja} = \hat Z_{tja} - \beta_{tja}
$$
7: end for
8: function Redistribute($p$)
9:     $q \leftarrow p$
10:    for $e \in \mathcal{D}$ do
11:        Find $a, b$ with $e \in N_{ab}$ and $\alpha \in [0, 1]$ such that $\ell_e = \alpha \ell_a + (1 - \alpha) \ell_b$
12:        $c_a \leftarrow \frac{\alpha q_b}{\alpha q_b + (1 - \alpha) q_a}$ and $c_b \leftarrow 1 - c_a$ and $\rho \leftarrow \frac{1}{2k} \min\left\{\frac{p_a}{q_a c_a}, \frac{p_b}{q_b c_b}\right\}$
13:        $q_e \leftarrow \rho c_a q_a + \rho c_b q_b$ and $q_a \leftarrow (1 - \rho c_a) q_a$ and $q_b \leftarrow (1 - \rho c_b) q_b$
14:    end for
15:    return $q$
16: end function
Algorithm 25: NeighborhoodWatch2.
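Below is a minimal Python sketch of the two non-standard steps of Algorithm 25: the fixed-point computation $\tilde P_t^\top = \tilde P_t^\top Q_t$ on line 4 and the Redistribute routine on lines 8-16. The function names, the `decompositions` lookup supplying $(a, b, \alpha)$ for each duplicate action $e$, and the eigenvector-based fixed-point computation are assumptions made for illustration, not the book's own code.

```python
import numpy as np

def fixed_point_distribution(Q):
    """Line 4: find P_tilde with P_tilde^T = P_tilde^T Q, i.e. a left
    eigenvector of Q for eigenvalue 1, normalised to a probability vector."""
    evals, evecs = np.linalg.eig(Q.T)
    p = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    p = np.abs(p)
    return p / p.sum()

def redistribute(p, D, decompositions, k):
    """Lines 8-16: shift a small amount of mass from the neighbours of each
    duplicate action onto the duplicate. `decompositions[e] = (a, b, alpha)`
    is assumed to give a, b with e in N_ab and alpha with
    l_e = alpha * l_a + (1 - alpha) * l_b."""
    q = np.array(p, dtype=float)
    for e in D:
        a, b, alpha = decompositions[e]
        c_a = alpha * q[b] / (alpha * q[b] + (1 - alpha) * q[a])
        c_b = 1.0 - c_a
        rho = min(p[a] / (q[a] * c_a), p[b] / (q[b] * c_b)) / (2 * k)
        q[e] = rho * c_a * q[a] + rho * c_b * q[b]   # uses the current q_a, q_b
        q[a] = (1 - rho * c_a) * q[a]
        q[b] = (1 - rho * c_b) * q[b]
    return q
```

Under these assumptions, line 5 would then read `P = (1 - gamma) * redistribute(P_tilde, D, decompositions, k) + gamma / k`, and $A_t$ is sampled from `P`.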

The next theorem bounds the regret of NeighborhoodWatch2 with high
probability for locally observable games.

Theorem 37.15. Let $\hat R_n$ be the random regret

$$
\hat R_n = \max_{b \in [k]} \sum_{t=1}^n \langle \ell_{A_t} - \ell_b, u_t \rangle \,.
$$
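As a small illustration (not from the book), the random regret above could be evaluated from the realised action and outcome sequences, writing $u_t$ as the standard basis vector of the observed outcome $i_t$ so that $\langle \ell_a, u_t \rangle = L_{a i_t}$; the function and argument names below are assumptions.

```python
import numpy as np

def random_regret(L, actions, outcomes):
    """R_hat_n = max_b sum_t <l_{A_t} - l_b, u_t>, with L the k x d loss matrix
    and `actions`, `outcomes` the realised sequences (A_t) and (i_t)."""
    L = np.asarray(L, dtype=float)
    incurred = L[actions, outcomes].sum()          # sum_t <l_{A_t}, u_t>
    best_fixed = L[:, outcomes].sum(axis=1).min()  # min_b sum_t <l_b, u_t>
    return incurred - best_fixed
```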