Bandit Algorithms

(Jeff_L) #1
37.5 Policy for easy games 461

At∈


a,bNab, where the union is over pairs of neighboring actions. Removing
degenerate actions can only increase the minimax regret, so to simplify the
notation assume that

[k] =


a,bneighbors

Nab.

LetAbe an arbitrary largest subset of the Pareto optimal actions such thatA
does not contain actions that are duplicates of each other andD= [k]\Abe the
remaining actions. In each roundtthe policy performs four steps as described
below.

Step 1 (Local games)
For eachj∈Athe policy maintains an exponential weights distributionQtjover
A∪D= [k], but supported on the intersection of the neighborhoodNjofjand
A. Fora∈[k] the value ofQtjais

Qtja=

INj∩A(a) exp

(


−η

∑t− 1
s=1Z ̃sja

)



b∈Nj∩Aexp

(


−η

∑t− 1
s=1Z ̃sjb

),


whereη >0 is the learning rate and theZ ̃sjaare estimators of the loss difference
ysa−ysjand will be introduced in step four below. For actionsj∈ Ddefine
Qtja=IA(a)/|A|to be the uniform distribution overA.

Step 2 (Global game)
The next step is to merge the local distributions over small neighborhoods into
a global distribution over [k] =A∪D. A square matrix isright stochasticif
it has positive entries and its rows sum to one. Such ad×dmatrix describes a
homogeneous Markov chain with state-space [d] and rowi∈[d] of the matrix
defines the distribution over the next-states. The following result is all that we
need, the proof of which is left to you (Exercise 37.8).

Lemma37.13.LetQtbe the right stochastic matrix withjth row equal toQtj.
Then there exists a unique distributionP ̃tsuch thatP ̃t>=P ̃t>Qt. Furthermore,
this distribution is supported onA.


The distributionP ̃tis called thestationary distributionof the Markov
chain with transition matrixQt. It is supported onAbecause followingQtnever
transitions to states outside ofA. The only reason to include actions inDis that
we wantP ̃tto be defined over [k] to simplify what follows. Writing out the matrix
multiplication shows that

P ̃tj=


a∈A

P ̃taQtaj, (37.11)

which we use repeatedly in the analysis that follows. In particular, this identity
plays a key role in relating the regret to a weighted sum of ‘local regrets’.
Free download pdf