Bandit Algorithms

37.5 Policy for easy games 461

At∈

⋃

a,bNab, where the union is over pairs of neighboring actions. Removing degenerate actions can only increase the minimax regret, so to simplify the notation assume that

[k] =

⋃

a,bneighbors

Nab.

LetAbe an arbitrary largest subset of the Pareto optimal actions such thatA does not contain actions that are duplicates of each other andD= [k]\Abe the remaining actions. In each roundtthe policy performs four steps as described below.

Step 1 (Local games) For eachj∈Athe policy maintains an exponential weights distributionQtjover A∪D= [k], but supported on the intersection of the neighborhoodNjofjand A. Fora∈[k] the value ofQtjais

Qtja=

INj∩A(a) exp

(

−η

∑t− 1 s=1Z ̃sja

)

∑

b∈Nj∩Aexp

(

−η

∑t− 1 s=1Z ̃sjb

),

whereη >0 is the learning rate and theZ ̃sjaare estimators of the loss difference ysa−ysjand will be introduced in step four below. For actionsj∈ Ddefine Qtja=IA(a)/|A|to be the uniform distribution overA.

Step 2 (Global game) The next step is to merge the local distributions over small neighborhoods into a global distribution over [k] =A∪D. A square matrix isright stochasticif it has positive entries and its rows sum to one. Such ad×dmatrix describes a homogeneous Markov chain with state-space [d] and rowi∈[d] of the matrix defines the distribution over the next-states. The following result is all that we need, the proof of which is left to you (Exercise 37.8).

Lemma37.13.LetQtbe the right stochastic matrix withjth row equal toQtj.
Then there exists a unique distributionP ̃tsuch thatP ̃t>=P ̃t>Qt. Furthermore,
this distribution is supported onA.

The distributionP ̃tis called thestationary distributionof the Markov chain with transition matrixQt. It is supported onAbecause followingQtnever transitions to states outside ofA. The only reason to include actions inDis that we wantP ̃tto be defined over [k] to simplify what follows. Writing out the matrix multiplication shows that

P ̃tj=

∑

a∈A

P ̃taQtaj, (37.11)

which we use repeatedly in the analysis that follows. In particular, this identity plays a key role in relating the regret to a weighted sum of ‘local regrets’.

Bandit Algorithms

⋃

⋃

(

)

∑

(

),

∑

Get our desktop app

Company

Features

Documentation

Resources