then $R_n^*(G) = \Omega(n)$. All that remains is to prove that (a) if $G$ has no neighboring
actions, then $R_n^*(G) = 0$ and (b) if $G$ is globally observable, but not locally
observable, then $R_n^*(G) = O(n^{2/3})$.
Theorem 37.20. If $G$ has no neighboring actions, then $R_n^*(G) = 0$.
Proof Since $G$ has no neighboring actions, there exists an action $a$ such that
$C_a = \mathcal{P}_{d-1}$ and the policy that chooses $A_t = a$ for all rounds suffers no regret.
Theorem 37.21. If $G$ is globally observable, then $R_n^*(G) = O(n^{2/3})$.
Proof sketch Let $\mathcal{A} \subseteq [k]$ be the set of Pareto optimal actions and $a^\circ \in \mathcal{A}$. Use
the definition of global observability to show that for each $a \in \mathcal{A}$ there exists a
function $h_a : [k] \times [F] \to \mathbb{R}$ such that
\[
\sum_{b=1}^{k} h_a(b, \Phi_{bi}) = \ell_{ai} - \ell_{a^\circ i} \qquad \text{for all } i \in [d]\,.
\]
Then define the unbiased loss difference estimator $\hat{\Delta}_{ta} = h_a(A_t, \Phi_t)/P_{tA_t}$, where
\[
P_{ta} = (1-\gamma)\,\frac{\mathbb{1}_{\mathcal{A}}(a)\exp\Bigl(-\eta \sum_{s=1}^{t-1} \hat{\Delta}_{sa}\Bigr)}{\sum_{b \in \mathcal{A}} \exp\Bigl(-\eta \sum_{s=1}^{t-1} \hat{\Delta}_{sb}\Bigr)} + \frac{\gamma}{k}\,.
\]
The result is completed by repeating the standard analysis of the exponential
weights algorithm (or mirror descent with the negentropy potential) and optimizing
$\gamma$ and $\eta$.
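The following Python sketch shows one way the sampling distribution and estimator in the proof sketch could be implemented. It is not from the text: the names `pareto`, `h` and `observe` are assumed interfaces for the set $\mathcal{A}$, the functions $h_a$ and the feedback oracle.

```python
import numpy as np

def global_exp_weights(n, k, pareto, h, observe, eta, gamma, seed=0):
    """Sketch of the exponential-weights scheme from the proof of Theorem 37.21.

    Assumed interfaces (illustrative, not from the text):
      pareto  -- list of Pareto optimal actions (the set A), a subset of range(k)
      h[a]    -- callable with h[a](b, feedback) = h_a(b, Phi), as in the proof sketch
      observe -- callable; observe(a) plays action a and returns the feedback symbol
    """
    rng = np.random.default_rng(seed)
    S = np.zeros(len(pareto))                    # cumulative estimates sum_s Delta-hat_{sa}
    for t in range(n):
        w = np.exp(-eta * (S - S.min()))         # shift for numerical stability; cancels after normalizing
        p = np.full(k, gamma / k)                # uniform exploration component gamma/k
        p[pareto] += (1.0 - gamma) * w / w.sum() # exponential weights over Pareto optimal actions
        At = rng.choice(k, p=p)
        feedback = observe(At)                   # only the feedback symbol is revealed
        for j, a in enumerate(pareto):
            S[j] += h[a](At, feedback) / p[At]   # importance-weighted loss-difference estimate
    return p
```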
37.8 Notes
1 A stochastic partial monitoring problem is specified by a $(\Theta \times \mathcal{A}, \mathcal{F} \otimes \mathcal{G}) \to (\mathcal{Y} \times \mathbb{R}, \mathcal{H} \otimes \mathfrak{B}(\mathbb{R}))$ probability kernel $(P_{\theta,a})_{\theta \in \Theta, a \in \mathcal{A}}$. The environment chooses
$\theta \in \Theta$ and the learner chooses $A_1, A_2, \ldots \in \mathcal{A}$ and observes $Y_1, Y_2, \ldots$ in a
sequential manner, where $(Y_t, R_t) \sim P_{\theta, A_t}(\cdot)$. The reward $R_t$ of round $t$ is
unobserved. The learner's goal is to maximize the total expected reward, or,
equivalently, to minimize regret. In the literature, the best-studied case is when
$\Theta$ is finite (finite stochastic partial monitoring), which is the closest to the
setting studied in this chapter.
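As a rough illustration of this protocol, a simulation loop might look as follows. The interfaces `kernel` and `policy` are assumptions introduced here, not part of the text.

```python
import numpy as np

def simulate(theta, kernel, policy, n, seed=0):
    """Sketch of the stochastic partial monitoring protocol in Note 1.

    Assumed interfaces (illustrative):
      kernel(theta, a, rng) -- samples (Y_t, R_t) ~ P_{theta, a}
      policy(history)       -- chooses the next action from past (action, feedback) pairs
    """
    rng = np.random.default_rng(seed)
    history, total_reward = [], 0.0
    for t in range(n):
        At = policy(history)
        Yt, Rt = kernel(theta, At, rng)
        history.append((At, Yt))   # the learner observes Y_t only
        total_reward += Rt         # R_t stays hidden from the learner
    return total_reward
```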
2 A nonempty set $L \subseteq \mathbb{R}^n$ is a linear subspace of $\mathbb{R}^n$ if $\alpha v + \beta w \in L$ for
all $\alpha, \beta \in \mathbb{R}$ and $v, w \in L$. If $L$ and $M$ are linear subspaces of $\mathbb{R}^n$, then
$L \oplus M = \{v + w : v \in L, w \in M\}$. The orthogonal complement of a linear
subspace $L$ is $L^\perp = \{v \in \mathbb{R}^n : \langle u, v \rangle = 0 \text{ for all } u \in L\}$. The following
properties are easily checked: (i) $L^\perp$ is a linear subspace, (ii) $(L^\perp)^\perp = L$ and
(iii) $(L \cap M)^\perp = L^\perp \oplus M^\perp$.
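As a quick check of (iii), take $L = \mathrm{span}\{e_1, e_2\}$ and $M = \mathrm{span}\{e_2, e_3\}$ in $\mathbb{R}^3$: then $L \cap M = \mathrm{span}\{e_2\}$, so $(L \cap M)^\perp = \mathrm{span}\{e_1, e_3\}$, while $L^\perp = \mathrm{span}\{e_3\}$ and $M^\perp = \mathrm{span}\{e_1\}$, whose sum is again $\mathrm{span}\{e_1, e_3\}$.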
3 Let $A \in \mathbb{R}^{m \times n}$ be a matrix and recall that matrices of this form correspond
to linear maps from $\mathbb{R}^n$ to $\mathbb{R}^m$, where the function $A : \mathbb{R}^n \to \mathbb{R}^m$ is given by