then $R_n^*(G) = \Omega(n)$. All that remains is to prove that (a) if $G$ has no neighboring
actions, then $R_n^*(G) = 0$ and (b) if $G$ is globally observable, but not locally
observable, then $R_n^*(G) = O(n^{2/3})$.
Theorem 37.20. If $G$ has no neighboring actions, then $R_n^*(G) = 0$.
Proof Since $G$ has no neighboring actions, there exists an action $a$ such that
$C_a = \mathcal{P}_{d-1}$ and the policy that chooses $A_t = a$ for all rounds suffers no regret.
Theorem 37.21. If $G$ is globally observable, then $R_n^*(G) = O(n^{2/3})$.
Proof sketch Let $\mathcal{A} \subseteq [k]$ be the set of Pareto optimal actions and $a^\circ \in \mathcal{A}$. Use
the definition of global observability to show that for each $a \in \mathcal{A}$ there exists a
function $h_a : [k] \times [F] \to \mathbb{R}$ such that
\[
\sum_{b=1}^{k} h_a(b, \Phi_{bi}) = \ell_{ai} - \ell_{a^\circ i} \qquad \text{for all } i \in [d]\,.
\]
Then define the unbiased loss difference estimator $\hat{\Delta}_{ta} = h_a(A_t, \Phi_t)/P_{tA_t}$, where
\[
P_{ta} = (1-\gamma)\,\frac{\mathbb{1}_{\mathcal{A}}(a)\exp\Bigl(-\eta \sum_{s=1}^{t-1} \hat{\Delta}_{sa}\Bigr)}{\sum_{b \in \mathcal{A}} \exp\Bigl(-\eta \sum_{s=1}^{t-1} \hat{\Delta}_{sb}\Bigr)} + \frac{\gamma}{k}\,.
\]
The result is completed by repeating the standard analysis of the exponential
weights algorithm (or mirror descent with the negentropy potential) and optimizing
$\gamma$ and $\eta$.
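The following Python sketch shows one way the sampling distribution and estimator in the proof sketch could be implemented. It is not from the text: the names `pareto`, `h` and `observe` are assumed interfaces for the set $\mathcal{A}$, the functions $h_a$ and the feedback oracle.

```python
import numpy as np

def global_exp_weights(n, k, pareto, h, observe, eta, gamma, seed=0):
    """Sketch of the exponential-weights scheme from the proof of Theorem 37.21.

    Assumed interfaces (illustrative, not from the text):
      pareto  -- list of Pareto optimal actions (the set A), a subset of range(k)
      h[a]    -- callable with h[a](b, feedback) = h_a(b, Phi), as in the proof sketch
      observe -- callable; observe(a) plays action a and returns the feedback symbol
    """
    rng = np.random.default_rng(seed)
    S = np.zeros(len(pareto))                    # cumulative estimates sum_s Delta-hat_{sa}
    for t in range(n):
        w = np.exp(-eta * (S - S.min()))         # shift for numerical stability; cancels after normalizing
        p = np.full(k, gamma / k)                # uniform exploration component gamma/k
        p[pareto] += (1.0 - gamma) * w / w.sum() # exponential weights over Pareto optimal actions
        At = rng.choice(k, p=p)
        feedback = observe(At)                   # only the feedback symbol is revealed
        for j, a in enumerate(pareto):
            S[j] += h[a](At, feedback) / p[At]   # importance-weighted loss-difference estimate
    return p
```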
37.8 Notes
1 A stochastic partial monitoring problem is specified by a $(\Theta \times \mathcal{A}, \mathcal{F} \otimes \mathcal{G}) \to (\mathcal{Y} \times \mathbb{R}, \mathcal{H} \otimes \mathfrak{B}(\mathbb{R}))$ probability kernel $(P_{\theta,a})_{\theta \in \Theta, a \in \mathcal{A}}$. The environment chooses
$\theta \in \Theta$ and the learner chooses $A_1, A_2, \ldots \in \mathcal{A}$ and observes $Y_1, Y_2, \ldots$ in a
sequential manner, where $(Y_t, R_t) \sim P_{\theta, A_t}(\cdot)$. The reward $R_t$ of round $t$ is
unobserved. The learner's goal is to maximize the total expected reward, or,
equivalently, to minimize regret. In the literature, the best-studied case is when
$\Theta$ is finite (finite stochastic partial monitoring), which is the closest to the
setting studied in this chapter.
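As a rough illustration of this protocol, a simulation loop might look as follows. The interfaces `kernel` and `policy` are assumptions introduced here, not part of the text.

```python
import numpy as np

def simulate(theta, kernel, policy, n, seed=0):
    """Sketch of the stochastic partial monitoring protocol in Note 1.

    Assumed interfaces (illustrative):
      kernel(theta, a, rng) -- samples (Y_t, R_t) ~ P_{theta, a}
      policy(history)       -- chooses the next action from past (action, feedback) pairs
    """
    rng = np.random.default_rng(seed)
    history, total_reward = [], 0.0
    for t in range(n):
        At = policy(history)
        Yt, Rt = kernel(theta, At, rng)
        history.append((At, Yt))   # the learner observes Y_t only
        total_reward += Rt         # R_t stays hidden from the learner
    return total_reward
```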
2 A nonempty set $L \subseteq \mathbb{R}^n$ is a linear subspace of $\mathbb{R}^n$ if $\alpha v + \beta w \in L$ for
all $\alpha, \beta \in \mathbb{R}$ and $v, w \in L$. If $L$ and $M$ are linear subspaces of $\mathbb{R}^n$, then
$L \oplus M = \{v + w : v \in L, w \in M\}$. The orthogonal complement of a linear
subspace $L$ is $L^\perp = \{v \in \mathbb{R}^n : \langle u, v \rangle = 0 \text{ for all } u \in L\}$. The following
properties are easily checked: (i) $L^\perp$ is a linear subspace, (ii) $(L^\perp)^\perp = L$ and
(iii) $(L \cap M)^\perp = L^\perp \oplus M^\perp$.
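As a quick check of (iii), take $L = \mathrm{span}\{e_1, e_2\}$ and $M = \mathrm{span}\{e_2, e_3\}$ in $\mathbb{R}^3$: then $L \cap M = \mathrm{span}\{e_2\}$, so $(L \cap M)^\perp = \mathrm{span}\{e_1, e_3\}$, while $L^\perp = \mathrm{span}\{e_3\}$ and $M^\perp = \mathrm{span}\{e_1\}$, whose sum is again $\mathrm{span}\{e_1, e_3\}$.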
3 Let $A \in \mathbb{R}^{m \times n}$ be a matrix and recall that matrices of this form correspond
to linear maps from $\mathbb{R}^n$ to $\mathbb{R}^m$, where the function $A : \mathbb{R}^n \to \mathbb{R}^m$ is given by