Bandit Algorithms

37.4 Lower bounds 457

chosen independently at random from the same distribution. To emphasize the randomness we switch to capital letters. Given a partial monitoring game G= (L,Φ) and probability vectoru∈ Pd− 1 the stochastic partial monitoring environment associated withusamples a sequence of independently and identically distributed random variablesI 1 ,...,InwithP(It=i)=uiandUt=eIt. In each roundta policy chooses actionAtand receives feedback Φt= ΦAtIt. The regret is

Rn(π,u) = max a∈[k]

E

[n ∑

t=1

〈`At−`a,Ut〉

]

= max a∈[k]

E

[n ∑

t=1

〈`At−`a,u〉

]

.

The reader should check thatR∗n(G)≥minπmaxu∈Pd− 1 Rn(π,u), which allows
us to restrict our attention to stochastic partial monitoring problems. Given
u,q∈Pd− 1 , letD(u,q) be the relative entropy between categorical distributions
with parametersuandqrespectively:

D(u,q) =

∑k

i=1

uilog

(

ui qi

)

≤

∑k

i=1

(ui−qi)^2 qi

, (37.4)

where the second inequality follows from the fact that for measuresP,Qwe have
D(P,Q)≤χ^2 (P,Q) (see Note 5 in Chapter 13).

Theorem37.10. LetG= (L,Φ)be a globally observable partial monitoring
problem that is not locally observable. Then there exists a constantcG> 0 such
thatR∗n(G)≥cGn^2 /^3.

Proof The proof involves several steps. Roughly, we need to define two alternative
stochastic partial monitoring problems. We then show these environments are
hard to distinguish without playing an action associated with a large loss. Finally
we balance the cost of distinguishing the environments against the linear cost of
playing randomly.

Step 1: Defining the alternatives Leta,bbe a pair neighboring actions that are not locally observable. Then by definitionCa∩Cbis a polytope of dimensiond−2. Letube the centroid of Ca∩Cband ε= min c/∈Nab

〈`c−`a,u〉. (37.5)

The value ofεis well defined, since by global observability ofG, but nonlocal
observability of (a,b) there must exist some actionc /∈Nab. Furthermore, since
c /∈Nabit follows thatε >0. As in the lower bound constructions for stochastic
bandits, we now define two stochastic partial monitoring problems. Since (a,b)
are not locally observable, there does not exist a functionv: [k]×[F]→Rsuch
that for alli∈[d],
∑

c∈Nab

v(c,Φci) =`ai−`bi. (37.6)

Bandit Algorithms

E

]

E

]

.

(

)

≤

, (37.4)

Get our desktop app

Company

Features

Documentation

Resources