37.4 Lower bounds
In this form it does not seem obvious what the next step should be. To clear things up we introduce some linear algebra. Let $S_c \in \{0,1\}^{F \times d}$ be the matrix with $(S_c)_{fi} = \mathbb{I}\{\Phi_{ci} = f\}$, which is chosen so that $S_c e_i = e_{\Phi_{ci}}$. Define the linear map $S : \mathbb{R}^d \to \mathbb{R}^{|N_{ab}|F}$ by
\[
S = \begin{pmatrix} S_a \\ S_b \\ \vdots \\ S_c \end{pmatrix},
\]
which is the matrix formed by stacking the matrices $\{S_c : c \in N_{ab}\}$. Then there exists a $v$ satisfying Eq. (37.6) if and only if there exists a $w \in \mathbb{R}^{|N_{ab}|F}$ such that $(\ell_a - \ell_b)^\top = w^\top S$. In other words, actions $(a, b)$ are locally observable if and only if $\ell_a - \ell_b \in \operatorname{im}(S^\top)$.
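To make the definition of $S_c$ concrete, here is a small numerical sketch. It is not part of the original text: NumPy, the names `signal_matrix`, `stacked_map` and `locally_observable`, and the convention that feedback symbols are coded as integers $0,\dots,F-1$ are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not from the text): feedback symbols are assumed to be
# coded as integers 0, ..., F-1, and Phi[c, i] is the symbol observed when the
# learner plays action c and the adversary plays outcome i.
def signal_matrix(Phi: np.ndarray, c: int, F: int) -> np.ndarray:
    """Return S_c with (S_c)_{f,i} = I{Phi[c, i] = f}, so that S_c e_i = e_{Phi[c, i]}."""
    d = Phi.shape[1]
    S_c = np.zeros((F, d))
    S_c[Phi[c], np.arange(d)] = 1.0   # column i is the standard basis vector e_{Phi[c, i]}
    return S_c

def stacked_map(Phi: np.ndarray, neighbourhood, F: int) -> np.ndarray:
    """Stack {S_c : c in N_ab} vertically, giving S : R^d -> R^{|N_ab| F}."""
    return np.vstack([signal_matrix(Phi, c, F) for c in neighbourhood])

def locally_observable(S: np.ndarray, la: np.ndarray, lb: np.ndarray, tol: float = 1e-9) -> bool:
    """Test whether l_a - l_b lies in im(S^T) via a least-squares residual check."""
    w, *_ = np.linalg.lstsq(S.T, la - lb, rcond=None)
    return float(np.linalg.norm(S.T @ w - (la - lb))) <= tol
```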
Since we have assumed that $(a,b)$ are not locally observable, it means that $\ell_a - \ell_b \notin \operatorname{im}(S^\top)$. Let $z \in \operatorname{im}(S^\top)$ and $w \in \ker(S)$ be such that $\ell_a - \ell_b = z + w$, which is possible since $\operatorname{im}(S^\top) \oplus \ker(S) = \mathbb{R}^d$. Since $\ell_a - \ell_b \notin \operatorname{im}(S^\top)$ it holds that $w \neq 0$ and $\langle \ell_a - \ell_b, w\rangle = \langle z + w, w\rangle = \langle w, w\rangle \neq 0$. Finally let $q = w / \langle \ell_a - \ell_b, w\rangle$.

To summarize, we have demonstrated the existence of a vector $q \in \mathbb{R}^d$, $q \neq 0$, such that $Sq = 0$ and $\langle \ell_a - \ell_b, q\rangle = 1$. Let $\Delta > 0$ be some small constant to be tuned subsequently and define $u_a = u - \Delta q$ and $u_b = u + \Delta q$ so that
\[
\langle \ell_b - \ell_a, u_a\rangle = \Delta \quad \text{and} \quad \langle \ell_a - \ell_b, u_b\rangle = \Delta. \tag{37.7}
\]
We note that if $\Delta$ is sufficiently small, then $u_a \in C_a$ and $u_b \in C_b$. This means that action $a$ is optimal if the environment plays $u_a$ on average and $b$ is optimal if the environment plays $u_b$ on average (see Fig. 37.3).
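The construction of $q$ is easy to mirror numerically by projecting $\ell_a - \ell_b$ onto $\ker(S)$. The following is a minimal sketch, not the text's construction verbatim; the function name and the assumptions spelled out in the comments (in particular that $u$ lies on the face shared by $C_a$ and $C_b$) are illustrative.

```python
import numpy as np

# Illustrative sketch (assumed inputs): S is the stacked signal matrix from the
# sketch above, la and lb are the loss vectors of actions a and b, and u is a
# probability vector with positive coordinates lying on the face shared by C_a
# and C_b, so that <l_a - l_b, u> = 0.
def perturbed_environments(S, la, lb, u, Delta):
    diff = la - lb
    # Minimum-norm least squares gives the projection of diff onto im(S^T);
    # the residual w is the ker(S)-component of the decomposition diff = z + w.
    x, *_ = np.linalg.lstsq(S.T, diff, rcond=None)
    w = diff - S.T @ x
    assert np.linalg.norm(w) > 1e-12, "w = 0 would mean (a, b) is locally observable"
    q = w / np.dot(diff, w)          # Sq = 0 and <l_a - l_b, q> = 1
    ua, ub = u - Delta * q, u + Delta * q
    # Sanity checks mirroring Eq. (37.7) and S u_a = S u_b (they rely on the
    # assumption that u lies on the shared face of C_a and C_b):
    assert np.allclose(S @ ua, S @ ub)
    assert np.isclose(np.dot(lb - la, ua), Delta)
    assert np.isclose(np.dot(la - lb, ub), Delta)
    return q, ua, ub
```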
Step 2: Calculating the relative entropy
Given action $c$ and $w \in \mathcal{P}_{d-1}$, let $P_{cw}$ be the distribution on the feedback observed by the learner when playing action $c$ in the stochastic partial monitoring environment determined by $w$. That is, $P_{cw}(f) = \mathbb{P}_w(\Phi_t = f \mid A_t = c) = (S_c w)_f$. Further, let $\mathbb{P}_w$ be the distribution on the histories $H_n = (A_1, \Phi_1, \ldots, A_n, \Phi_n)$ arising from the interaction of the learner's policy with the stochastic environment determined by $w$. A modification of Lemma 15.1 shows that
\[
\mathrm{D}(\mathbb{P}_{u_a}, \mathbb{P}_{u_b}) = \sum_{c \in [k]} \mathbb{E}[T_c(n)]\, \mathrm{D}(P_{c u_a}, P_{c u_b}). \tag{37.8}
\]
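Each term on the right-hand side of Eq. (37.8) is a relative entropy between two categorical distributions over feedback symbols. A minimal sketch of how one such term could be evaluated numerically (illustrative only; the function name is an assumption):

```python
import numpy as np

# Illustrative sketch: one term of the sum in Eq. (37.8), the relative entropy
# between the categorical feedback distributions P_{c u_a} = S_c u_a and
# P_{c u_b} = S_c u_b.
def feedback_kl(S_c: np.ndarray, ua: np.ndarray, ub: np.ndarray) -> float:
    p, q = S_c @ ua, S_c @ ub
    mask = p > 0                      # terms with p_f = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

For $c \in N_{ab}$ this evaluates to zero because $S_c u_a = S_c u_b$, which is exactly the observation made next.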
By the definitions of $u_a$ and $u_b$, we have $S_c u_a = S_c u_b$ for all $c \in N_{ab}$, since $u_a - u_b = -2\Delta q$ and $S_c q$ is a block of $Sq = 0$. Therefore $P_{c u_a} = P_{c u_b}$ and $\mathrm{D}(P_{c u_a}, P_{c u_b}) = 0$ for all $c \in N_{ab}$. On the other hand, if $c \notin N_{ab}$, then by Eq. (37.4) and using that $u_{ai} - u_{bi} = -2\Delta q_i$ and $u_{bi} = u_i + \Delta q_i$,
\[
\mathrm{D}(P_{c u_a}, P_{c u_b}) \le \mathrm{D}(u_a, u_b) \le \sum_{i=1}^{d} \frac{(u_{ai} - u_{bi})^2}{u_{bi}} = 4\Delta^2 \sum_{i=1}^{d} \frac{q_i^2}{u_i + \Delta q_i} \le C_u \Delta^2,
\]