Bandit Algorithms

15.2 Minimax lower bounds

Figure 15.1 The idea of the minimax lower bound. Given a policy and one environment,
the evil antagonist picks another environment so that the policy will suffer a large regret
in at least one of the two environments.

Theorem 15.2. Let k > 1 and n ≥ k − 1. Then for any policy π there exists a
mean vector μ ∈ [0, 1]^k such that

\[ R_n(\pi, \nu_\mu) \;\geq\; \frac{1}{27}\sqrt{(k-1)n}\,. \]

Since ν_μ ∈ E_k, it follows that the minimax regret for E_k is lower bounded by
the right-hand side of the above display as soon as n ≥ k − 1:

\[ R_n^*(E_k) \;\geq\; \frac{1}{27}\sqrt{(k-1)n}\,. \]
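As an illustrative instantiation of the bound (the numbers are not from the text), with k = 10 arms and horizon n = 1000,

\[ R_n^*(E_k) \;\geq\; \frac{1}{27}\sqrt{9 \cdot 1000} \;\approx\; 3.5\,, \]

so the worst-case regret of any policy grows at least like √((k−1)n) in the horizon and the number of arms.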

The idea of the proof is illustrated in Fig. 15.1.


Proof. Fix a policy π. Let ∆ ∈ [0, 1/2] be some constant to be chosen later. As
suggested in Chapter 13, we start with a Gaussian bandit with unit variance
and mean vector μ = (∆, 0, 0, ..., 0). This environment and π give rise to the
distribution P_{ν_μ,π} on the canonical bandit model (H_n, F_n). For brevity we will
use P_μ in place of P_{ν_μ,π}, and expectations under P_μ will be denoted by E_μ. To
choose the second environment, let

\[ i = \operatorname{argmin}_{j > 1} E_\mu[T_j(n)]\,. \]

Since

\[ \sum_{j=1}^{k} E_\mu[T_j(n)] = n\,, \]

it holds that E_μ[T_i(n)] ≤ n/(k − 1): the minimum over the k − 1 arms j > 1 is at most
their average, which is at most n/(k − 1). The second bandit is also Gaussian with unit
variance and means

\[ \mu' = (\Delta, 0, 0, \ldots, 0, 2\Delta, 0, \ldots, 0)\,, \]

where specifically μ′_i = 2∆. Therefore μ_j = μ′_j except at index i, and the optimal
arm in ν_μ is the first arm, while in ν_{μ′} arm i is optimal. We abbreviate P_{μ′} = P_{ν_{μ′},π}.
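The construction is easy to play with numerically. Below is a minimal simulation sketch (not from the text): it runs a fixed policy on ν_μ, estimates E_μ[T_j(n)] by Monte Carlo, selects the least-played arm i among j > 1, and forms μ′ by raising the mean of arm i to 2∆. The UCB-style policy and all constants (k, n, ∆, number of runs) are illustrative assumptions; any policy π could be substituted.

```python
# Sketch of the two-environment construction in the proof (illustrative choices only).
# Arms are 0-indexed here, so "arm 1" of the text is index 0.
import numpy as np

k, n, delta, n_runs = 5, 1000, 0.25, 50
rng = np.random.default_rng(0)
mu = np.zeros(k)
mu[0] = delta  # first environment nu_mu: arm 0 is optimal by a gap of delta

def run_policy(means, n, rng):
    """Run a UCB-style policy on a unit-variance Gaussian bandit; return pull counts."""
    counts = np.zeros(k)
    sums = np.zeros(k)
    for t in range(n):
        if t < k:
            a = t  # play each arm once to initialise
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(n) / counts)
            a = int(np.argmax(ucb))
        counts[a] += 1
        sums[a] += rng.normal(means[a], 1.0)
    return counts

# Estimate E_mu[T_j(n)] by averaging pull counts over independent runs.
avg_counts = np.mean([run_policy(mu, n, rng) for _ in range(n_runs)], axis=0)

# Arm i among j > 1 (indices 1..k-1) that the policy plays least in expectation;
# by the averaging argument its expected pull count is at most n / (k - 1).
i = 1 + int(np.argmin(avg_counts[1:]))
assert avg_counts[i] <= n / (k - 1) + 1e-9

# Second environment nu_mu': identical to mu except arm i now has mean 2 * delta,
# so arm i is optimal in nu_mu' while arm 0 remains optimal in nu_mu.
mu_prime = mu.copy()
mu_prime[i] = 2 * delta
print(i, avg_counts[i], mu, mu_prime)
```

Whatever the policy, the pigeonhole bound guarantees it cannot play arm i often under ν_μ, which is what makes the two environments hard to distinguish.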
