15.2 Minimax lower bounds
Figure 15.1: The idea of the minimax lower bound. Given a policy and one environment, the evil antagonist picks another environment so that the policy will suffer a large regret in at least one environment.
Theorem 15.2. Let $k > 1$ and $n \geq k - 1$. Then for any policy $\pi$ there exists a mean vector $\mu \in [0, 1]^k$ such that
\[
R_n(\pi, \nu_\mu) \geq \frac{1}{27}\sqrt{(k-1)n}\,.
\]
Since $\nu_\mu \in \mathcal{E}_k$, it follows that the minimax regret for $\mathcal{E}_k$ is lower bounded by the right-hand side of the above display as soon as $n \geq k - 1$:
\[
R_n^*(\mathcal{E}_k) \geq \frac{1}{27}\sqrt{(k-1)n}\,.
\]
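As a quick check on the scale of the bound (a worked instance, not from the source): for $k = 10$ arms and horizon $n = 10^4$, the theorem gives
\[
R_n^*(\mathcal{E}_k) \geq \frac{1}{27}\sqrt{9 \cdot 10^4} = \frac{300}{27} \approx 11.1\,,
\]
so no policy, however clever, can keep its worst-case regret below roughly $11$ on this class.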
The idea of the proof is illustrated in Fig. 15.1.
Proof Fix a policy $\pi$. Let $\Delta \in [0, 1/2]$ be a constant to be chosen later. As suggested in Chapter 13, we start with a Gaussian bandit with unit variance and mean vector $\mu = (\Delta, 0, 0, \ldots, 0)$. This environment and $\pi$ give rise to the distribution $P_{\nu_\mu, \pi}$ on the canonical bandit model $(\mathcal{H}_n, \mathcal{F}_n)$. For brevity, we write $P_\mu$ in place of $P_{\nu_\mu, \pi}$ and denote expectations under $P_\mu$ by $E_\mu$. To choose the second environment, let
\[
i = \operatorname{argmin}_{j > 1} E_\mu[T_j(n)]\,.
\]
Since $\sum_{j=1}^k E_\mu[T_j(n)] = n$, the $k - 1$ quantities $E_\mu[T_j(n)]$ with $j > 1$ cannot all exceed $n/(k-1)$, and hence $E_\mu[T_i(n)] \leq n/(k-1)$. The second bandit is also Gaussian with unit variance and means
\[
\mu' = (\Delta, 0, 0, \ldots, 0, 2\Delta, 0, \ldots, 0)\,,
\]
where specifically $\mu'_i = 2\Delta$. Therefore $\mu_j = \mu'_j$ except at index $i$, and while the first arm is optimal in $\nu_\mu$, arm $i$ is optimal in $\nu_{\mu'}$. We abbreviate $P_{\mu'} = P_{\nu_{\mu'}, \pi}$.
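To see the construction in action, the following minimal Python sketch (not from the source; the epsilon-greedy policy, the 50-run averaging, and the particular choice of $\Delta$ are all illustrative assumptions) runs a policy on the first environment, finds the suboptimal arm it explores least, and builds $\mu'$ by raising that arm's mean to $2\Delta$:

    import numpy as np

    def run_policy(mu, n, rng, eps=0.1):
        # A simple epsilon-greedy policy, standing in for an arbitrary
        # policy pi, played on a unit-variance Gaussian bandit with mean
        # vector mu. Returns the pull counts T_j(n) for each arm j.
        k = len(mu)
        counts = np.zeros(k, dtype=int)
        sums = np.zeros(k)
        for t in range(n):
            if t < k:
                a = t                     # play each arm once to initialise
            elif rng.random() < eps:
                a = int(rng.integers(k))  # explore uniformly at random
            else:
                a = int(np.argmax(sums / counts))  # exploit empirical means
            counts[a] += 1
            sums[a] += mu[a] + rng.standard_normal()
        return counts

    rng = np.random.default_rng(0)
    k, n = 10, 1000
    # Delta of order sqrt((k-1)/n); the proof fixes the exact constant later.
    delta = min(0.5, np.sqrt((k - 1) / n))
    mu = np.zeros(k)
    mu[0] = delta                         # first environment: arm 1 is optimal

    # Approximate E_mu[T_j(n)] by averaging counts over repeated runs.
    avg = np.mean([run_policy(mu, n, rng) for _ in range(50)], axis=0)

    # Least-explored suboptimal arm. Since the counts sum to n in every
    # run, the pigeonhole bound avg[i] <= n/(k-1) holds for the average too.
    i = 1 + int(np.argmin(avg[1:]))
    assert avg[i] <= n / (k - 1)

    # Second environment: identical to mu except arm i has mean 2*Delta.
    mu_prime = mu.copy()
    mu_prime[i] = 2 * delta

In the proof itself the expectations $E_\mu[T_j(n)]$ are exact rather than simulated, so the bound $E_\mu[T_i(n)] \leq n/(k-1)$ needs no averaging; the sketch only illustrates why the least-explored suboptimal arm is the one the antagonist boosts.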