15.2 Minimax lower bounds

Figure 15.1 The idea of the minimax lower bound. Given a policy and one environment, the evil antagonist picks another environment so that the policy will suffer a large regret in at least one environment.

Theorem 15.2. Let $k > 1$ and $n \ge k - 1$. Then for any policy $\pi$ there exists a mean vector $\mu \in [0,1]^k$ such that
\[
R_n(\pi, \nu_\mu) \ge \frac{1}{27}\sqrt{(k-1)n}\,.
\]
Since $\nu_\mu \in \mathcal{E}_k$, it follows that the minimax regret for $\mathcal{E}_k$ is lower bounded by the right-hand side of the above display as soon as $n \ge k - 1$:
\[
R_n^*(\mathcal{E}_k) \ge \frac{1}{27}\sqrt{(k-1)n}\,.
\]
The idea of the proof is illustrated in Fig. 15.1.
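To get a sense of the scale of this bound (the values of $k$ and $n$ below are chosen purely for illustration), with $k = 10$ arms and horizon $n = 1000$ it evaluates to
\[
R_n^*(\mathcal{E}_k) \ge \frac{1}{27}\sqrt{(10-1)\cdot 1000} = \frac{\sqrt{9000}}{27} \approx 3.5\,,
\]
so no policy can keep its worst-case regret below roughly $3.5$ on this class.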
Proof Fix a policy $\pi$. Let $\Delta \in [0, 1/2]$ be some constant to be chosen later. As suggested in Chapter 13, we start with a Gaussian bandit with unit variance and mean vector $\mu = (\Delta, 0, 0, \ldots, 0)$. This environment and $\pi$ give rise to the distribution $P_{\nu_\mu, \pi}$ on the canonical bandit model $(\mathcal{H}_n, \mathcal{F}_n)$. For brevity we will use $P_\mu$ in place of $P_{\nu_\mu, \pi}$, and expectations under $P_\mu$ will be denoted by $E_\mu$. To choose the second environment, let
\[
i = \operatorname*{argmin}_{j > 1} E_\mu[T_j(n)]\,.
\]
Since $\sum_{j=1}^k E_\mu[T_j(n)] = n$, it holds that $E_\mu[T_i(n)] \le n/(k-1)$.
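The inequality $E_\mu[T_i(n)] \le n/(k-1)$ is just an averaging (pigeonhole) step: the $k-1$ suboptimal arms cannot all be pulled more than $n/(k-1)$ times in expectation, since
\[
(k-1)\min_{j>1} E_\mu[T_j(n)] \le \sum_{j=2}^{k} E_\mu[T_j(n)] \le \sum_{j=1}^{k} E_\mu[T_j(n)] = n\,.
\]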
The second bandit is also Gaussian with unit variance and means $\mu' = (\Delta, 0, 0, \ldots, 0, 2\Delta, 0, \ldots, 0)$, where specifically $\mu'_i = 2\Delta$. Therefore $\mu_j = \mu'_j$ except at index $i$, and the optimal arm in $\nu_\mu$ is the first arm, while in $\nu_{\mu'}$ arm $i$ is optimal. We abbreviate $P_{\mu'} = P_{\nu_{\mu'}, \pi}$.
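Since the construction is essentially a recipe (run the policy against the first environment, locate the suboptimal arm it explores least, and plant a better mean there), a small simulation can make it concrete. The sketch below is only illustrative and is not part of the proof: the callable policy(history, k) is a hypothetical interface invented for this example, arms are indexed from 0 (so arm 0 plays the role of arm 1 in the text), and the empirical pull counts of a single run stand in for the expectations $E_\mu[T_j(n)]$ used in the argument.

import numpy as np

def adversarial_pair(policy, k, n, delta, rng):
    # Build the two Gaussian environments of the lower-bound construction.
    # `policy(history, k)` is a hypothetical interface: it receives the list
    # of (arm, reward) pairs observed so far and returns the next arm to pull.
    mu = np.zeros(k)
    mu[0] = delta                          # first environment: arm 0 is optimal
    history, pulls = [], np.zeros(k, dtype=int)
    for _ in range(n):
        arm = policy(history, k)
        reward = rng.normal(mu[arm], 1.0)  # unit-variance Gaussian reward
        history.append((arm, reward))
        pulls[arm] += 1
    # Least-pulled suboptimal arm: an empirical stand-in for
    # argmin_{j>1} E_mu[T_j(n)]; its count is at most n/(k-1) by pigeonhole.
    i = 1 + int(np.argmin(pulls[1:]))
    mu_prime = mu.copy()
    mu_prime[i] = 2 * delta                # second environment: arm i is optimal
    return mu, mu_prime, pulls

# Example usage with a uniformly random policy.
rng = np.random.default_rng(0)
uniform_policy = lambda history, k: int(rng.integers(k))
mu, mu_prime, pulls = adversarial_pair(uniform_policy, k=5, n=100, delta=0.25, rng=rng)
print(mu, mu_prime, pulls)

The point of choosing the least-explored suboptimal arm is that a change of its mean is hardest for the policy to detect from the data it collects, which is the sense in which the antagonist of Figure 15.1 is adversarial.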