15.2 Minimax lower bounds
Figure 15.1: The idea of the minimax lower bound. Given a policy and one environment, the evil antagonist picks another environment so that the policy will suffer a large regret in at least one environment.
Theorem 15.2. Let $k > 1$ and $n \geq k - 1$. Then for any policy $\pi$ there exists a mean vector $\mu \in [0, 1]^k$ such that
\[
R_n(\pi, \nu_\mu) \geq \frac{1}{27}\sqrt{(k-1)n}\,.
\]
Since $\nu_\mu \in \mathcal{E}_k$, it follows that the minimax regret for $\mathcal{E}_k$ is lower bounded by the right-hand side of the above display as soon as $n \geq k - 1$:
\[
R_n^*(\mathcal{E}_k) \geq \frac{1}{27}\sqrt{(k-1)n}\,.
\]
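As a quick check on the scale of the bound (a worked instance, not from the source): for $k = 10$ arms and horizon $n = 10^4$, the theorem gives
\[
R_n^*(\mathcal{E}_k) \geq \frac{1}{27}\sqrt{9 \cdot 10^4} = \frac{300}{27} \approx 11.1\,,
\]
so no policy, however clever, can keep its worst-case regret below roughly $11$ on this class.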
The idea of the proof is illustrated in Fig. 15.1.
Proof Fix a policy $\pi$. Let $\Delta \in [0, 1/2]$ be a constant to be chosen later. As suggested in Chapter 13, we start with a Gaussian bandit with unit variance and mean vector $\mu = (\Delta, 0, 0, \ldots, 0)$. This environment and $\pi$ give rise to the distribution $P_{\nu_\mu, \pi}$ on the canonical bandit model $(\mathcal{H}_n, \mathcal{F}_n)$. For brevity, we write $P_\mu$ in place of $P_{\nu_\mu, \pi}$ and denote expectations under $P_\mu$ by $E_\mu$. To choose the second environment, let
\[
i = \operatorname{argmin}_{j > 1} E_\mu[T_j(n)]\,.
\]
Since $\sum_{j=1}^k E_\mu[T_j(n)] = n$, the $k - 1$ quantities $E_\mu[T_j(n)]$ with $j > 1$ cannot all exceed $n/(k-1)$, and hence $E_\mu[T_i(n)] \leq n/(k-1)$. The second bandit is also Gaussian with unit variance and means
\[
\mu' = (\Delta, 0, 0, \ldots, 0, 2\Delta, 0, \ldots, 0)\,,
\]
where specifically $\mu'_i = 2\Delta$. Therefore $\mu_j = \mu'_j$ except at index $i$, and while the first arm is optimal in $\nu_\mu$, arm $i$ is optimal in $\nu_{\mu'}$. We abbreviate $P_{\mu'} = P_{\nu_{\mu'}, \pi}$.
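To see the construction in action, the following minimal Python sketch (not from the source; the epsilon-greedy policy, the 50-run averaging, and the particular choice of $\Delta$ are all illustrative assumptions) runs a policy on the first environment, finds the suboptimal arm it explores least, and builds $\mu'$ by raising that arm's mean to $2\Delta$:

    import numpy as np

    def run_policy(mu, n, rng, eps=0.1):
        # A simple epsilon-greedy policy, standing in for an arbitrary
        # policy pi, played on a unit-variance Gaussian bandit with mean
        # vector mu. Returns the pull counts T_j(n) for each arm j.
        k = len(mu)
        counts = np.zeros(k, dtype=int)
        sums = np.zeros(k)
        for t in range(n):
            if t < k:
                a = t                     # play each arm once to initialise
            elif rng.random() < eps:
                a = int(rng.integers(k))  # explore uniformly at random
            else:
                a = int(np.argmax(sums / counts))  # exploit empirical means
            counts[a] += 1
            sums[a] += mu[a] + rng.standard_normal()
        return counts

    rng = np.random.default_rng(0)
    k, n = 10, 1000
    # Delta of order sqrt((k-1)/n); the proof fixes the exact constant later.
    delta = min(0.5, np.sqrt((k - 1) / n))
    mu = np.zeros(k)
    mu[0] = delta                         # first environment: arm 1 is optimal

    # Approximate E_mu[T_j(n)] by averaging counts over repeated runs.
    avg = np.mean([run_policy(mu, n, rng) for _ in range(50)], axis=0)

    # Least-explored suboptimal arm. Since the counts sum to n in every
    # run, the pigeonhole bound avg[i] <= n/(k-1) holds for the average too.
    i = 1 + int(np.argmin(avg[1:]))
    assert avg[i] <= n / (k - 1)

    # Second environment: identical to mu except arm i has mean 2*Delta.
    mu_prime = mu.copy()
    mu_prime[i] = 2 * delta

In the proof itself the expectations $E_\mu[T_j(n)]$ are exact rather than simulated, so the bound $E_\mu[T_i(n)] \leq n/(k-1)$ needs no averaging; the sketch only illustrates why the least-explored suboptimal arm is the one the antagonist boosts.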