Bandit Algorithms

(Jeff_L) #1
This material will be published by Cambridge University Press as Bandit
Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version
is free to view and download for personal use only. Not for re-distribution, sale,
or use in derivative works.©Tor Lattimore and Csaba Szepesvari 2017.
The latest version is available athttp://banditalgs.com. Feedback on any
aspect is very welcome: [email protected]

15 Minimax Lower Bounds


After the short excursion into information theory, let us return to the world
ofk-armed stochastic bandits. In what follows we fix the horizonn >0 and
the number of actionsk >1. This chapter has two components. The first is
an exact calculation of the relative entropy between measures in the canonical
bandit model for a fixed policy and different bandits. In the second component
we prove a minimax lower bound that formalizes the intuitive arguments given
in Chapter 13.

15.1 Relative entropy between bandits


The following result will be used repeatedly. Some generalizations are provided
in the exercises.

Lemma15.1 (Divergence decomposition).Letν= (P 1 ,...,Pk)be the reward
distributions associated with onek-armed bandit, and letν′= (P 1 ′,...,Pk′)be the
reward distributions associated with anotherk-armed bandit. Fix some policyπ
and letPν=PνπandPν′=Pν′πbe the probability measures on the canonical
bandit model (Section 4.6) induced by then-round interconnection ofπandν
(respectively,πandν′). Then

D(Pν,Pν′) =

∑k

i=1

Eν[Ti(n)] D(Pi,Pi′). (15.1)

Proof Assume thatD(Pi,Pi′) < ∞ for alli ∈ [k]. From this it follows
thatPi Pi′. Defineλ=

∑k
i=1Pi+P


i, which is the measure defined by
λ(A) =

∑k
i=1(Pi(A) +P


i(A)) for any measurable setA. Theorem 14.1 shows
that, as long asddPPνν′<+∞,

D(Pν,Pν′) =Eν

[


log

(


dPν
dPν′

)]


.


Recalling thatρis the counting measure over [k], we find that the Radon-Nikodym
derivative ofPνwith respect to the product measure (ρ×λ)nis given in Eq. (4.7)
Free download pdf