Bandit Algorithms

(Jeff_L) #1
34.5 Posterior distributions in bandits 404

34.5 Posterior distributions in bandits


Let (E,G,Q,P) be ak-armed Bayesian bandit environment. Assuming that (E,G)
is a Borel space, then the posterior aftertobservations is a probability kernel
Q(·|·) from the space of histories to (E,G) so that

Q(A|a 1 ,x 1 ,...,at,xt)

is a regular version ofE[IA(ν)|A 1 ,X 1 ,...,At,Xt]. Theorem 34.1 guarantees the
existence of the posterior as long as (E,G) is a Borel space, but the abstract
definition is not very useful for explicit calculations. By making mild assumptions
the posterior can be written in terms of densities. Assume there exists aσ-finite
measureλon (R,B(R)) such thatPνiλfor alli∈[k] andν∈E. Recall from
Chapter 15 that the Radon-Nikodym derivative ofPνπwith respect to (ρ×λ)nis

pνπ(a 1 ,x 1 ,...,an,xn) =

∏n

t=1

πt(at|a 1 ,x 1 ,...,at− 1 xt− 1 )pνat(xt), (34.5)

wherepνais the density ofPνawith respect toλ. Then the posterior aftert
rounds is given by

Q(B|a 1 ,x 1 ,...,at,xt) =


∫Bpνπ(a^1 ,x^1 ,...,at,xt)dQ(ν)
Epνπ(a^1 ,x^1 ,...,at,xt)dQ(ν)

=


B

∏t
∫ s=1pνas(xs)dQ(ν)
E

∏t
s=1pνas(xs)dQ(ν)

, (34.6)


where the second equality follows from Eq. (34.5). The posterior is not
defined when the denominator is zero, which only occurs with probability zero
(Exercise 34.9). Note that the Radon-Nikodym derivativespνa(x) are only unique
up to sets ofPνa-measure zero and so the ‘choice’ of posterior has been converted
to a choice of the Radon-Nikodym derivatives, which in all practical situations
is straightforward. Observe also that Eq. (34.6) is only well defined ifpνas(·) is
G-measurable as a function ofν. Fortunately this is always possible (see Note 7).

Example34.9.The posterior for the Bayesian bandit in Example 34.8 in terms
of its density with respect to the Lebesgue measure is:

q(θ|a︸ 1 ,x 1 ,...,a︷︷ t,x︸t
ht

)∝


∏k

i=1

θiα+si(ht)−^1 (1−θi)β+ti(ht)−si(ht)−^1 ,

wheresi(ht) =

∑t
u=1xuI{au=i}andti(ht) =

∑t
u=1I{au=i}. This means the
posterior is also the product of Beta distributions, each updated according to the
observations from the relevant arm.
Free download pdf