Bandit Algorithms

(Jeff_L) #1
34.4 The Bayesian bandit environment 403

ThenS 1 ,S 2 ,...,Snis a Markov chain with the conditional distribution ofSt+1
givenSta Gaussian with meanSt+μtand variance 1 +σ^2 t.

34.4 The Bayesian bandit environment


The Bayesian bandit model is the same as the frequentist version introduced in
Chapter 4, except that at the beginning of the game an environment is sampled
from the posterior. Of course, the environment is not revealed to the learner, but
its presence forces us to change our conditions on the rewards because the rewards
are dependent on each other through the chosen environment. For simplicity we
treat only the finite,k-armed case, but the more general setup is handled in the
same was as in Chapter 4.
Ak-armed Bayesian bandit environmentis a tuple (E,G,Q,P) where
(E,G) is a measurable space andQis a probability measure on (E,G) called the
prior. The last elementsP= (Pνi:ν∈ E,i∈[k]) is a probability kernel from
E×[k] to (R,B(R)), wherePνiis the reward distribution associated with theith
arm in banditν. A Bayesian bandit environment and policyπ= (πt)nt=1interact
to produce a collection of random variables,ν∈ E, (At)nt=1and (Xt)nt=1with
At∈[k] andXt∈Rthat satisfy

(a)P(ν∈·) =Q(·).
(b)The conditional distribution of actionAtgivenν,A 1 ,X 1 ,...,At− 1 ,Xt− 1 is
πt(·|A 1 ,X 1 ,...,At− 1 ,Xt− 1 ) almost surely.
(c)The conditional distribution of the rewardXtgivenν,A 1 ,X 1 ,...,AtisPνAt
almost surely.

The existence of a probability space carrying random elements satisfying
these conditions is guaranteed by the Ionescu–Tulcea theorem (Theorem 3.3,
Exercise 34.6). The corresponding probability measure will be denoted byPQPπ.

Most of the structure of a Bayesian bandit environment is inP, which
determines the reward distribution for each armiin banditsν∈E.

Example34.8.Ak-armed Bayesian Bernoulli bandit environment could be
defined by lettingE= [0,1]k,G=B(E) andPνi=B(νi). A natural prior in this
case would be a product of Beta(α,β) distributions:

Q(A) =



A

∏k

i=1

qi(xi)dx,

whereqi(x) =xα−^1 (1−x)β−^1 Γ(α+β)/(Γ(α)Γ(β)).
Free download pdf