4.6 The canonical bandit model

the measure in terms of a distribution. Let $p_i(0) = P_i(\{0\})$ and $p_i(1) = 1 - p_i(0)$ and define
$$
p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^{n} \pi(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})\, p_{a_t}(x_t).
$$
The reader can check that $p_{\nu\pi}$ is a distribution on $([k] \times \{0,1\})^n$ and that the associated measure satisfies (a) and (b) above.
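
To make the product concrete, here is a minimal Python sketch (my own illustration, not from the text) that evaluates this density for Bernoulli rewards; the names `history_density` and `policy_prob` are illustrative stand-ins for $p_{\nu\pi}$ and $\pi$.

```python
from typing import Callable, Sequence, Tuple

# p_means[i] = p_i(1), the probability that arm i yields reward 1.
# policy_prob(a, past) stands in for pi(a | a_1, x_1, ..., a_{t-1}, x_{t-1});
# both names are illustrative, not from the text.

def history_density(
    history: Sequence[Tuple[int, int]],          # [(a_1, x_1), ..., (a_n, x_n)]
    p_means: Sequence[float],                    # Bernoulli means of the k arms
    policy_prob: Callable[[int, Sequence[Tuple[int, int]]], float],
) -> float:
    """Evaluate p_{nu,pi}(a_1, x_1, ..., a_n, x_n) as the product over rounds."""
    density = 1.0
    for t, (a, x) in enumerate(history):
        pi_t = policy_prob(a, history[:t])                 # pi(a_t | past)
        p_at = p_means[a] if x == 1 else 1.0 - p_means[a]  # p_{a_t}(x_t)
        density *= pi_t * p_at
    return density

# Example: a policy choosing uniformly at random between k = 2 arms.
if __name__ == "__main__":
    uniform = lambda a, past: 0.5
    print(history_density([(0, 1), (1, 0), (0, 1)], [0.7, 0.4], uniform))
```
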
Making this argument rigorous when $(P_i)$ are not discrete requires the use of Radon–Nikodym derivatives. Let $\lambda$ be a $\sigma$-finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ for which $P_i$ is absolutely continuous with respect to $\lambda$ for all $i$. Next let $p_i = dP_i/d\lambda$ be the Radon–Nikodym derivative of $P_i$ with respect to $\lambda$, which is a function $p_i : \mathbb{R} \to \mathbb{R}$ such that
$$
\int_B p_i \, d\lambda = P_i(B) \quad \text{for all } B \in \mathcal{B}(\mathbb{R}).
$$
Letting $\rho$ be the counting measure with $\rho(B) = |B|$, the density $p_{\nu\pi} : \Omega \to \mathbb{R}$ can now be defined with respect to the product measure $(\rho \times \lambda)^n$ by

$$
p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^{n} \pi(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})\, p_{a_t}(x_t). \tag{4.7}
$$
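
As a quick sanity check (my addition, not from the text), this recovers the discrete construction above: if each $P_i$ is supported on $\{0,1\}$ and $\lambda$ is the counting measure on $\{0,1\}$, then
$$
p_i(x) = \frac{dP_i}{d\lambda}(x) = P_i(\{x\}), \qquad x \in \{0,1\},
$$
so the factor $p_{a_t}(x_t)$ in (4.7) agrees with the $p_{a_t}(x_t)$ used in the first display.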

The reader can again check (more abstractly) that (a) and (b) are satisfied by the measure $P_{\nu\pi}$ defined by
$$
P_{\nu\pi}(B) = \int_B p_{\nu\pi}(\omega)\, (\rho \times \lambda)^n(d\omega) \quad \text{for all } B \in \mathcal{F}_n.
$$
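
For instance (a worked step of my own, not in the text), in the case $n = 1$ the measure of the event that arm $i$ is chosen and the reward lands in $B \in \mathcal{B}(\mathbb{R})$ is
$$
P_{\nu\pi}(\{i\} \times B) = \int_{\{i\} \times B} \pi(a_1)\, p_{a_1}(x_1)\, (\rho \times \lambda)(da_1, dx_1) = \pi(i) \int_B p_i \, d\lambda = \pi(i)\, P_i(B),
$$
which is exactly what one expects when the policy first samples $A_1 \sim \pi$ and then the reward $X_1 \sim P_{A_1}$.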

It is important to emphasize that this choice of $(\Omega_n, \mathcal{F}_n, P_{\nu\pi})$ is not unique. All the construction shows is that a suitable probability space exists. Furthermore, if some quantity of interest depends only on the law of $H_n$, then by Proposition 4.8 there is no loss in generality in choosing $(\Omega_n, \mathcal{F}_n, P_{\nu\pi})$ as the probability space.

A choice of $\lambda$ such that $P_i \ll \lambda$ for all $i$ always exists, since $\lambda = \sum_{i=1}^k P_i$ satisfies this condition. For direct calculations another choice is usually more convenient: for example, the counting measure when the $(P_i)$ are discrete, or the Lebesgue measure when the $(P_i)$ are continuous.
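
Here, as a sketch under assumptions of my own (not from the text), is the same history-density computation for a continuous case: $\lambda$ is the Lebesgue measure and each $P_i$ is a unit-variance Gaussian, so $p_i$ is the Gaussian density. The helper names are again illustrative.

```python
import math
from typing import Callable, Sequence, Tuple

def gaussian_pdf(x: float, mean: float) -> float:
    """Density of N(mean, 1) with respect to the Lebesgue measure."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def history_log_density(
    history: Sequence[Tuple[int, float]],        # [(a_1, x_1), ..., (a_n, x_n)]
    means: Sequence[float],                      # Gaussian means of the k arms
    policy_prob: Callable[[int, Sequence[Tuple[int, float]]], float],
) -> float:
    """Log of p_{nu,pi}(a_1, x_1, ..., a_n, x_n) from (4.7); logs avoid underflow."""
    log_density = 0.0
    for t, (a, x) in enumerate(history):
        log_density += math.log(policy_prob(a, history[:t]))   # log pi(a_t | past)
        log_density += math.log(gaussian_pdf(x, means[a]))     # log p_{a_t}(x_t)
    return log_density

# Example: uniform policy over k = 2 Gaussian arms with means 0 and 1.
if __name__ == "__main__":
    uniform = lambda a, past: 0.5
    print(history_log_density([(0, 0.3), (1, 1.2)], [0.0, 1.0], uniform))
```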

There is another way to define the probability space, which can be useful. Define a collection of independent random variables $(X_{si})_{s \in [n], i \in [k]}$ such that the law of $X_{ti}$ is $P_i$. By Theorem 2.4 these random variables may be defined on $(\Omega, \mathcal{F})$ where $\Omega = \mathbb{R}^{nk}$ and $\mathcal{F} = \mathcal{B}(\mathbb{R}^{nk})$. Then let $X_t = X_{t A_t}$, where the actions $A_t$ are $\mathcal{F}_{t-1}$-measurable with $\mathcal{F}_{t-1} = \sigma(A_1, X_1, \ldots, A_{t-1}, X_{t-1})$. Yet another way is to define $(X_{si})_{s,i}$ as above, but let $X_t = X_{T_{A_t}(t), A_t}$, where $T_i(t) = \sum_{s \le t} \mathbb{1}\{A_s = i\}$ is the number of times arm $i$ has been played up to time $t$. This corresponds to sampling a stack of rewards for each arm at the beginning of the game. Each time the learner chooses an action they receive the reward on top of that arm's stack. All of these models are convenient from time to time. The important thing is that it does not matter which model we choose, because the quantity of ultimate interest depends only on the law of the history $H_n$, which is the same in each case.