4.6 The canonical bandit model

the measure in terms of a distribution. Let $p_i(0) = P_i(\{0\})$ and $p_i(1) = 1 - p_i(0)$ and define
$$
p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^{n} \pi(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})\, p_{a_t}(x_t).
$$
The reader can check that $p_{\nu\pi}$ is a distribution on $([k] \times \{0,1\})^n$ and that the associated measure satisfies (a) and (b) above.
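
To make the product concrete, here is a minimal Python sketch (my own illustration, not from the text) that evaluates this density for Bernoulli rewards; the names `history_density` and `policy_prob` are illustrative stand-ins for $p_{\nu\pi}$ and $\pi$.

```python
from typing import Callable, Sequence, Tuple

# p_means[i] = p_i(1), the probability that arm i yields reward 1.
# policy_prob(a, past) stands in for pi(a | a_1, x_1, ..., a_{t-1}, x_{t-1});
# both names are illustrative, not from the text.

def history_density(
    history: Sequence[Tuple[int, int]],          # [(a_1, x_1), ..., (a_n, x_n)]
    p_means: Sequence[float],                    # Bernoulli means of the k arms
    policy_prob: Callable[[int, Sequence[Tuple[int, int]]], float],
) -> float:
    """Evaluate p_{nu,pi}(a_1, x_1, ..., a_n, x_n) as the product over rounds."""
    density = 1.0
    for t, (a, x) in enumerate(history):
        pi_t = policy_prob(a, history[:t])                 # pi(a_t | past)
        p_at = p_means[a] if x == 1 else 1.0 - p_means[a]  # p_{a_t}(x_t)
        density *= pi_t * p_at
    return density

# Example: a policy choosing uniformly at random between k = 2 arms.
if __name__ == "__main__":
    uniform = lambda a, past: 0.5
    print(history_density([(0, 1), (1, 0), (0, 1)], [0.7, 0.4], uniform))
```
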
Making this argument rigorous when $(P_i)$ are not discrete requires the use of Radon–Nikodym derivatives. Let $\lambda$ be a $\sigma$-finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ for which $P_i$ is absolutely continuous with respect to $\lambda$ for all $i$. Next let $p_i = dP_i/d\lambda$ be the Radon–Nikodym derivative of $P_i$ with respect to $\lambda$, which is a function $p_i : \mathbb{R} \to \mathbb{R}$ such that
$$
\int_B p_i \, d\lambda = P_i(B) \quad \text{for all } B \in \mathcal{B}(\mathbb{R}).
$$
Letting $\rho$ be the counting measure with $\rho(B) = |B|$, the density $p_{\nu\pi} : \Omega \to \mathbb{R}$ can now be defined with respect to the product measure $(\rho \times \lambda)^n$ by

$$
p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^{n} \pi(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})\, p_{a_t}(x_t). \tag{4.7}
$$
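
As a quick sanity check (my addition, not from the text), this recovers the discrete construction above: if each $P_i$ is supported on $\{0,1\}$ and $\lambda$ is the counting measure on $\{0,1\}$, then
$$
p_i(x) = \frac{dP_i}{d\lambda}(x) = P_i(\{x\}), \qquad x \in \{0,1\},
$$
so the factor $p_{a_t}(x_t)$ in (4.7) agrees with the $p_{a_t}(x_t)$ used in the first display.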

The reader can again check (more abstractly) that (a) and (b) are satisfied by the measure $P_{\nu\pi}$ defined by
$$
P_{\nu\pi}(B) = \int_B p_{\nu\pi}(\omega)\, (\rho \times \lambda)^n(d\omega) \quad \text{for all } B \in \mathcal{F}_n.
$$
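
For instance (a worked step of my own, not in the text), in the case $n = 1$ the measure of the event that arm $i$ is chosen and the reward lands in $B \in \mathcal{B}(\mathbb{R})$ is
$$
P_{\nu\pi}(\{i\} \times B) = \int_{\{i\} \times B} \pi(a_1)\, p_{a_1}(x_1)\, (\rho \times \lambda)(da_1, dx_1) = \pi(i) \int_B p_i \, d\lambda = \pi(i)\, P_i(B),
$$
which is exactly what one expects when the policy first samples $A_1 \sim \pi$ and then the reward $X_1 \sim P_{A_1}$.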

It is important to emphasize that this choice of $(\Omega_n, \mathcal{F}_n, P_{\nu\pi})$ is not unique. All the construction shows is that a suitable probability space exists. Furthermore, if some quantity of interest depends only on the law of $H_n$, then by Proposition 4.8 there is no loss in generality in choosing $(\Omega_n, \mathcal{F}_n, P_{\nu\pi})$ as the probability space.

A choice of $\lambda$ such that $P_i \ll \lambda$ for all $i$ always exists, since $\lambda = \sum_{i=1}^k P_i$ satisfies this condition. For direct calculations another choice is usually more convenient: for example, the counting measure when the $(P_i)$ are discrete, or the Lebesgue measure when the $(P_i)$ are continuous.
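
Here, as a sketch under assumptions of my own (not from the text), is the same history-density computation for a continuous case: $\lambda$ is the Lebesgue measure and each $P_i$ is a unit-variance Gaussian, so $p_i$ is the Gaussian density. The helper names are again illustrative.

```python
import math
from typing import Callable, Sequence, Tuple

def gaussian_pdf(x: float, mean: float) -> float:
    """Density of N(mean, 1) with respect to the Lebesgue measure."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def history_log_density(
    history: Sequence[Tuple[int, float]],        # [(a_1, x_1), ..., (a_n, x_n)]
    means: Sequence[float],                      # Gaussian means of the k arms
    policy_prob: Callable[[int, Sequence[Tuple[int, float]]], float],
) -> float:
    """Log of p_{nu,pi}(a_1, x_1, ..., a_n, x_n) from (4.7); logs avoid underflow."""
    log_density = 0.0
    for t, (a, x) in enumerate(history):
        log_density += math.log(policy_prob(a, history[:t]))   # log pi(a_t | past)
        log_density += math.log(gaussian_pdf(x, means[a]))     # log p_{a_t}(x_t)
    return log_density

# Example: uniform policy over k = 2 Gaussian arms with means 0 and 1.
if __name__ == "__main__":
    uniform = lambda a, past: 0.5
    print(history_log_density([(0, 0.3), (1, 1.2)], [0.0, 1.0], uniform))
```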

There is another way to define the probability space, which can be useful. Define a collection of independent random variables $(X_{si})_{s \in [n], i \in [k]}$ such that the law of $X_{ti}$ is $P_i$. By Theorem 2.4 these random variables may be defined on $(\Omega, \mathcal{F})$ where $\Omega = \mathbb{R}^{nk}$ and $\mathcal{F} = \mathcal{B}(\mathbb{R}^{nk})$. Then let $X_t = X_{t A_t}$, where the actions $A_t$ are $\mathcal{F}_{t-1}$-measurable with $\mathcal{F}_{t-1} = \sigma(A_1, X_1, \ldots, A_{t-1}, X_{t-1})$. Yet another way is to define $(X_{si})_{s,i}$ as above, but let $X_t = X_{T_{A_t}(t), A_t}$, where $T_i(t) = \sum_{s \le t} \mathbb{1}\{A_s = i\}$ is the number of times arm $i$ has been played up to time $t$. This corresponds to sampling a stack of rewards for each arm at the beginning of the game. Each time the learner chooses an action they receive the reward on top of that arm's stack. All of these models are convenient from time to time. The important thing is that it does not matter which model we choose, because the quantity of ultimate interest depends only on the law of the history $H_n$, which is the same in each case.