Bandit Algorithms


Lemma 4.5 tells us that a learner should aim to play arms with larger
suboptimality gaps proportionally fewer times.

Note that the suboptimality gap for optimal arm(s) is zero.

Proof of Lemma 4.5  Since $R_n$ is based on summing over rounds and the right-hand side of the lemma statement is based on summing over actions, to convert one sum into the other we introduce indicators. In particular, note that for any fixed $t$ we have $\sum_{a \in \mathcal{A}} \mathbb{I}\{A_t = a\} = 1$. Hence
\[
S_n = \sum_{t=1}^n X_t = \sum_{t=1}^n \sum_{a \in \mathcal{A}} X_t\, \mathbb{I}\{A_t = a\},
\]
and thus

\[
R_n = n\mu^* - \mathbb{E}[S_n] = \sum_{a \in \mathcal{A}} \sum_{t=1}^n \mathbb{E}\big[(\mu^* - X_t)\, \mathbb{I}\{A_t = a\}\big]. \tag{4.6}
\]

The expected reward in round $t$ conditioned on $A_t$ is $\mu_{A_t}$, which means that
\begin{align*}
\mathbb{E}\big[(\mu^* - X_t)\, \mathbb{I}\{A_t = a\} \mid A_t\big] &= \mathbb{I}\{A_t = a\}\, \mathbb{E}[\mu^* - X_t \mid A_t] \\
&= \mathbb{I}\{A_t = a\}(\mu^* - \mu_{A_t}) \\
&= \mathbb{I}\{A_t = a\}(\mu^* - \mu_a) \\
&= \mathbb{I}\{A_t = a\}\, \Delta_a.
\end{align*}

The result is completed by taking the expectation of both sides (tower rule), substituting into Eq. (4.6) and using the definition $T_a(n) = \sum_{t=1}^n \mathbb{I}\{A_t = a\}$, which yields $R_n = \sum_{a \in \mathcal{A}} \Delta_a\, \mathbb{E}[T_a(n)]$.
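The decomposition can also be sanity-checked by simulation. The following is a minimal sketch, not taken from the text: it assumes Gaussian rewards with unit variance, three illustrative arm means, and a uniformly random policy, and compares Monte Carlo estimates of the two sides of the identity.

```python
# Monte Carlo check of the regret decomposition R_n = sum_a Delta_a * E[T_a(n)].
# Illustrative assumptions (not from the text): Gaussian unit-variance rewards,
# hypothetical arm means, and a uniformly random policy.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.5, 0.3, 0.1])   # hypothetical arm means mu_a
gaps = means.max() - means          # suboptimality gaps Delta_a
n_rounds, n_sims = 1000, 2000

total_reward = 0.0
pull_counts = np.zeros(len(means))  # accumulates T_a(n) across simulations

for _ in range(n_sims):
    arms = rng.integers(len(means), size=n_rounds)          # A_t chosen uniformly at random
    total_reward += rng.normal(means[arms], 1.0).sum()      # realised rewards X_t
    pull_counts += np.bincount(arms, minlength=len(means))  # T_a(n) for this run

regret_by_rounds = n_rounds * means.max() - total_reward / n_sims  # n*mu_star - E[S_n]
regret_by_arms = np.dot(gaps, pull_counts / n_sims)                # sum_a Delta_a E[T_a(n)]
print(regret_by_rounds, regret_by_arms)  # agree up to Monte Carlo error
```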

The argument fails when $\mathcal{A}$ is uncountable because you cannot introduce the sum over actions. Of course the solution is to use an integral, but for this we need to assume that $(\mathcal{A}, \mathcal{G})$ is a measurable space. Given a bandit $\nu$ and policy $\pi$, define the measure $G$ on $(\mathcal{A}, \mathcal{G})$ by
\[
G(U) = \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t \in U\}\right],
\]
where the expectation is taken with respect to the measure on outcomes induced by the interaction of $\pi$ and $\nu$.
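For intuition, when $\mathcal{A}$ is finite and $\mathcal{G}$ is its power set, this definition is consistent with the earlier sum:
\[
G(\{a\}) = \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t = a\}\right] = \mathbb{E}[T_a(n)],
\]
so the integral in the lemma below reduces to $\sum_{a \in \mathcal{A}} \Delta_a\, \mathbb{E}[T_a(n)]$ from Lemma 4.5.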
Lemma 4.6. Provided that everything is well defined and appropriately measurable,
\[
R_n = \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t}\right] = \int_{\mathcal{A}} \Delta_a\, dG(a).
\]

For those worried about how to ensure everything is well defined, see Section 4.7.
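A brief sketch of why the first equality in Lemma 4.6 holds, assuming as before that the conditional mean of $X_t$ given $A_t$ is $\mu_{A_t}$ and that all quantities are well defined:
\begin{align*}
R_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^n X_t\right]
&= \sum_{t=1}^n \mathbb{E}\big[\mathbb{E}[\mu^* - X_t \mid A_t]\big] \\
&= \sum_{t=1}^n \mathbb{E}[\mu^* - \mu_{A_t}]
= \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t}\right].
\end{align*}
The second equality in the lemma then follows because, by the definition of $G$, $\mathbb{E}\big[\sum_{t=1}^n f(A_t)\big] = \int_{\mathcal{A}} f(a)\, dG(a)$ for measurable, integrable $f$, applied with $f(a) = \Delta_a$.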

4.6 The canonical bandit model


In most cases the underlying probability space that supports the random rewards
and actions is never mentioned. Occasionally, however, it becomes convenient to