Lemma 4.5 tells us that a learner should aim to use an arm with a larger
suboptimality gap proportionally fewer times.
Note that the suboptimality gap of the optimal arm(s) is zero.
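For instance, in a two-armed bandit with gaps $\Delta_1 = 0$ and $\Delta_2 = 1/10$, a policy that plays the suboptimal arm $\mathbb{E}[T_2(n)] = 100$ times in expectation incurs regret $R_n = \Delta_1\,\mathbb{E}[T_1(n)] + \Delta_2\,\mathbb{E}[T_2(n)] = 0 + \tfrac{1}{10}\cdot 100 = 10$, no matter how often the optimal arm is played.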
Proof of Lemma 4.5  Since $R_n$ is based on summing over the rounds and the right-hand side of the lemma statement is based on summing over the actions, to convert one sum into the other we introduce indicators. In particular, note that for any fixed $t$ we have $\sum_{a \in \mathcal{A}} \mathbb{I}\{A_t = a\} = 1$. Hence
\[
S_n = \sum_{t=1}^n X_t = \sum_{t=1}^n \sum_{a \in \mathcal{A}} X_t \, \mathbb{I}\{A_t = a\},
\]
and thus
\[
R_n = n\mu^* - \mathbb{E}[S_n] = \sum_{a \in \mathcal{A}} \sum_{t=1}^n \mathbb{E}\big[(\mu^* - X_t)\,\mathbb{I}\{A_t = a\}\big]. \tag{4.6}
\]
The expected reward in round $t$ conditioned on $A_t$ is $\mu_{A_t}$, which means that
\[
\begin{aligned}
\mathbb{E}\big[(\mu^* - X_t)\,\mathbb{I}\{A_t = a\} \mid A_t\big]
&= \mathbb{I}\{A_t = a\}\,\mathbb{E}[\mu^* - X_t \mid A_t] \\
&= \mathbb{I}\{A_t = a\}(\mu^* - \mu_{A_t}) \\
&= \mathbb{I}\{A_t = a\}(\mu^* - \mu_a) \\
&= \mathbb{I}\{A_t = a\}\,\Delta_a\,.
\end{aligned}
\]
Taking expectations of both sides (the tower rule for conditional expectations) and substituting into Eq. (4.6) completes the result:
\[
R_n = \sum_{a \in \mathcal{A}} \Delta_a \sum_{t=1}^n \mathbb{E}\big[\mathbb{I}\{A_t = a\}\big] = \sum_{a \in \mathcal{A}} \Delta_a \, \mathbb{E}[T_a(n)]\,,
\]
where the last equality uses the definition of $T_a(n) = \sum_{t=1}^n \mathbb{I}\{A_t = a\}$.
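To make the decomposition concrete, here is a small Monte Carlo sketch in Python (not from the text: the Gaussian bandit, the naive explore-then-commit style policy and all numbers are illustrative assumptions). It estimates both $n\mu^* - \mathbb{E}[S_n]$ and $\sum_{a \in \mathcal{A}} \Delta_a \, \mathbb{E}[T_a(n)]$ and checks that they roughly agree.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Gaussian bandit: arm a yields reward N(mu[a], 1).
mu = np.array([0.5, 0.3, 0.1])       # assumed arm means; arm 0 is optimal
gaps = mu.max() - mu                  # suboptimality gaps Delta_a
n, n_sims, m = 1000, 500, 50          # horizon, repetitions, exploration length

regret_lhs = 0.0                      # estimate of n*mu_star - E[S_n]
exp_counts = np.zeros(len(mu))        # estimate of E[T_a(n)] for each arm

for _ in range(n_sims):
    counts = np.zeros(len(mu))
    sums = np.zeros(len(mu))
    total = 0.0
    for t in range(n):
        # Naive policy for illustration: play each arm m times,
        # then commit to the empirically best arm.
        if t < m * len(mu):
            a = t % len(mu)
        else:
            a = int(np.argmax(sums / counts))
        x = rng.normal(mu[a], 1.0)    # reward X_t
        counts[a] += 1
        sums[a] += x
        total += x
    regret_lhs += (n * mu.max() - total) / n_sims
    exp_counts += counts / n_sims

regret_rhs = float(np.sum(gaps * exp_counts))  # sum_a Delta_a * E[T_a(n)]
print(regret_lhs, regret_rhs)                  # the two estimates roughly agree
\end{verbatim}
Both printed quantities estimate the same regret $R_n$, so they differ only by Monte Carlo error.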
The argument fails when $\mathcal{A}$ is uncountable because you cannot introduce the sum over the actions. Of course the solution is to use an integral, but for this we need to assume that $(\mathcal{A}, \mathcal{G})$ is a measurable space. Given a bandit $\nu$ and policy $\pi$, define the measure $G$ on $(\mathcal{A}, \mathcal{G})$ by
\[
G(U) = \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t \in U\}\right],
\]
where the expectation is taken with respect to the measure on outcomes induced by the interaction of $\pi$ and $\nu$.
Lemma 4.6. Provided that everything is well defined and appropriately measurable,
\[
R_n = \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t}\right] = \int_{\mathcal{A}} \Delta_a \, \mathrm{d}G(a)\,.
\]
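In particular, when $\mathcal{A}$ is countable and $\mathcal{G}$ contains the singletons, $G(\{a\}) = \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t = a\}\right] = \mathbb{E}[T_a(n)]$, and the integral reduces to the sum $\sum_{a \in \mathcal{A}} \Delta_a\, \mathbb{E}[T_a(n)]$ of Lemma 4.5.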
For those worried about how to ensure everything is well defined, see Section 4.7.
4.6 The canonical bandit model ( )
In most cases the underlying probability space that supports the random rewards
and actions is never mentioned. Occasionally, however, it becomes convenient to