Bandit Algorithms


34.6 Bayesian regret


Recall that the regret of policy $\pi$ in a $k$-armed bandit environment $\nu$ over $n$ rounds is
\[
R_n(\pi, \nu) = n\mu^* - \mathbb{E}\left[\sum_{t=1}^n X_t\right]\,, \tag{34.7}
\]
where $\mu^* = \max_{i \in [k]} \mu_i$ and $\mu_i$ is the mean of $P_{\nu i}$. Given a $k$-armed Bayesian
bandit environment $(\mathcal{E}, \mathcal{G}, Q, P)$ and a policy $\pi$, the Bayesian regret is
\[
\mathrm{BR}_n(\pi, Q) = \int_{\mathcal{E}} R_n(\pi, \nu)\, dQ(\nu)\,.
\]
The dependence on $\mathcal{E}$, $\mathcal{G}$ and $P$ is omitted on the grounds that these are
always self-evident from the context. The Bayesian optimal regret is $\mathrm{BR}_n^*(Q) = \inf_\pi \mathrm{BR}_n(\pi, Q)$ and the optimal (regret-minimizing) policy is
\[
\pi^* = \operatorname{argmin}_\pi \mathrm{BR}_n(\pi, Q)\,. \tag{34.8}
\]
Note that the regret-minimizing policy is the same as the reward-maximizing
policy $\pi^* = \operatorname{argmax}_\pi \mathbb{E}\left[\sum_{t=1}^n X_t\right]$, where the expectation is over the interaction of $\pi$ with an environment sampled from the prior $Q$. This policy is known as the Bayesian optimal
policy under prior $Q$. In general there is no guarantee that the (Bayesian)
optimal policy exists, but the nonnegativity of the Bayesian regret ensures that
for any $\varepsilon > 0$ there exists a policy $\pi$ with $\mathrm{BR}_n(\pi, Q) \le \mathrm{BR}_n^*(Q) + \varepsilon$.
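To make these definitions concrete, here is a minimal sketch (not from the book) that estimates the Bayesian regret $\mathrm{BR}_n(\pi, Q)$ by Monte Carlo for a two-armed Bernoulli bandit whose means are drawn independently from a Uniform(0, 1) prior. The choice of prior, the policy and the function names (run_policy, bayesian_regret) are illustrative assumptions; the policy is a simple explore-then-commit rule, not the Bayesian optimal policy.

# Monte Carlo estimate of the Bayesian regret BR_n(pi, Q) -- illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)

def run_policy(mu, n, m=10):
    """Explore-then-commit: play each arm m times, then commit to the arm with
    the highest empirical mean. Returns the total reward collected over n rounds."""
    k = len(mu)
    counts = np.zeros(k)
    sums = np.zeros(k)
    total = 0.0
    for t in range(n):
        if t < k * m:
            a = t % k                              # round-robin exploration
        else:
            a = int(np.argmax(sums / counts))      # commit to the empirical best arm
        x = rng.binomial(1, mu[a])                 # Bernoulli reward X_t
        counts[a] += 1
        sums[a] += x
        total += x
    return total

def bayesian_regret(n, num_samples=2000):
    """Estimate BR_n(pi, Q) = E_{nu ~ Q}[n * mu_star - E[sum_t X_t]] by sampling
    environments from the prior and running the policy once in each."""
    regrets = []
    for _ in range(num_samples):
        mu = rng.uniform(0.0, 1.0, size=2)         # environment nu sampled from the prior Q
        regrets.append(n * mu.max() - run_policy(mu, n))
    return float(np.mean(regrets))

print(bayesian_regret(n=200))

Averaging the single-run regret over environments sampled from $Q$ gives an unbiased estimate of $\mathrm{BR}_n(\pi, Q)$, since the inner expectation in (34.7) is itself estimated by one interaction per sampled environment.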

The fact that the expected regret $R_n(\pi, \nu)$ is nonnegative for all $\nu$ and $\pi$
means that the Bayesian regret is always nonnegative. Perhaps less obviously,
the Bayesian regret of the Bayesian optimal policy can be strictly greater
than zero (Exercise 34.5).
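One way to build intuition for this (a sketch under specific assumptions, not a substitute for the exercise): take horizon $n = 1$, two Bernoulli arms and independent Uniform(0, 1) priors on the means. Any policy must choose its single arm using the prior alone, so the best achievable expected reward is $\max(\mathbb{E}[\mu_1], \mathbb{E}[\mu_2]) = 1/2$, while the expected optimal reward is $\mathbb{E}[\max(\mu_1, \mu_2)] = 2/3$. Hence
\[
\mathrm{BR}_1^*(Q) = \mathbb{E}\left[\max(\mu_1, \mu_2)\right] - \max\bigl(\mathbb{E}[\mu_1], \mathbb{E}[\mu_2]\bigr) = \tfrac{2}{3} - \tfrac{1}{2} = \tfrac{1}{6} > 0\,.
\]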

34.7 Notes


1 In Chapter 4 we defined $\mathcal{E}$ as a set of tuples of probability distributions over
the reals. In a Bayesian bandit environment $(\mathcal{E}, \mathcal{G}, Q, P)$ the set $\mathcal{E}$ is arbitrary.
The reward distributions are given by the probability kernel $P$. The probability
kernel is needed because we are now integrating the regret over $\mathcal{E}$, which may
not be measurable without additional conditions.
2 The Bayesian regret of an algorithm is less informative than the frequentist
regret. By this we mean that a bound on $\mathrm{BR}_n(\pi, Q)$ does not generally imply
a meaningful bound on $R_n(\pi, \nu)$, while if $R_n(\pi, \nu) \le f(\nu)$ for a measurable
function $f$, then $\mathrm{BR}_n(\pi, Q) \le \mathbb{E}[f(\nu)]$, where the expectation is over $\nu$ sampled
from the prior $Q$. This is not an argument against using
a Bayesian algorithm, but rather an argument for the need to analyze the
frequentist regret of Bayesian algorithms.
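As a simple illustration of this transfer (an added example, not from the text): if a policy satisfies a worst-case bound $R_n(\pi, \nu) \le C\sqrt{kn}$ for every $\nu \in \mathcal{E}$ and some constant $C$, then the same bound holds for the Bayesian regret under any prior, since
\[
\mathrm{BR}_n(\pi, Q) = \int_{\mathcal{E}} R_n(\pi, \nu)\, dQ(\nu) \le \int_{\mathcal{E}} C\sqrt{kn}\, dQ(\nu) = C\sqrt{kn}\,.
\]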