Bandit Algorithms

11.5 Notes

7 Building on the previous note, suppose the reward in round $t$ is $X_t = f_t(A_1,\dots,A_t)$, where $f_1,\dots,f_n$ is a sequence of functions chosen in advance by the adversary with $f_t : [k]^t \to [0,1]$. Let $\Pi \subset [k]^n$ be a set of action-sequences. Then the expected policy regret with respect to $\Pi$ is
$$
\max_{(a_1,\dots,a_n) \in \Pi} \sum_{t=1}^n f_t(a_1,\dots,a_t) - \mathbb{E}\left[\sum_{t=1}^n f_t(A_1,\dots,A_t)\right]\,.
$$


Even if $\Pi$ consists only of constant sequences, there still does not exist a policy guaranteeing sublinear regret. The reason is simple. Consider two candidate choices of $f_1,\dots,f_n$: in the first, $f_t(a_1,\dots,a_t) = \mathbb{I}\{a_1 = 1\}$, and in the second, $f_t(a_1,\dots,a_t) = \mathbb{I}\{a_1 = 2\}$. Clearly the learner must suffer linear regret in at least one of these two reactive bandit environments. The problem is that the learner's decision in the first round determines the rewards available in all subsequent rounds, leaving no time for learning. With additional assumptions, however, sublinear regret is possible: for example, by assuming the adversary has limited memory [Arora et al., 2012].
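
The following minimal sketch (not from the text; the horizon n and the first-round probability p1 are arbitrary choices) simulates the two reactive environments above and estimates the policy regret against the constant sequences. Whatever probability the learner assigns to arm 1 in round 1, the two regrets sum to roughly n, so at least one of them is linear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # horizon

# The two reactive environments from the note: in environment j, the reward in
# every round is I{a_1 = j}, so it depends only on the action chosen in round 1.
def reward(env, a1):
    return 1.0 if a1 == env else 0.0

def policy_regret(env, p1):
    # A learner that plays arm 1 in round 1 with probability p1; after round 1
    # all rewards are already determined, so later decisions are irrelevant.
    a1 = 1 if rng.random() < p1 else 2
    learner_total = n * reward(env, a1)
    best_constant = n  # the best constant sequence collects reward 1 every round
    return best_constant - learner_total

# For any choice of p1 the two regrets sum to roughly n on average,
# so at least one of them is linear in the horizon.
for p1 in (0.0, 0.5, 1.0):
    r1 = np.mean([policy_regret(1, p1) for _ in range(500)])
    r2 = np.mean([policy_regret(2, p1) for _ in range(500)])
    print(f"P(A1=1) = {p1}: regret ~ {r1:.0f} in env 1, {r2:.0f} in env 2")
```
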
8 There is a common misconception that the adversarial framework is a good
fit for nonstationary environments. While the framework does not assume the
rewards are stationary, the regret concept used in this chapter has stationarity
built in. A policy designed for minimizing the regret relative to the best action
in hindsight is seldom suitable for nonstationary bandits where the whole point
is to adapt to changes in the optimal arm. In such cases a better benchmark is
to compete with a sequence of actions. For more on nonstationary bandits see
Chapter 31.
9 The estimators in Eq. (11.3) and Eq. (11.6) both have conditional variance $\mathbb{V}_t[\hat X_{ti}] \approx 1/P_{ti}$, which blows up for small $P_{ti}$. It is instructive to think about whether and how $P_{ti}$ can take on very small values. Consider the loss-based estimator given by Eq. (11.6). For this estimator, when $P_{tA_t}$ and $X_t$ are both small, $\hat X_{tA_t}$ can take on a large negative value. Through the update formula Eq. (11.7) this then translates into $P_{t+1,A_t}$ being squashed aggressively towards zero. A similar issue arises with the reward-based estimator given by Eq. (11.3). The difference is that now it is a 'positive surprise' ($P_{tA_t}$ small, $X_t$ large) that pushes the probabilities towards zero. Note, however, that in this case $P_{t+1,i}$ is pushed towards zero for all $i \neq A_t$. This means that dangerously small probabilities are expected to be more frequent for the gains estimator of Eq. (11.3).
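
The sketch below illustrates this asymmetry. It assumes the standard importance-weighted forms: the gains estimator $\hat X_{ti} = \mathbb{I}\{A_t = i\}X_t/P_{ti}$ for Eq. (11.3), the loss-based estimator $\hat X_{ti} = 1 - \mathbb{I}\{A_t = i\}(1-X_t)/P_{ti}$ for Eq. (11.6), and an exponential-weights update for Eq. (11.7); the values of $k$, $\eta$ and $P_{tA_t}$ are arbitrary illustrative choices.

```python
import numpy as np

def exp3_probs(S, eta):
    # Exponential-weights distribution (assumed form of Eq. (11.7)):
    # P_i proportional to exp(eta * S_i), where S_i sums the reward estimates.
    w = np.exp(eta * (S - S.max()))  # subtract max for numerical stability
    return w / w.sum()

k, eta = 3, 0.5
A_t, P_tA = 0, 0.05  # the played arm currently has a small probability

# Gains estimator: Xhat_i = I{A_t = i} * X_t / P_ti. A 'positive surprise'
# (small P_tA, large X_t) gives the played arm a huge positive estimate, so the
# probabilities of all arms i != A_t are pushed towards zero.
X_t = 1.0
S_gain = np.zeros(k)
S_gain[A_t] += X_t / P_tA
print("gains estimator:", exp3_probs(S_gain, eta))

# Loss-based estimator: Xhat_i = 1 - I{A_t = i}(1 - X_t)/P_ti. A 'negative
# surprise' (small P_tA, small X_t) gives the played arm a huge negative
# estimate, so only P_{t+1, A_t} is squashed towards zero.
X_t = 0.0
S_loss = np.ones(k)
S_loss[A_t] = 1.0 - (1.0 - X_t) / P_tA
print("loss estimator: ", exp3_probs(S_loss, eta))
```
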

10 We argued at the beginning of the chapter that deterministic policies are
no good for adversarial bandit problems, which rules out all of the policies
analyzed in Part II. We also showed that the regret of Exp3 grows at most with the square root of the horizon on both stochastic and nonstochastic bandits. One
might wonder if there exists a policy with (near-)optimal regret for adversarial
bandits and logarithmic regret for stochastic bandits. There is a line of work
addressing this question, which shows that such algorithms do exist [Bubeck
and Slivkins, 2012, Seldin and Slivkins, 2014, Auer and Chiang, 2016, Seldin
