Bandit Algorithms


The usefulness of the stochastic model depends on the setting. In particular,
the designer of the bandit algorithm must carefully evaluate whether stochasticity,
stability of the mean and independence are reasonable assumptions. For some
applications the answer will probably be yes, while in others the practitioner may
seek something more robust. This latter situation is the topic of the next few
chapters.

Adversarial bandits


The adversarial bandit model abandons almost all the assumptions on how
the rewards are generated, so much so that the ‘environment’ is often called the
‘adversary’. The adversary has a great deal of power in this model, including the
ability to examine the code of the proposed algorithms and choose the rewards
accordingly. All that is kept from the previous chapters is that the objective will
be framed in terms of how well a policy is able to compete with the best action
in hindsight.
At first sight it seems remarkable that one can say anything at all about such
a general model. And yet, it turns out that this model is not much harder than
the stochastic bandit problem. Why this holds and how to design algorithms that
achieve these guarantees will be explained in the following chapters.
To give you a glimmer of hope, imagine playing the following simple bandit
game with a friend. The horizon is n = 1 and you have two actions. The game
proceeds as follows:
1 You tell your friend your strategy for choosing an action.
2 Your friend secretly chooses rewards x_1 ∈ {0, 1} and x_2 ∈ {0, 1}.
3 You implement your strategy to select A ∈ {1, 2} and receive reward x_A.
4 The regret is R = max{x_1, x_2} − x_A.
Clearly if your friend chooses x_1 = x_2, then your regret is zero no matter what.
Now let's suppose you implement the deterministic strategy A = 1. Then your
friend can choose x_1 = 0 and x_2 = 1 and your regret is R = 1. The trick to
improve on this is to randomize. If you tell your friend: “I will choose A = 1 with
probability one half”, then the best she can do is choose x_1 = 1 and x_2 = 0 (or
reversed). You then pick the rewarding action with probability one half, so your
expected regret is 1/2. You are forgiven if you did not settle on this solution
yourself, because we did not tell you that a strategy may be randomized. With
such a short horizon you cannot do better than this, but for longer games the
relative advantage of the adversary decreases.
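To make the calculation concrete, here is a minimal Python sketch (ours, not the book's; the function names are illustrative) that computes the expected regret of a strategy playing A = 1 with probability p against an adversary who best-responds to that probability.

def adversary_rewards(p):
    # Best response of an adversary who knows p: put reward 1 on the
    # action the player chooses less often (ties broken arbitrarily).
    return (0.0, 1.0) if p >= 0.5 else (1.0, 0.0)

def expected_regret(p):
    # Expected regret of playing A = 1 with probability p against the
    # adversary's best response, using R = max{x_1, x_2} - x_A.
    x1, x2 = adversary_rewards(p)
    return max(x1, x2) - (p * x1 + (1 - p) * x2)

print(expected_regret(1.0))  # deterministic A = 1: regret 1.0
print(expected_regret(0.5))  # uniform randomization: expected regret 0.5

Under this adversary model the expected regret works out to max{p, 1 − p}, which is minimized at p = 1/2, matching the argument above.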


Notes


1 Having derived a bandit algorithm, one can ask how much it takes to break
the guarantees. For example, under what circumstances is the regret of UCB