Bandit Algorithms


Statistician George E. P. Box is famous for writing “All models are wrong,
but some are useful”. In the stochastic bandit model the reward is sampled from
a distribution that depends only on the chosen action. It does not take much
thought to realize this model is almost always wrong. At the macroscopic level
typically considered in bandit problems there is not much that is stochastic about
the world. And even if there were, it is hard to rule out the existence of other
factors (observed or otherwise) influencing the rewards.
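
To make the stochastic model concrete, the following is a minimal sketch in
Python (our own illustration, not from the text), assuming unit-variance
Gaussian rewards; the names `GaussianBandit` and `pull` are hypothetical.

```python
import numpy as np


class GaussianBandit:
    """Minimal stochastic bandit: each arm has a fixed Gaussian reward
    distribution, so the observed reward depends only on the chosen action."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        # A unit-variance Gaussian whose mean depends only on the arm:
        # this is the entire modelling commitment of the stochastic model.
        return self.rng.normal(self.means[arm], 1.0)


env = GaussianBandit([0.0, 0.5, 1.0])
rewards = [env.pull(2) for _ in range(5)]  # five pulls of arm 2
```

The whole model lives in `pull`: the reward distribution is fixed in advance
and depends on the chosen arm alone, which is exactly the assumption the
surrounding discussion questions.
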
The quotation suggests we should not care whether or not the stochastic bandit
model is right. Instead, we should ask if it is useful. In science, models are used for
predicting the outcomes of future experiments and their usefulness is measured by
the quality of the predictions. But how can this be applied to bandit problems?
What predictions can be made based on bandit models? In this respect, we
postulate the following.


The point of bandit models is to predict the performance of algorithms on
future problem instances.

A model can fail in two fundamentally different ways. It can be too specific,
imposing assumptions that are so detached from reality that a catastrophic
mismatch between actual and predicted performance may arise.

Not all assumptions are equally important. It is a critical assumption in
stochastic bandits that the mean reward of individual arms does not change
(significantly) over time. On the other hand, the assumption that a single,
arm-dependent distribution generates the rewards for a given arm plays
a relatively insignificant role. The reader is encouraged to think of cases
when the constancy of the arm distributions plays no role, and of cases when
it does, and furthermore to decide to what extent the algorithms can tolerate
deviations from the assumption that the means of the arms stay the same; a
small simulation sketch of this exercise follows the note. Stochastic bandits
where the means of the arms change over time are called nonstationary and
are the topic of Chapter 31.
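
One way to carry out the exercise in the note is a small simulation like the
one below (our construction, not the book's): a follow-the-leader strategy,
which trusts stale empirical means, is run on a problem where the identity of
the best arm changes halfway through. The constants are illustrative and the
point is only qualitative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Arm means drift: arm 0 is best in the first half, arm 1 in the second.
def mean(arm, t):
    return (0.6, 0.4)[arm] if t < n // 2 else (0.4, 0.6)[arm]

counts = np.zeros(2)
sums = np.zeros(2)
regret = 0.0
for t in range(n):
    # Pull each arm once, then always follow the empirical leader.
    arm = t if t < 2 else int(np.argmax(sums / counts))
    reward = rng.normal(mean(arm, t), 1.0)
    counts[arm] += 1
    sums[arm] += reward
    regret += max(mean(0, t), mean(1, t)) - mean(arm, t)

print(f"follow-the-leader regret under drift: {regret:.0f}")
```

After the change point the leader's large sample count keeps its empirical
mean anchored near the old value, so the stale arm keeps being pulled and
regret grows linearly; this is the failure mode that motivates the dedicated
algorithms of Chapter 31.
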

The second mode of failure occurs when a model is too general, which makes
the resulting algorithm overly cautious and harms performance.

If a highly specialized model is actually correct, then the resulting algorithms
usually dominate algorithms derived for a more general model. This is a
general manifestation of the bias-variance tradeoff, well known in supervised
learning and statistics. The holy grail is to find algorithms that work
‘optimally’ across a range of models. The reader should think about examples
from the previous chapters that illustrate these points; a small numerical
sketch of the tradeoff follows.
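
One way to see the tradeoff numerically (our sketch, with illustrative
constants) is to run a UCB-style index policy twice on Bernoulli rewards:
once with the confidence width justified for general 1-subgaussian rewards,
and once with the narrower width that exploits the Bernoulli variance bound
of 1/4. When the specialized assumption is correct, the narrower width
explores less and wins; on heavier-tailed rewards the same width could fail.

```python
import numpy as np

def ucb_regret(var_bound, means, n=5000, seed=2):
    """Regret of UCB with index mean_i + sqrt(2 * var_bound * log(n) / T_i)."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret = 0.0
    for t in range(n):
        if t < k:
            arm = t  # pull each arm once to initialise the estimates
        else:
            index = sums / counts + np.sqrt(2 * var_bound * np.log(n) / counts)
            arm = int(np.argmax(index))
        reward = float(rng.random() < means[arm])  # Bernoulli(means[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += max(means) - means[arm]
    return regret

means = [0.5, 0.6]
print("general width   (variance bound 1):  ", round(ucb_regret(1.0, means)))
print("Bernoulli width (variance bound 1/4):", round(ucb_regret(0.25, means)))
```

When its assumption holds, the specialized width pulls the suboptimal arm
less often and incurs lower regret, illustrating how a correct, more specific
model dominates a more general one.
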