Bandit Algorithms

Until now we have indulged ourselves by presenting algorithms and upper
bounds on their regret. As satisfying as this is, the real truth of a problem is
usually to be found in the lower bounds. There are several reasons for this:

1 An upper bound does not tell you much about what you could be missing
out on. The only way to demonstrate that your algorithm really is (close to)
optimal is to prove a lower bound showing that no algorithm can do better.
2 The second reason is that lower bounds are often more informative in the
sense that it usually turns out to be easier to get the lower bound right than
the upper bound. History shows a list of algorithms with steadily improving
guarantees until eventually someone hits upon the idea for which the upper
bound matches some known lower bound.
3 Finally, thinking about lower bounds forces you to understand what is hard
about the problem. This is so useful that the best place to start when attacking
a new problem is usually to try and prove lower bounds. Too often we have
not heeded our own advice and started trying to design an algorithm, only
to discover later that had we tackled the lower bound first, then the right
algorithm would have fallen in our laps with almost no effort at all.

So what is the form of a typical lower bound? In the chapters that follow we will
see roughly two flavors. The first is the worst-case lower bound, which corresponds
to a claim of the form
“For any policy you give me, I will give you an instance of a bandit problem ν on which
the regret is at least L”.

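Written symbolically, with R_n(π, ν) used here only as an informal shorthand for the expected regret of policy π on instance ν over n rounds, a claim of this kind says that

\[
\sup_{\nu} R_n(\pi, \nu) \ge L \quad \text{for every policy } \pi,
\qquad \text{equivalently} \qquad
\inf_{\pi} \sup_{\nu} R_n(\pi, \nu) \ge L\,.
\]

Even the best policy must suffer regret at least L on its worst instance.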
Results of this kind have an adversarial flavor, which makes them suitable for
understanding the robustness of a policy. The second type is a lower bound on
the regret of an algorithm for specific instances. These bounds have a different
form that usually reads like the following:
“If you give me a reasonable policy, then its regret on any instance ν is at least L(ν)”.

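In the same informal notation, this second kind of claim says that

\[
R_n(\pi, \nu) \ge L(\nu) \quad \text{for every instance } \nu\,,
\]

provided π is ‘reasonable’, with the crucial difference that the lower bound L(ν) may now depend on the instance rather than being a single worst-case constant.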
The statement only holds for some policies – the ‘reasonable’ ones, whatever that
means. But the guarantee is also more refined because the bound controls the regret
for these policies on every instance by a function that depends on this instance.
This kind of bound will allow us to show that the instance-dependent bounds
for stochastic bandits of O(∑_{i:∆_i>0} (∆_i + log(n)/∆_i)) are not improvable. The
inclusion of the word ‘reasonable’ is unfortunately necessary. For every bandit
instanceν there is a policy that just chooses the optimal action inν. Such
policies are not reasonable because they have linear regret for bandits with a
different optimal arm. In the chapters that follow we will see various ways to
define ‘reasonable’ in a way that is simultaneously rigorous and, well, reasonable.
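To make the last point concrete in the same informal notation (the symbols i*(ν) and π_ν below are introduced only for this illustration): fix an instance ν, write i*(ν) for its optimal arm, and let π_ν be the policy that plays i*(ν) in every round. Then

\[
R_n(\pi_\nu, \nu) = 0\,, \qquad \text{while} \qquad R_n(\pi_\nu, \nu') = n\,\Delta = \Omega(n)
\]

for any instance ν' in which arm i*(ν) is suboptimal with gap ∆ > 0. Since π_ν has zero regret on ν, no instance-dependent lower bound with L(ν) > 0 can hold for it, even though it is useless on other instances; hence the need to restrict attention to ‘reasonable’ policies.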
The contents of this part are roughly as follows. First we introduce the definition
of worst-case regret and discuss the line of attack for proving lower bounds
(Chapter 13). The next chapter takes us on a brief excursion into information
theory where we explain the necessary mathematical tools (Chapter 14). Readers
