Bandit Algorithms

Until now we have indulged ourselves by presenting algorithms and upper
bounds on their regret. As satisfying as this is, the real truth of a problem is
usually to be found in the lower bounds. There are several reasons for this:

1 An upper bound does not tell you much about what you could be missing
out on. The only way to demonstrate that your algorithm really is (close to)
optimal is to prove a lower bound showing that no algorithm can do better.
2 The second reason is that lower bounds are often more informative in the
sense that it usually turns out to be easier to get the lower bound right than
the upper bound. History shows a list of algorithms with steadily improving
guarantees until eventually someone hits upon the idea for which the upper
bound matches some known lower bound.
3 Finally, thinking about lower bounds forces you to understand what is hard
about the problem. This is so useful that the best place to start when attacking
a new problem is usually to try and prove lower bounds. Too often we have
not heeded our own advice and started trying to design an algorithm, only
to discover later that had we tackled the lower bound first, then the right
algorithm would have fallen in our laps with almost no effort at all.

So what is the form of a typical lower bound? In the chapters that follow we will
see roughly two flavors. The first is the worst-case lower bound, which corresponds
to a claim of the form
“For any policy you give me, I will give you an instance of a bandit problem ν on which
the regret is at least L”.

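Written symbolically, with R_n(π, ν) used here only as an informal shorthand for the expected regret of policy π on instance ν over n rounds, a claim of this kind says that

\[
\sup_{\nu} R_n(\pi, \nu) \ge L \quad \text{for every policy } \pi,
\qquad \text{equivalently} \qquad
\inf_{\pi} \sup_{\nu} R_n(\pi, \nu) \ge L\,.
\]

Even the best policy must suffer regret at least L on its worst instance.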
Results of this kind have an adversarial flavor, which makes them suitable for
understanding the robustness of a policy. The second type is a lower bound on
the regret of an algorithm for specific instances. These bounds have a different
form that usually reads like the following:
“If you give me a reasonable policy, then its regret on any instance ν is at least L(ν)”.

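In the same informal notation, this second kind of claim says that

\[
R_n(\pi, \nu) \ge L(\nu) \quad \text{for every instance } \nu\,,
\]

provided π is ‘reasonable’, with the crucial difference that the lower bound L(ν) may now depend on the instance rather than being a single worst-case constant.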
The statement only holds for some policies – the ‘reasonable’ ones, whatever that
means. But the guarantee is also more refined because the bound controls the regret
for these policies on every instance by a function that depends on this instance.
This kind of bound will allow us to show that the instance-dependent bounds
for stochastic bandits of O(∑_{i:∆_i>0} (∆_i + log(n)/∆_i)) are not improvable. The
inclusion of the word ‘reasonable’ is unfortunately necessary. For every bandit
instanceν there is a policy that just chooses the optimal action inν. Such
policies are not reasonable because they have linear regret for bandits with a
different optimal arm. In the chapters that follow we will see various ways to
define ‘reasonable’ in a way that is simultaneously rigorous and, well, reasonable.
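To make the last point concrete in the same informal notation (the symbols i*(ν) and π_ν below are introduced only for this illustration): fix an instance ν, write i*(ν) for its optimal arm, and let π_ν be the policy that plays i*(ν) in every round. Then

\[
R_n(\pi_\nu, \nu) = 0\,, \qquad \text{while} \qquad R_n(\pi_\nu, \nu') = n\,\Delta = \Omega(n)
\]

for any instance ν' in which arm i*(ν) is suboptimal with gap ∆ > 0. Since π_ν has zero regret on ν, no instance-dependent lower bound with L(ν) > 0 can hold for it, even though it is useless on other instances; hence the need to restrict attention to ‘reasonable’ policies.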
The contents of this part are roughly as follows. First we introduce the definition
of worst-case regret and discuss the line of attack for proving lower bounds
(Chapter 13). The next chapter takes us on a brief excursion into information
theory where we explain the necessary mathematical tools (Chapter 14). Readers
