Bandit Algorithms


18 Contextual Bandits


In many bandit problems the learner has access to additional information that may
help predict the quality of the actions. Imagine designing a movie recommendation
system where users sequentially request recommendations for which movie to
watch next. It would be inadvisable to ignore demographic information about the
user making the request, or other contextual history such as previously watched
movies or ratings. None of the algorithms presented so far make use of this kind of
additional information. Indeed, they optimize a benchmark (the regret) that also
disregards such contextual data. Essentially they would try to identify the best
single movie in hindsight. In this chapter we present an augmented framework
and regret definition that better models real-world problems where contextual
information is available.

Whenever you design a new benchmark, there are several factors to consider.
Competing with a poor benchmark does not make sense, since even an
algorithm that perfectly matches the benchmark will perform poorly. At
the same time, competing with a better benchmark can be harder from a
learning perspective and this penalty must be offset against the benefits.

The tradeoff just described is fundamental to all machine learning problems.
In statistical estimation, the analogous tradeoff is known as the bias-variance
tradeoff. We will not attempt to answer the question of how to resolve this
tradeoff in this chapter because first we need to see how to effectively compete with
improved benchmarks. The good news is that many of the techniques developed
earlier are easily generalized.

18.1 Contextual bandits: one bandit per context


While contextual bandits can be studied in both the adversarial and stochastic
frameworks, in this chapter we focus on the $k$-armed adversarial model. As usual,
the adversary secretly chooses rewards $(x_t)_{t=1}^n$ where $x_t \in [0,1]^k$, with $x_{ti}$ the reward
associated with arm $i$ in round $t$. The adversary also secretly chooses a sequence
of contexts $(c_t)_{t=1}^n$ where $c_t \in \mathcal{C}$, with $\mathcal{C}$ a set of possible contexts. In each
round $t$ the learner observes the context $c_t$, chooses an action $A_t$, and receives the reward $x_{t A_t}$.
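To make the idea of one bandit per context concrete, the following is a minimal Python sketch that keeps an independent adversarial bandit learner for each observed context and routes every round to the learner associated with the current context. It uses a simplified Exp3-style learner with importance-weighted reward estimates; the names (`Exp3`, `run_per_context`), the learning rate `eta`, and the omission of explicit uniform exploration are illustrative assumptions, not the book's exact algorithm.

```python
import math
import random
from collections import defaultdict


class Exp3:
    """Simplified Exp3-style learner for a k-armed adversarial bandit."""

    def __init__(self, k, eta):
        self.k = k                 # number of arms
        self.eta = eta             # learning rate (assumed, not tuned here)
        self.est_rewards = [0.0] * k  # cumulative importance-weighted reward estimates

    def _probs(self):
        # Exponential weights; subtract the max for numerical stability.
        m = max(self.est_rewards)
        exps = [math.exp(self.eta * (s - m)) for s in self.est_rewards]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self):
        probs = self._probs()
        arm = random.choices(range(self.k), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted estimate: only the played arm's estimate changes;
        # the estimates for unplayed arms are implicitly zero this round.
        self.est_rewards[arm] += reward / prob


def run_per_context(contexts, rewards, k, eta):
    """One bandit per context: an independent Exp3 instance handles the
    rounds in which its context occurs. Contexts must be hashable."""
    learners = defaultdict(lambda: Exp3(k, eta))
    total_reward = 0.0
    for c_t, x_t in zip(contexts, rewards):  # x_t is the length-k reward vector
        learner = learners[c_t]
        arm, prob = learner.act()
        reward = x_t[arm]
        learner.update(arm, prob, reward)
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    # Toy example: two contexts with a different best arm in each.
    ctxs = ["A", "B"] * 500
    rews = [[1.0, 0.0] if c == "A" else [0.0, 1.0] for c in ctxs]
    print(run_per_context(ctxs, rews, k=2, eta=0.1))
```

Note that each per-context learner only sees the rounds in which its own context occurs, so rarely occurring contexts receive very little data. This is precisely the learning cost of competing with the stronger per-context benchmark discussed above.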