Bandit Algorithms


18 Contextual Bandits


In many bandit problems the learner has access to additional information that may
help predict the quality of the actions. Imagine designing a movie recommendation
system where users sequentially request recommendations for which movie to
watch next. It would be inadvisable to ignore demographic information about the
user making the request, or other contextual history such as previously watched
movies or ratings. None of the algorithms presented so far make use of this kind of
additional information. Indeed, they optimize a benchmark (the regret) that also
disregards such contextual data. Essentially they would try to identify the best
single movie in hindsight. In this chapter we present an augmented framework
and regret definition that better models real-world problems where contextual
information is available.

Whenever you design a new benchmark, there are several factors to consider.
Competing with a poor benchmark does not make sense, since even an
algorithm that perfectly matches the benchmark will perform poorly. At
the same time, competing with a better benchmark can be harder from a
learning perspective and this penalty must be offset against the benefits.

The tradeoff just described is fundamental to all machine learning problems.
In statistical estimation, the analogous tradeoff is known as the bias-variance
tradeoff. We will not attempt to answer the question of how to resolve this
tradeoff in this chapter because first we need to see how to effectively compete with
improved benchmarks. The good news is that many of the techniques developed
earlier are easily generalized.

18.1 Contextual bandits: one bandit per context


While contextual bandits can be studied in both the adversarial and stochastic
frameworks, in this chapter we focus on the $k$-armed adversarial model. As usual,
the adversary secretly chooses rewards $(x_t)_{t=1}^n$ where $x_t \in [0,1]^k$, with $x_{ti}$ the reward
associated with arm $i$ in round $t$. The adversary also secretly chooses a sequence
of contexts $(c_t)_{t=1}^n$ where $c_t \in \mathcal{C}$, with $\mathcal{C}$ a set of possible contexts. In each
round $t$ the learner observes the context $c_t$, chooses an action $A_t$, and receives the reward $x_{t A_t}$.
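To make the idea of one bandit per context concrete, the following is a minimal Python sketch that keeps an independent adversarial bandit learner for each observed context and routes every round to the learner associated with the current context. It uses a simplified Exp3-style learner with importance-weighted reward estimates; the names (`Exp3`, `run_per_context`), the learning rate `eta`, and the omission of explicit uniform exploration are illustrative assumptions, not the book's exact algorithm.

```python
import math
import random
from collections import defaultdict


class Exp3:
    """Simplified Exp3-style learner for a k-armed adversarial bandit."""

    def __init__(self, k, eta):
        self.k = k                 # number of arms
        self.eta = eta             # learning rate (assumed, not tuned here)
        self.est_rewards = [0.0] * k  # cumulative importance-weighted reward estimates

    def _probs(self):
        # Exponential weights; subtract the max for numerical stability.
        m = max(self.est_rewards)
        exps = [math.exp(self.eta * (s - m)) for s in self.est_rewards]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self):
        probs = self._probs()
        arm = random.choices(range(self.k), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted estimate: only the played arm's estimate changes;
        # the estimates for unplayed arms are implicitly zero this round.
        self.est_rewards[arm] += reward / prob


def run_per_context(contexts, rewards, k, eta):
    """One bandit per context: an independent Exp3 instance handles the
    rounds in which its context occurs. Contexts must be hashable."""
    learners = defaultdict(lambda: Exp3(k, eta))
    total_reward = 0.0
    for c_t, x_t in zip(contexts, rewards):  # x_t is the length-k reward vector
        learner = learners[c_t]
        arm, prob = learner.act()
        reward = x_t[arm]
        learner.update(arm, prob, reward)
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    # Toy example: two contexts with a different best arm in each.
    ctxs = ["A", "B"] * 500
    rews = [[1.0, 0.0] if c == "A" else [0.0, 1.0] for c in ctxs]
    print(run_per_context(ctxs, rews, k=2, eta=0.1))
```

Note that each per-context learner only sees the rounds in which its own context occurs, so rarely occurring contexts receive very little data. This is precisely the learning cost of competing with the stronger per-context benchmark discussed above.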