Bandit Algorithms

[Joulani et al., 2013, Desautels et al., 2014, Cesa-Bianchi et al., 2016, Vernade
et al., 2017, 2018, Pike-Burke et al., 2018, and others]. Learning with delays is
an interesting example where the adversarial and stochastic models lead to quite
different outcomes. In general the increase in regret due to rewards being delayed
by at most τ rounds is a multiplicative √τ factor for adversarial models and an
additive term only for stochastic models.
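To make the delayed-feedback protocol concrete, here is a minimal Python sketch (not from the book; the function names, the uniform placeholder policy and the Bernoulli reward model are illustrative assumptions): the reward of the action chosen in round t only becomes visible after a random delay of at most τ rounds, so the learner must decide using whatever feedback has arrived so far.

```python
import random

def delayed_bandit_run(n_rounds, n_arms, tau, policy, means):
    """Hypothetical sketch of a stochastic bandit with delays bounded by tau."""
    pending = []   # (arrival_round, arm, reward) not yet shown to the learner
    observed = []  # feedback the learner has actually seen so far
    for t in range(n_rounds):
        # reveal all feedback whose delay has elapsed
        arrived = [x for x in pending if x[0] <= t]
        pending = [x for x in pending if x[0] > t]
        observed.extend((arm, r) for _, arm, r in arrived)

        arm = policy(observed, n_arms)                  # decision uses observed data only
        reward = float(random.random() < means[arm])    # Bernoulli reward (assumption)
        delay = random.randint(0, tau)                  # delay of at most tau rounds
        pending.append((t + delay, arm, reward))
    return observed

def uniform_policy(observed, n_arms):
    # placeholder: any delay-aware bandit algorithm would go here
    return random.randrange(n_arms)

feedback = delayed_bandit_run(n_rounds=1000, n_arms=3, tau=10,
                              policy=uniform_policy, means=[0.2, 0.5, 0.7])
print(len(feedback), "rewards observed by the end of the run")
```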

Graph feedback
There is growing interest in feedback models that lie between the full information
and bandit settings. One way to do this is to let G be a directed graph with K
vertices. The adversary chooses a sequence of loss vectors in [0,1]^K as usual. In
each round the learner chooses a vertex and observes the loss corresponding to that
vertex and its neighbours. The full information and bandit settings are recovered
by choosing the graph to be fully connected or have no edges respectively, but of
course there are many interesting regimes in between. There are many variants on
this basic problem. For example, G might change in each round or be undirected.
Or perhaps the graph is changing and the learner only observes it after choosing
an action. The reader can explore this topic by reading the articles by Mannor
and Shamir [2011], Alon et al. [2013], Kocák et al. [2014], Alon et al. [2015] or
the short book by Valko [2016].
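The following Python sketch (illustrative only; the graph, loss values and function name are made up for this example) shows a single round of the graph feedback model: the learner picks a vertex and is shown the losses of that vertex and its out-neighbours, with the complete graph recovering full information and the edgeless graph recovering bandit feedback.

```python
from typing import Dict, List

def graph_feedback_round(G: Dict[int, List[int]],
                         losses: List[float],
                         action: int) -> Dict[int, float]:
    """Return the losses revealed when `action` is played.

    G maps each vertex to its list of out-neighbours; choosing `action`
    reveals the loss of `action` itself and of every out-neighbour.
    """
    revealed = {action: losses[action]}
    for v in G[action]:
        revealed[v] = losses[v]
    return revealed

K = 4
full_info_graph = {v: [u for u in range(K) if u != v] for v in range(K)}  # complete graph
bandit_graph = {v: [] for v in range(K)}                                  # no edges

losses = [0.3, 0.9, 0.1, 0.6]
print(graph_feedback_round(full_info_graph, losses, action=2))  # all K losses revealed
print(graph_feedback_round(bandit_graph, losses, action=2))     # only the chosen loss
```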
