Bandit Algorithms

[Joulani et al., 2013, Desautels et al., 2014, Cesa-Bianchi et al., 2016, Vernade
et al., 2017, 2018, Pike-Burke et al., 2018, and others]. Learning with delays is
an interesting example where the adversarial and stochastic models lead to quite
different outcomes. In general the increase in regret due to rewards being delayed
by at most τ rounds is a multiplicative √τ factor for adversarial models and an
additive term only for stochastic models.
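To make the delayed-feedback protocol concrete, here is a minimal Python sketch (not from the book; the function names, the uniform placeholder policy and the Bernoulli reward model are illustrative assumptions): the reward of the action chosen in round t only becomes visible after a random delay of at most τ rounds, so the learner must decide using whatever feedback has arrived so far.

```python
import random

def delayed_bandit_run(n_rounds, n_arms, tau, policy, means):
    """Hypothetical sketch of a stochastic bandit with delays bounded by tau."""
    pending = []   # (arrival_round, arm, reward) not yet shown to the learner
    observed = []  # feedback the learner has actually seen so far
    for t in range(n_rounds):
        # reveal all feedback whose delay has elapsed
        arrived = [x for x in pending if x[0] <= t]
        pending = [x for x in pending if x[0] > t]
        observed.extend((arm, r) for _, arm, r in arrived)

        arm = policy(observed, n_arms)                  # decision uses observed data only
        reward = float(random.random() < means[arm])    # Bernoulli reward (assumption)
        delay = random.randint(0, tau)                  # delay of at most tau rounds
        pending.append((t + delay, arm, reward))
    return observed

def uniform_policy(observed, n_arms):
    # placeholder: any delay-aware bandit algorithm would go here
    return random.randrange(n_arms)

feedback = delayed_bandit_run(n_rounds=1000, n_arms=3, tau=10,
                              policy=uniform_policy, means=[0.2, 0.5, 0.7])
print(len(feedback), "rewards observed by the end of the run")
```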

Graph feedback
There is growing interest in feedback models that lie between the full information
and bandit settings. One way to do this is to let G be a directed graph with K
vertices. The adversary chooses a sequence of loss vectors in [0,1]^K as usual. In
each round the learner chooses a vertex and observes the loss corresponding to that
vertex and its neighbours. The full information and bandit settings are recovered
by choosing the graph to be fully connected or have no edges respectively, but of
course there are many interesting regimes in between. There are many variants on
this basic problem. For example, G might change in each round or be undirected.
Or perhaps the graph is changing and the learner only observes it after choosing
an action. The reader can explore this topic by reading the articles by Mannor
and Shamir [2011], Alon et al. [2013], Kocák et al. [2014], Alon et al. [2015] or
the short book by Valko [2016].
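The following Python sketch (illustrative only; the graph, loss values and function name are made up for this example) shows a single round of the graph feedback model: the learner picks a vertex and is shown the losses of that vertex and its out-neighbours, with the complete graph recovering full information and the edgeless graph recovering bandit feedback.

```python
from typing import Dict, List

def graph_feedback_round(G: Dict[int, List[int]],
                         losses: List[float],
                         action: int) -> Dict[int, float]:
    """Return the losses revealed when `action` is played.

    G maps each vertex to its list of out-neighbours; choosing `action`
    reveals the loss of `action` itself and of every out-neighbour.
    """
    revealed = {action: losses[action]}
    for v in G[action]:
        revealed[v] = losses[v]
    return revealed

K = 4
full_info_graph = {v: [u for u in range(K) if u != v] for v in range(K)}  # complete graph
bandit_graph = {v: [] for v in range(K)}                                  # no edges

losses = [0.3, 0.9, 0.1, 0.6]
print(graph_feedback_round(full_info_graph, losses, action=2))  # all K losses revealed
print(graph_feedback_round(bandit_graph, losses, action=2))     # only the chosen loss
```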
