In the penultimate part we collect a few topics to which we could not dedicate
a whole part. When deciding what to include, we balanced our subjective views
on what is important, pedagogical and sufficiently well understood for a book.
Of course we have played favourites with our choices and hope the reader can
forgive us for the omissions. We spend the rest of this introduction outlining
some of the omitted topics.
Continuous-armed bandits
There is a small literature on bandits where the number of actions is infinite.
We covered the linear case in earlier chapters, but the linearity assumption
can be relaxed significantly. Let $\mathcal{A}$ be an arbitrary set and $\mathcal{F}$ a set of functions
from $\mathcal{A}$ to $\mathbb{R}$. The learner is given access to the action set $\mathcal{A}$ and function class
$\mathcal{F}$. In each round the learner chooses an action $A_t \in \mathcal{A}$ and receives reward
$X_t = f(A_t) + \eta_t$, where $\eta_t$ is noise and $f \in \mathcal{F}$ is fixed, but unknown. Of course
this setup is general enough to model all of the stochastic bandits seen so far, but is
perhaps too general to say much. One interesting relaxation is the case where $\mathcal{A}$
is a metric space and $\mathcal{F}$ is the set of Lipschitz functions, sketched below. We refer the reader to
papers by Kleinberg [2005], Auer et al. [2007], Kleinberg et al. [2008], Bubeck
et al. [2011], Slivkins [2014], Magureanu et al. [2014], Combes et al. [2017], as
well as the book of Slivkins [2018].
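To make the Lipschitz relaxation concrete, the following is a minimal sketch in
Python (our own illustration, not an algorithm taken from the sources above) of
the classical discretisation approach analysed by Kleinberg [2005]: cover
$\mathcal{A} = [0,1]$ with roughly $n^{1/3}$ uniform cells and run UCB on the
cell centres. The names \texttt{lipschitz\_ucb}, \texttt{lip} and
\texttt{noise\_std} are ours, and Gaussian noise is an assumption.

\begin{verbatim}
import numpy as np

def lipschitz_ucb(f, n, lip=1.0, noise_std=1.0, rng=None):
    # Uniform discretisation: about (lip * n)^(1/3) cells trades the
    # Lipschitz approximation error against the cost of exploring
    # more arms; UCB then runs on the cell centres.
    rng = np.random.default_rng() if rng is None else rng
    k = max(1, int(np.ceil((lip * n) ** (1 / 3))))
    centres = (np.arange(k) + 0.5) / k
    counts, sums = np.zeros(k), np.zeros(k)
    for t in range(1, n + 1):
        if t <= k:
            a = t - 1                                 # play each cell once
        else:
            means = sums / counts
            width = noise_std * np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(means + width))         # optimistic choice
        x = f(centres[a]) + noise_std * rng.standard_normal()
        counts[a] += 1
        sums[a] += x
    return sums.sum()

# Example: a 1-Lipschitz reward function with its peak at a = 0.3.
total = lipschitz_ucb(lambda a: 1.0 - abs(a - 0.3), n=10_000)
\end{verbatim}

This tuning balances the discretisation error against the cost of learning $k$
arms, which is what yields the $\Theta(n^{2/3})$ regret rate for Lipschitz
bandits on the unit interval.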
Dueling bandits
In the dueling bandit problem the learner chooses two arms $A_t^1, A_t^2$ in each round.
Rather than observing a reward for each arm, the learner observes the winner of
a ‘duel’ between the two arms. Let $K$ be the number of arms and $P \in [0,1]^{K \times K}$
be a matrix where $P_{ij}$ is the probability that arm $i$ beats arm $j$ in a duel. It is
natural to assume that $P_{ij} = 1 - P_{ji}$. A common, but slightly less justifiable,
assumption is the existence of a total ordering $\succ$ on the arms such that if $i \succ j$,
then $P_{ij} > 1/2$. There are at least two notions of regret. Let $i^*$ be the optimal
arm so that $i^* \succ j$ for all $j \neq i^*$. Then the strong and weak regret are defined by
\[
\text{Strong regret} = \mathbb{E}\left[\sum_{t=1}^n \left(P_{i^*,A_t^1} + P_{i^*,A_t^2} - 1\right)\right],
\]
\[
\text{Weak regret} = \mathbb{E}\left[\sum_{t=1}^n \min\left\{P_{i^*,A_t^1} - 1/2,\; P_{i^*,A_t^2} - 1/2\right\}\right].
\]
Both definitions measure the number of times arms with a low probability of
winning a duel against the optimal arm are played. The former definition only
vanishes when $A_t^1 = A_t^2 = i^*$, while the latter is zero as soon as $i^* \in \{A_t^1, A_t^2\}$.
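To illustrate how the two definitions differ, here is a minimal simulation
sketch (our own, in Python; \texttt{duel\_and\_regret}, \texttt{istar} and
\texttt{policy} are hypothetical names) that samples each duel's winner from
the preference matrix and accumulates the per-round strong and weak regret
terms, with the expectations above replaced by empirical sums.

\begin{verbatim}
import numpy as np

def duel_and_regret(P, istar, policy, n, rng=None):
    # P[i, j] is the probability that arm i beats arm j, with
    # P[i, j] = 1 - P[j, i]; istar is the optimal arm, assumed to
    # satisfy P[istar, j] > 1/2 for all j != istar.
    rng = np.random.default_rng() if rng is None else rng
    history, strong, weak = [], 0.0, 0.0
    for _ in range(n):
        a1, a2 = policy(history)
        winner = a1 if rng.random() < P[a1, a2] else a2
        history.append((a1, a2, winner))    # only the winner is observed
        strong += P[istar, a1] + P[istar, a2] - 1.0
        weak += min(P[istar, a1] - 0.5, P[istar, a2] - 0.5)
    return strong, weak

# Three arms with total order 0 > 1 > 2 and a uniformly random policy.
P = np.array([[0.5, 0.7, 0.8],
              [0.3, 0.5, 0.6],
              [0.2, 0.4, 0.5]])
gen = np.random.default_rng(0)
uniform = lambda history: tuple(gen.integers(0, 3, size=2))
strong, weak = duel_and_regret(P, 0, uniform, n=1000, rng=gen)
\end{verbatim}

Under the uniform policy the strong regret grows linearly, whereas a policy
that keeps $i^*$ in every duel already stops the weak regret from growing, and
only a policy that eventually plays $A_t^1 = A_t^2 = i^*$ halts both.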
The dueling bandit problem was introduced by Yue et al. [2009] and has seen
quite a lot of interest since then [Yue and Joachims, 2009, 2011, Ailon et al.,
2014, Zoghi et al., 2014, Jamieson et al., 2015, Zoghi et al., 2015, Dudík et al.,
2015, Komiyama et al., 2015a, Wu and Liu, 2016, Zimmert and Seldin, 2018].