Bandit Algorithms

Convex bandits
Let $A \subset \mathbb{R}^d$ be a convex set. The convex bandit problem comes in both stochastic
and adversarial varieties. In both cases the learner chooses $A_t$ from $A$. In the
stochastic case the learner receives a reward $X_t = f(A_t) + \eta_t$, where $f$ is an
unknown convex function and $\eta_t$ is noise. In the adversarial setting the adversary
chooses a sequence of convex functions $f_1, \dots, f_n$ and the learner receives reward
$X_t = f_t(A_t)$. This problem turned out to be a major challenge over the last decade, with
most approaches leading to suboptimal regret in terms of the horizon. The best
bounds in the stochastic case are by Agarwal et al. [2011], while in the adversarial
case there has been a lot of recent progress [Bubeck et al., 2015a, Bubeck and
Eldan, 2016, Bubeck et al., 2017]. In both cases the dependence of the regret on
the horizon is $O(\sqrt{n})$, which is optimal in the worst case. The major remaining
challenge is to pin down the optimal dependence of the regret on the dimension;
a related open problem is the design of practical low-regret algorithms. The
interested reader may consult Shamir [2013], Hu et al. [2016] for some of the open problems.
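
To make the bandit feedback model concrete, the following is a minimal sketch of the stochastic setting using the classical one-point gradient estimator, written for the equivalent formulation in which the learner minimizes an unknown convex function from noisy evaluations. The step size, exploration radius, noise level and the quadratic test function are illustrative assumptions; this is not one of the algorithms achieving the bounds cited above.

```python
import numpy as np


def one_point_gradient_bandit(f, d, n, radius=1.0, delta=0.1, eta=0.01,
                              noise=0.1, seed=0):
    """Stochastic convex bandit sketch: minimise an unknown convex f over a
    Euclidean ball of the given radius, observing only noisy evaluations."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)                          # current iterate
    losses = []
    for t in range(n):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)               # uniform direction on the unit sphere
        a = x + delta * u                    # action A_t actually played
        y = f(a) + noise * rng.normal()      # bandit feedback: noisy function value
        losses.append(y)
        g = (d / delta) * y * u              # one-point estimate of the gradient at x
        x = x - eta * g                      # gradient step on the estimate
        cap = radius - delta                 # project so that x + delta*u stays feasible
        nrm = np.linalg.norm(x)
        if nrm > cap:
            x *= cap / nrm
    return np.array(losses)


# Illustration: f(a) = ||a - 0.3||^2 over the unit ball in R^5.
losses = one_point_gradient_bandit(lambda a: float(np.sum((a - 0.3) ** 2)),
                                   d=5, n=10_000)
```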

Budgeted bandits
In many problems choosing an action costs some resources. In the bandits with
knapsacks problem the learner starts with a fixed budget $B \in [0, \infty)^d$ over
$d$ resource types. Like in the standard $K$-armed stochastic bandit, the learner
chooses $A_t \in [K]$ and receives a reward $X_t$ sampled from a distribution depending
on $A_t$. The twist is that the game does not end after a fixed number of rounds.
Instead, in each round the environment samples a cost vector $C_t \in [0,1]^d$ from
a distribution that depends on $A_t$. The game ends in the first round $\tau$ where
there exists an $i \in [d]$ such that $\sum_{t=1}^{\tau} C_{ti} > B_i$. This line of work was started by
Badanidiyuru et al. [2013] and has been extended in many directions by Agrawal
and Devanur [2014], Tran-Thanh et al. [2012], Ashwinkumar et al. [2014], Xia
et al. [2015], Agrawal and Devanur [2016], Tran-Thanh et al. [2010], Hanawal
et al. [2015]. A somewhat related idea is the conservative bandit problem, where
the goal is to minimize regret subject to the constraint that the learner must not
be much worse than some known baseline. The constraint limits the amount of
exploration and makes the regret guarantees slightly worse [Sui et al., 2015, Wu
et al., 2016, Kazerouni et al., 2017].
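
The stopping rule is perhaps easiest to see in a small simulation of the knapsack protocol. The sketch below uses Bernoulli rewards and costs and a plain UCB index that ignores resource consumption; these are illustrative assumptions rather than the algorithms analysed in the papers above, which must trade off reward against resource usage.

```python
import numpy as np


def bandits_with_knapsacks(reward_means, cost_means, budget, seed=0):
    """Sketch of the bandits-with-knapsacks protocol.

    reward_means : shape (K,)    mean Bernoulli reward of each arm
    cost_means   : shape (K, d)  mean Bernoulli consumption of each resource
    budget       : shape (d,)    initial budget B over the d resources

    The game runs until some resource is exhausted, not for a fixed horizon.
    The policy (UCB on rewards, ignoring costs) is only a placeholder.
    """
    rng = np.random.default_rng(seed)
    K, d = cost_means.shape
    counts = np.zeros(K)
    reward_sums = np.zeros(K)
    spent = np.zeros(d)
    total_reward = 0.0
    t = 0
    while True:
        t += 1
        if t <= K:
            arm = t - 1                                           # play each arm once
        else:
            ucb = reward_sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = float(rng.random() < reward_means[arm])          # reward X_t
        cost = (rng.random(d) < cost_means[arm]).astype(float)    # cost vector C_t
        spent += cost
        if np.any(spent > budget):      # first round tau with sum_t C_{ti} > B_i
            return total_reward, t      # game over (reward of round tau discarded here)
        total_reward += reward
        counts[arm] += 1
        reward_sums[arm] += reward


# Two arms, two resources, a budget of 50 units of each resource.
total, rounds = bandits_with_knapsacks(
    reward_means=np.array([0.8, 0.5]),
    cost_means=np.array([[0.9, 0.1], [0.2, 0.3]]),
    budget=np.array([50.0, 50.0]),
)
```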

Learning with delays
In many practical applications the feedback to the learner is not immediate. The
time between clicking on a link and buying a product could be minutes, days,
weeks or longer. Similarly, the response to a drug does not come immediately.
In most cases the learner does not have the choice to wait before making the
next decision. Buyers and patients just keep coming. Perhaps the first paper
on online learning with delays is by Weinberger and Ordentlich [2002], who
consider the full information setting. Recently this has become a hot topic and
there has been a lot of follow-up work extending the results in various directions.
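
A minimal way to picture the setting is a UCB-style loop in which the reward from round $t$ only becomes visible some rounds later, so the learner must keep acting on stale statistics. The delay sequence, Bernoulli rewards and the index used below are illustrative assumptions, not the model of any particular paper.

```python
import numpy as np
from collections import defaultdict


def delayed_ucb(means, delays, n, seed=0):
    """UCB sketch with delayed feedback: the reward of round t is only revealed
    to the learner at the start of round t + 1 + delays[t]."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    pending = defaultdict(list)                  # arrival round -> [(arm, reward)]
    chosen = []
    for t in range(n):
        for arm, r in pending.pop(t, []):        # feedback arriving this round
            counts[arm] += 1
            sums[arm] += r
        if np.any(counts == 0):
            arm = t % K                          # round-robin until every arm has feedback
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        chosen.append(arm)
        reward = float(rng.random() < means[arm])         # Bernoulli reward, seen later
        pending[t + 1 + delays[t]].append((arm, reward))  # queue it for the future
    return chosen


# Example: two arms, every reward delayed by 20 rounds.
choices = delayed_ucb(means=[0.7, 0.4], delays=[20] * 1000, n=1000)
```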
