Bandit Algorithms

Convex bandits
Let $A \subset \mathbb{R}^d$ be a convex set. The convex bandit problem comes in both stochastic
and adversarial varieties. In both cases the learner chooses $A_t$ from $A$. In the
stochastic case the learner receives a reward $X_t = f(A_t) + \eta_t$, where $f$ is an
unknown convex function and $\eta_t$ is noise. In the adversarial setting the adversary
chooses a sequence of convex functions $f_1, \dots, f_n$ and the learner receives reward
$X_t = f_t(A_t)$. This problem turned out to be a major challenge over the last decade, with
most approaches leading to suboptimal regret in terms of the horizon. The best
bounds in the stochastic case are by Agarwal et al. [2011], while in the adversarial
case there has been a lot of recent progress [Bubeck et al., 2015a, Bubeck and
Eldan, 2016, Bubeck et al., 2017]. In both cases the dependence of the regret on
the horizon is $O(\sqrt{n})$, which is optimal in the worst case. The major remaining
challenge is to pin down the optimal dependence of the regret on the dimension;
a related open problem is the design of practical low-regret algorithms. The
interested reader may consult Shamir [2013], Hu et al. [2016] for some of the open problems.
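
To make the bandit feedback model concrete, the following is a minimal sketch of the stochastic setting using the classical one-point gradient estimator, written for the equivalent formulation in which the learner minimizes an unknown convex function from noisy evaluations. The step size, exploration radius, noise level and the quadratic test function are illustrative assumptions; this is not one of the algorithms achieving the bounds cited above.

```python
import numpy as np


def one_point_gradient_bandit(f, d, n, radius=1.0, delta=0.1, eta=0.01,
                              noise=0.1, seed=0):
    """Stochastic convex bandit sketch: minimise an unknown convex f over a
    Euclidean ball of the given radius, observing only noisy evaluations."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)                          # current iterate
    losses = []
    for t in range(n):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)               # uniform direction on the unit sphere
        a = x + delta * u                    # action A_t actually played
        y = f(a) + noise * rng.normal()      # bandit feedback: noisy function value
        losses.append(y)
        g = (d / delta) * y * u              # one-point estimate of the gradient at x
        x = x - eta * g                      # gradient step on the estimate
        cap = radius - delta                 # project so that x + delta*u stays feasible
        nrm = np.linalg.norm(x)
        if nrm > cap:
            x *= cap / nrm
    return np.array(losses)


# Illustration: f(a) = ||a - 0.3||^2 over the unit ball in R^5.
losses = one_point_gradient_bandit(lambda a: float(np.sum((a - 0.3) ** 2)),
                                   d=5, n=10_000)
```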

Budgeted bandits
In many problems choosing an action costs some resources. In the bandits with
knapsacks problem the learner starts with a fixed budget $B \in [0, \infty)^d$ over
$d$ resource types. Like in the standard $K$-armed stochastic bandit, the learner
chooses $A_t \in [K]$ and receives a reward $X_t$ sampled from a distribution depending
on $A_t$. The twist is that the game does not end after a fixed number of rounds.
Instead, in each round the environment samples a cost vector $C_t \in [0,1]^d$ from
a distribution that depends on $A_t$. The game ends in the first round $\tau$ where
there exists an $i \in [d]$ such that $\sum_{t=1}^{\tau} C_{ti} > B_i$. This line of work was started by
Badanidiyuru et al. [2013] and has been extended in many directions by Agrawal
and Devanur [2014], Tran-Thanh et al. [2012], Ashwinkumar et al. [2014], Xia
et al. [2015], Agrawal and Devanur [2016], Tran-Thanh et al. [2010], Hanawal
et al. [2015]. A somewhat related idea is the conservative bandit problem, where
the goal is to minimize regret subject to the constraint that the learner must not
be much worse than some known baseline. The constraint limits the amount of
exploration and makes the regret guarantees slightly worse [Sui et al., 2015, Wu
et al., 2016, Kazerouni et al., 2017].
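
The stopping rule is perhaps easiest to see in a small simulation of the knapsack protocol. The sketch below uses Bernoulli rewards and costs and a plain UCB index that ignores resource consumption; these are illustrative assumptions rather than the algorithms analysed in the papers above, which must trade off reward against resource usage.

```python
import numpy as np


def bandits_with_knapsacks(reward_means, cost_means, budget, seed=0):
    """Sketch of the bandits-with-knapsacks protocol.

    reward_means : shape (K,)    mean Bernoulli reward of each arm
    cost_means   : shape (K, d)  mean Bernoulli consumption of each resource
    budget       : shape (d,)    initial budget B over the d resources

    The game runs until some resource is exhausted, not for a fixed horizon.
    The policy (UCB on rewards, ignoring costs) is only a placeholder.
    """
    rng = np.random.default_rng(seed)
    K, d = cost_means.shape
    counts = np.zeros(K)
    reward_sums = np.zeros(K)
    spent = np.zeros(d)
    total_reward = 0.0
    t = 0
    while True:
        t += 1
        if t <= K:
            arm = t - 1                                           # play each arm once
        else:
            ucb = reward_sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = float(rng.random() < reward_means[arm])          # reward X_t
        cost = (rng.random(d) < cost_means[arm]).astype(float)    # cost vector C_t
        spent += cost
        if np.any(spent > budget):      # first round tau with sum_t C_{ti} > B_i
            return total_reward, t      # game over (reward of round tau discarded here)
        total_reward += reward
        counts[arm] += 1
        reward_sums[arm] += reward


# Two arms, two resources, a budget of 50 units of each resource.
total, rounds = bandits_with_knapsacks(
    reward_means=np.array([0.8, 0.5]),
    cost_means=np.array([[0.9, 0.1], [0.2, 0.3]]),
    budget=np.array([50.0, 50.0]),
)
```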

Learning with delays
In many practical applications the feedback to the learner is not immediate. The
time between clicking on a link and buying a product could be minutes, days,
weeks or longer. Similarly, the response to a drug does not come immediately.
In most cases the learner does not have the choice to wait before making the
next decision. Buyers and patients just keep coming. Perhaps the first paper
on online learning with delays is by Weinberger and Ordentlich [2002], who
consider the full information setting. Recently this has become a hot topic and
there has been a lot of follow-up work extending the results in various directions.
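
A minimal way to picture the setting is a UCB-style loop in which the reward from round $t$ only becomes visible some rounds later, so the learner must keep acting on stale statistics. The delay sequence, Bernoulli rewards and the index used below are illustrative assumptions, not the model of any particular paper.

```python
import numpy as np
from collections import defaultdict


def delayed_ucb(means, delays, n, seed=0):
    """UCB sketch with delayed feedback: the reward of round t is only revealed
    to the learner at the start of round t + 1 + delays[t]."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    pending = defaultdict(list)                  # arrival round -> [(arm, reward)]
    chosen = []
    for t in range(n):
        for arm, r in pending.pop(t, []):        # feedback arriving this round
            counts[arm] += 1
            sums[arm] += r
        if np.any(counts == 0):
            arm = t % K                          # round-robin until every arm has feedback
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        chosen.append(arm)
        reward = float(rng.random() < means[arm])         # Bernoulli reward, seen later
        pending[t + 1 + delays[t]].append((arm, reward))  # queue it for the future
    return chosen


# Example: two arms, every reward delayed by 20 rounds.
choices = delayed_ucb(means=[0.7, 0.4], delays=[20] * 1000, n=1000)
```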
