Bandit Algorithms

Bibliography


Information Processing Systems 24, NIPS, pages 1035–1043. Curran Associates,
Inc., 2011. [338]
A. Agarwal, D. P. Foster, D. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic
convex optimization with bandit feedback. SIAM Journal on Optimization,
23(1):213–240, 2013. [389]
A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the
monster: A fast and simple algorithm for contextual bandits. In E. P. Xing
and T. Jebara, editors, Proceedings of the 31st International Conference on
Machine Learning, volume 32 of Proceedings of Machine Learning Research,
pages 1638–1646, Beijing, China, 22–24 Jun 2014. PMLR. [221, 222, 223]
A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li,
D. Melamed, G. Oshri, and O. Ribas. Making contextual decisions with
low technical debt. arXiv preprint arXiv:1606.03966, 2016. [16]
R. Agrawal. Sample mean based index policies with O(log n) regret for the
multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078,
1995. [106, 116]
S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex
knapsacks. In Proceedings of the 15th ACM Conference on Economics and
Computation, pages 989–1006. ACM, 2014. [338]
S. Agrawal and N. R. Devanur. Linear contextual bandits with knapsacks. In
Advances in Neural Information Processing Systems 29, NIPS, pages 3458–3467.
Curran Associates, Inc., 2016. [338]
S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed
bandit problem. In Proceedings of Conference on Learning Theory (COLT),
2012. [444]
S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson
sampling. In C. M. Carvalho and P. Ravikumar, editors, Proceedings of the 16th
International Conference on Artificial Intelligence and Statistics, volume 31 of
Proceedings of Machine Learning Research, pages 99–107, Scottsdale, Arizona,
USA, 29 Apr–01 May 2013a. PMLR. [442, 444]
S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear
payoffs. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th
International Conference on Machine Learning, volume 28 of Proceedings of
Machine Learning Research, pages 127–135, Atlanta, Georgia, USA, 17–19 Jun
2013b. PMLR. [444]
S. Agrawal and R. Jia. Optimistic posterior sampling for reinforcement learning:
worst-case regret bounds. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural
Information Processing Systems 30, NIPS, pages 1184–1194. Curran Associates,
Inc., 2017. [501]
S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. Thompson sampling for
the MNL-bandit. In S. Kale and O. Shamir, editors, Proceedings of the 2017
Conference on Learning Theory, volume 65 of Proceedings of Machine Learning
Research, pages 76–78, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR. [444]