R. N. Bradt, S. M. Johnson, and S. Karlin. On sequential designs for maximizing
the sum of n observations. The Annals of Mathematical Statistics, pages
1060–1074, 1956. [427]
R. Brafman and M. Tennenholtz. R-max – a general polynomial time algorithm for
near-optimal reinforcement learning. Journal of Machine Learning Research, 3:
213–231, 2003. [502]
J. Bretagnolle and C. Huber. Estimation des densités: risque minimax. Zeitschrift
für Wahrscheinlichkeitstheorie und verwandte Gebiete, 47(2):119–137, 1979.
[185, 193]
S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic
Multi-armed Bandit Problems. Foundations and Trends in Machine Learning.
Now Publishers Incorporated, 2012. [15, 106, 152, 323, 326, 350]
S. Bubeck and R. Eldan. The entropic barrier: a simple and optimal universal
self-concordant barrier. In P. Grünwald, E. Hazan, and S. Kale, editors,
Proceedings of The 28th Conference on Learning Theory, volume 40 of
Proceedings of Machine Learning Research, pages 279–279, Paris, France,
03–06 Jul 2015. PMLR. [307]
S. Bubeck and R. Eldan. Multi-scale exploration of convex functions and bandit
convex optimization. In V. Feldman, A. Rakhlin, and O. Shamir, editors, 29th
Annual Conference on Learning Theory, volume 49 of Proceedings of Machine
Learning Research, pages 583–589, Columbia University, New York, New York,
USA, 23–26 Jun 2016. PMLR. [338, 443]
S. Bubeck and C. Liu. Prior-free and prior-dependent regret bounds for Thompson
sampling. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and
K. Q. Weinberger, editors, Advances in Neural Information Processing Systems
26, pages 638–646. Curran Associates, Inc., 2013. [445]
S. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial
bandits. In COLT, pages 42.1–42.23, 2012. [153]
S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits
problems. In International Conference on Algorithmic Learning Theory, pages
23–37. Springer, 2009. [388]
S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári. X-armed bandits. Journal
of Machine Learning Research, 12:1655–1695, 2011. [337]
S. Bubeck, N. Cesa-Bianchi, and S. Kakade. Towards minimax policies for online
linear optimization with bandit feedback. In Annual Conference on Learning
Theory, volume 23, pages 41–1. Microtome, 2012. [259, 307, 323]
S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE
Transactions on Information Theory, 59(11):7711–7717, 2013a. [111]
S. Bubeck, V. Perchet, and P. Rigollet. Bounded regret in stochastic multi-armed
bandits. In S. Shalev-Shwartz and I. Steinwart, editors, Proceedings of the
26th Annual Conference on Learning Theory, volume 30 of Proceedings of
Machine Learning Research, pages 122–134, Princeton, NJ, USA, 12–14 Jun
2013b. PMLR. [193]