Bandit Algorithms


Bibliography


H. Robbins and D. Siegmund. A class of stopping rules for testing parametric
hypotheses. In Proceedings of the Sixth Berkeley Symposium on Mathematical
Statistics and Probability, Volume 4: Biology and Health, pages 37–41. University
of California Press, 1972. [250]
H. Robbins, D. Siegmund, and Y. Chow. Great expectations: the theory of optimal
stopping. Houghton Mifflin, 1971. [426]
S. Robertson. The probability ranking principle in IR. Journal of Documentation,
33(4):294–304, 1977. [375]
R. T. Rockafellar. Convex analysis. Princeton University Press, 2015. [298]
R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk.
Journal of Risk, 2:21–42, 2000. [65]
C. A. Rogers. Packing and covering. Cambridge University Press, 1964. [246]
S. M. Ross.Introduction to Stochastic Dynamic Programming. Academic Press,
New York, 1983. [500]
P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits.
Mathematics of Operations Research, 35(2):395–411, 2010. [91, 235, 278]
D. Russo. Simple Bayesian algorithms for best arm identification. In V. Feldman,
A. Rakhlin, and O. Shamir, editors, 29th Annual Conference on Learning
Theory, volume 49 of Proceedings of Machine Learning Research, pages 1417–
1418, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
[388]
D. Russo and B. Van Roy. Eluder dimension and the sample complexity
of optimistic exploration. In C. J. C. Burges, L. Bottou, M. Welling,
Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information
Processing Systems 26, NIPS, pages 2256–2264. Curran Associates, Inc., 2013.
[235]
D. Russo and B. Van Roy. Learning to optimize via information-directed
sampling. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q.
Weinberger, editors, Advances in Neural Information Processing Systems 27,
NIPS, pages 1583–1591. Curran Associates, Inc., 2014a. [286, 444]
D. Russo and B. Van Roy. Learning to optimize via posterior sampling.
Mathematics of Operations Research, 39(4):1221–1243, 2014b. [444]
D. Russo and B. Van Roy. An information-theoretic analysis of Thompson
sampling. Journal of Machine Learning Research, 17(1):2442–2471, 2016. ISSN
1532-4435. [350, 443, 444]
D. Russo, B. Van Roy, A. Kazerouni, and I. Osband. A tutorial on Thompson
sampling. arXiv preprint arXiv:1707.02038, 2017. [445]
A. Rustichini. Minimizing regret: The general case. Games and Economic
Behavior, 29(1):224–243, 1999. [471, 472]
A. Salomon, J. Audibert, and I. Alaoui. Lower bounds and selectivity of
weak-consistent policies in stochastic multi-armed bandit problem. Journal of
Machine Learning Research, 14(Jan):187–207, 2013. [201]
P. Samuelson. A note on measurement of utility. The Review of Economic Studies,
4(2):155–161, 1937. [425]
