G. Neu. First-order regret bounds for combinatorial semi-bandits. In P. Grünwald,
E. Hazan, and S. Kale, editors, Proceedings of The 28th Conference on Learning
Theory, volume 40 of Proceedings of Machine Learning Research, pages 1360–
1375, Paris, France, 03–06 Jul 2015b. PMLR. [164, 320]
G. Neu, A. György, Cs. Szepesvári, and A. Antos. Online Markov decision
processes under bandit feedback. IEEE Transactions on Automatic Control,
59(3):676–691, December 2014. [502]
J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior.
Princeton University Press, Princeton, 1944. [65]
J. Niño-Mora. Computing a classic index for finite-horizon bandits. INFORMS
Journal on Computing, 23(2):254–267, 2011. [427]
P. A. Ortega and D. A. Braun. A minimum relative entropy principle for learning
and acting. Journal of Artificial Intelligence Research, 38:475–511, 2010. [444]
R. Ortner and D. Ryabko. Online regret bounds for undiscounted continuous
reinforcement learning. In Advances in Neural Information Processing Systems
25, NIPS, pages 1763–1771, USA, 2012. Curran Associates Inc. [501, 502]
R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless Markov
bandits. In N. Bshouty, G. Stoltz, N. Vayatis, and T. Zeugmann, editors,
Algorithmic Learning Theory, pages 214–228, Berlin, Heidelberg, 2012. Springer
Berlin Heidelberg. [360]
I. Osband and B. Van Roy. Why is posterior sampling better than optimism for
reinforcement learning? In D. Precup and Y. W. Teh, editors, Proceedings of the
34th International Conference on Machine Learning, volume 70 of Proceedings
of Machine Learning Research, pages 2701–2710, International Convention
Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. [502]
I. Osband, D. Russo, and B. Van Roy. (More) efficient reinforcement learning via
posterior sampling. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani,
and K. Q. Weinberger, editors, Advances in Neural Information Processing
Systems 26, NIPS, pages 3003–3011. Curran Associates, Inc., 2013. [444, 502]
E. Ostrovsky and L. Sirota. Exact value for subgaussian norm of centered
indicator random variable. arXiv preprint arXiv:1405.6749, 2014. [81]
C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision
processes. Mathematics of Operations Research, 12(3):441–450, 1987. [497]
V. Perchet. Approachability of convex sets in games with partial monitoring.
Journal of Optimization Theory and Applications, 149(3):665–677, 2011. [473]
V. Perchet and P. Rigollet. The multi-armed bandit problem with covariates.
The Annals of Statistics, 41(2):693–721, April 2013. [237]
G. Peskir and A. Shiryaev. Optimal Stopping and Free-Boundary Problems.
Springer, 2006. [52, 426, 428]
A. Piccolboni and C. Schindelhauer. Discrete prediction games with arbitrary
feedback and loss. In Computational Learning Theory, pages 208–223. Springer,
2001. [472]
C. Pike-Burke, S. Agrawal, Cs. Szepesvári, and S. Grunewalder. Bandits
with delayed, aggregated anonymous feedback. In J. Dy and A. Krause,