Bandit Algorithms


BIBLIOGRAPHY


S. Dong and B. Van Roy. An information-theoretic analysis for Thompson
sampling with many actions. arXiv preprint arXiv:1805.11845, 2018. [443]
J. L. Doob. Stochastic processes. Wiley, 1953. [52]
M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and
T. Zhang. Efficient optimal learning for contextual bandits. In Proceedings
of the 27th Conference on Uncertainty in Artificial Intelligence, UAI, pages
169–178. AUAI Press, 2011. [223]
M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi. Contextual
dueling bandits. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of
The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine
Learning Research, pages 563–587, Paris, France, 03–06 Jul 2015. PMLR. [337]
R. M. Dudley. Uniform central limit theorems, volume 142. Cambridge University
Press, 2014. [78, 83, 321]
C. G. Esseen. On the Liapounoff limit of error in the theory of probability.
Almqvist & Wiksell, 1942. [76]
E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit
and Markov decision processes. In Computational Learning Theory, pages
255–270. Springer, 2002. [387, 391]
E. Even-Dar, S. M. Kakade, and Y. Mansour. Experts in a Markov decision
process. In Advances in Neural Information Processing Systems 17, NIPS,
pages 401–408, Cambridge, MA, USA, 2004. MIT Press. [502]
E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping
conditions for the multi-armed bandit and reinforcement learning problems.
Journal of Machine Learning Research, 7(Jun):1079–1105, 2006. [387]
V. V. Fedorov. Theory of optimal experiments. Academic Press, New York, 1972.
[255]
S. Filippi, O. Cappé, A. Garivier, and Cs. Szepesvári. Parametric bandits: The
generalized linear case. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor,
R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing
Systems 23, NIPS, pages 586–594. Curran Associates, Inc., 2010. [235]
D. Fink. A compendium of conjugate priors. 1997. [407]
D. Foster and A. Rakhlin. No internal regret via neighborhood watch. In N. D.
Lawrence and M. Girolami, editors, Proceedings of the 15th International
Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of
Machine Learning Research, pages 382–390, La Palma, Canary Islands, 21–23
Apr 2012. PMLR. [473]
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research
Logistics Quarterly, 3(1-2):95–110, 1956. [255]
S. Frederick, G. Loewenstein, and T. O'Donoghue. Time discounting and time
preference: A critical review. Journal of Economic Literature, 40(2):351–401,
2002. [425]
E. Frostig and G. Weiss. Four proofs of Gittins' multiarmed bandit theorem.
Applied Probability Trust, 70, 1999. [427]
