5 Concentration of Measure
Before we can start designing and analyzing algorithms we need one more tool
from probability theory called concentration of measure. Recall that the
optimal action is the one with the largest mean. Since the mean payoffs are
initially unknown, they must be learned from data. How long does it take to
learn about the mean reward of an action? In this section, after introducing
the notion of tail probabilities, we look at ways of obtaining upper bounds on
them. The main point is to introduce subgaussian random variables and the
Cramér-Chernoff (exponential) tail inequalities, which will play a central role in
the design and analysis of the various bandit algorithms.
5.1 Tail probabilities
Suppose that $X, X_1, X_2, \ldots, X_n$ is a sequence of independent and identically
distributed random variables and assume that the mean $\mu = \mathbb{E}[X]$ and variance
$\sigma^2 = \mathbb{V}[X]$ exist. Having observed $X_1, X_2, \ldots, X_n$ we would like to estimate the
common mean $\mu$. The most natural estimator is
\[
\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i\,,
\]
which is called the sample mean or empirical mean. Linearity of expectation
(Proposition 2.6) shows that $\mathbb{E}[\hat{\mu}] = \mu$, which means that $\hat{\mu}$ is an unbiased
estimator of $\mu$. How far from $\mu$ do we expect $\hat{\mu}$ to be? A simple measure
of the spread of the distribution of a random variable $Z$ is its variance,
$\mathbb{V}[Z] = \mathbb{E}\left[(Z - \mathbb{E}[Z])^2\right]$. A quick calculation using independence shows that
\[
\mathbb{V}[\hat{\mu}] = \mathbb{E}\left[(\hat{\mu} - \mu)^2\right] = \frac{\sigma^2}{n}\,, \tag{5.1}
\]
which means that we expect the squared distance between $\mu$ and $\hat{\mu}$ to shrink as
$n$ grows large at a rate of $1/n$ and scale linearly with the variance of $X$.
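To spell out the quick calculation behind the unbiasedness claim and Eq. (5.1)
(a short derivation of ours, using only linearity of expectation and the fact that
the variance of a sum of independent random variables is the sum of the variances):
\[
\mathbb{E}[\hat{\mu}] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] = \mu
\qquad \text{and} \qquad
\mathbb{V}[\hat{\mu}] = \frac{1}{n^2}\,\mathbb{V}\left[\sum_{i=1}^n X_i\right]
= \frac{1}{n^2}\sum_{i=1}^n \mathbb{V}[X_i] = \frac{\sigma^2}{n}\,.
\]
The $1/n$ rate is also easy to check numerically. The following minimal Python
sketch (our illustration, not part of the text; the Gaussian reward distribution,
the seed and the sample sizes are arbitrary choices) estimates $\mathbb{E}[(\hat{\mu} - \mu)^2]$
by averaging over many independent repetitions and compares it with $\sigma^2/n$:

import numpy as np

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
mu, sigma = 0.5, 1.0             # illustrative true mean and standard deviation
reps = 10_000                    # independent repetitions used to estimate the MSE

for n in [10, 100, 1_000]:
    # Each row holds n i.i.d. samples, giving one sample mean per repetition.
    samples = rng.normal(mu, sigma, size=(reps, n))
    mu_hat = samples.mean(axis=1)
    mse = np.mean((mu_hat - mu) ** 2)   # empirical E[(mu_hat - mu)^2]
    print(f"n={n:5d}  empirical MSE = {mse:.5f}  sigma^2/n = {sigma**2 / n:.5f}")

Up to simulation noise, the empirical column tracks $\sigma^2 / n$, as Eq. (5.1) predicts.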
While the expected squared error is important, it does not tell us very much about the
distribution of the error. To do this we usually analyze the probability that $\hat{\mu}$
overestimates or underestimates $\mu$ by more than some value $\varepsilon > 0$. Precisely, how