Bandit Algorithms

This material will be published by Cambridge University Press as Bandit
Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version
is free to view and download for personal use only. Not for re-distribution, sale,
or use in derivative works. © Tor Lattimore and Csaba Szepesvari 2017.
The latest version is available at http://banditalgs.com. Feedback on any
aspect is very welcome: [email protected]

10 The Upper Confidence Bound Algorithm: Bernoulli Noise


In previous chapters we assumed that the noise of the rewards was σ-subgaussian
for some known σ > 0. This has the advantage of simplicity and relative generality,
but stronger assumptions are sometimes justified and often lead to stronger results.
In this chapter the rewards are assumed to be Bernoulli, which just means that
X_t ∈ {0, 1}. This is a fundamental setting found in many applications. For
example, in click-through prediction the user either clicks on the link or not. A
Bernoulli bandit is characterized by the mean payoff vector μ ∈ [0, 1]^k, and the
reward observed in round t is X_t ∼ B(μ_{A_t}).
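As a minimal sketch of this reward model (the function and variable names here are illustrative, not from the text), sampling a reward from a Bernoulli bandit can be written as:

```python
import random

def pull(mu, arm, rng=random):
    """Sample a Bernoulli reward X_t ~ B(mu[arm]) for the chosen arm.

    `mu` plays the role of the mean payoff vector in [0, 1]^k.
    """
    return 1 if rng.random() < mu[arm] else 0

# Example: a 3-armed Bernoulli bandit with means (0.3, 0.5, 0.7).
mu = [0.3, 0.5, 0.7]
rewards = [pull(mu, 2) for _ in range(10)]
# Every observed reward is either 0 or 1.
assert all(x in (0, 1) for x in rewards)
```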
The Bernoulli distribution is 1/2-subgaussian regardless of its mean
(Exercise 5.12). Hence the results of the previous chapters are applicable and an
appropriately tuned UCB enjoys logarithmic regret. The additional knowledge
that the rewards are Bernoulli is not being fully exploited by these algorithms,
however. The reason is essentially that the variance of a Bernoulli random
variable depends on its mean, and when the variance is small the empirical mean
concentrates faster, a fact that should be used to make the confidence intervals
smaller.

10.1 Concentration for sums of Bernoulli random variables


The first step when designing a new optimistic algorithm is to construct confidence
sets for the unknown parameters. For Bernoulli bandits this corresponds to
analyzing the concentration of the empirical mean for sums of Bernoulli random
variables.

Definition 10.1 (Relative entropy between Bernoulli distributions). The
relative entropy between Bernoulli distributions with parameters p, q ∈ [0, 1] is

d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)),

where singularities are defined by taking limits: d(0, q) = log(1/(1 − q)) and
d(1, q) = log(1/q) for q ∈ [0, 1], and d(p, 0) = 0 if p = 0 and ∞ otherwise, and
d(p, 1) = 0 if p = 1 and ∞ otherwise.
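Definition 10.1, including the limit conventions at the singularities, can be translated directly into a small function (a sketch; the name `d` mirrors the notation above):

```python
import math

def d(p, q):
    """Relative entropy between Bernoulli(p) and Bernoulli(q),
    with singularities defined by taking limits as in Definition 10.1."""
    if q == 0:
        return 0.0 if p == 0 else math.inf  # d(p, 0) = 0 iff p = 0
    if q == 1:
        return 0.0 if p == 1 else math.inf  # d(p, 1) = 0 iff p = 1
    if p == 0:
        return math.log(1 / (1 - q))        # limit of the first term is 0
    if p == 1:
        return math.log(1 / q)              # limit of the second term is 0
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```

Note that d(p, q) = 0 exactly when p = q, and d(p, q) grows as the two means move apart, which is what makes it useful for building confidence intervals.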