Bandit Algorithms

10.1 Concentration for sums of Bernoulli random variables

Proof We will again use the Cramér-Chernoff method. Let $\lambda > 0$ be some
constant to be chosen later. Then

\[
\begin{aligned}
P(\hat\mu \ge \mu + \varepsilon)
&= P\left(\exp\left(\lambda \sum_{t=1}^n (X_t - \mu)\right) \ge \exp(\lambda n \varepsilon)\right) \\
&\le \frac{E\left[\exp\left(\lambda \sum_{t=1}^n (X_t - \mu)\right)\right]}{\exp(\lambda n \varepsilon)} \\
&= \left(\mu \exp(\lambda(1 - \mu - \varepsilon)) + (1 - \mu)\exp(-\lambda(\mu + \varepsilon))\right)^n.
\end{aligned}
\]

This expression is minimized by $\lambda = \log\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)$. Therefore

\[
\begin{aligned}
P(\hat\mu \ge \mu + \varepsilon)
&\le \left(\mu \left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{1-\mu-\varepsilon}
+ (1-\mu)\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{-\mu-\varepsilon}\right)^n \\
&= \left(\frac{\mu}{\mu+\varepsilon}\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{1-\mu-\varepsilon}\right)^n \\
&= \exp\left(-n\, d(\mu+\varepsilon, \mu)\right).
\end{aligned}
\]

The bound on the left tail is proven identically.
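
The algebra above is easy to check numerically. The following is a small Python sanity check, assuming $d(p,q)$ denotes the relative entropy between Bernoulli distributions with means $p$ and $q$ (as in Lemma 10.3); the particular values of $\mu$, $\varepsilon$ and $n$ are arbitrary illustrative choices.

import math

def d(p, q):
    # Relative entropy between Bernoulli(p) and Bernoulli(q).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chernoff_expression(lam, mu, eps, n):
    # (mu * exp(lam*(1-mu-eps)) + (1-mu) * exp(-lam*(mu+eps)))^n
    return (mu * math.exp(lam * (1 - mu - eps))
            + (1 - mu) * math.exp(-lam * (mu + eps))) ** n

mu, eps, n = 0.3, 0.1, 50  # arbitrary illustrative values
lam_star = math.log((mu + eps) * (1 - mu) / (mu * (1 - mu - eps)))

print(chernoff_expression(lam_star, mu, eps, n))  # equals exp(-n * d(mu+eps, mu))
print(math.exp(-n * d(mu + eps, mu)))
# Perturbing lambda away from lam_star can only increase the bound:
print(chernoff_expression(lam_star - 0.1, mu, eps, n))
print(chernoff_expression(lam_star + 0.1, mu, eps, n))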

Using Pinsker's inequality, it follows that $P(\hat\mu \ge \mu + \varepsilon), P(\hat\mu \le \mu - \varepsilon) \le \exp(-2n\varepsilon^2)$, which is the same as what can be obtained from Hoeffding's lemma (see (5.8)). Solving $\exp(-2n\varepsilon^2) = \delta$ for $\varepsilon$ gives $\varepsilon = \sqrt{\log(1/\delta)/(2n)}$, and so we recover the usual $1-\delta$ confidence upper bound. In fact, this cannot be improved when $\mu \approx 1/2$, but the Chernoff bound is much stronger when $\mu$ is close to either zero or one. Can we invert the Chernoff tail bound to get confidence intervals that tighten automatically as $\mu$ (or $\hat\mu$) approaches zero or one? Corollary 10.4 below shows how to do this.
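
To make the comparison concrete, here is a minimal sketch contrasting the Hoeffding-style bound $\exp(-2n\varepsilon^2)$ with the Chernoff bound $\exp(-n\, d(\mu+\varepsilon, \mu))$; the values of $n$, $\varepsilon$ and $\mu$ are arbitrary illustrative choices. The two bounds nearly coincide at $\mu = 1/2$, while the Chernoff bound is orders of magnitude smaller for small $\mu$.

import math

def d(p, q):
    # Relative entropy between Bernoulli(p) and Bernoulli(q).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

n, eps = 100, 0.05  # arbitrary illustrative values
for mu in (0.5, 0.1, 0.01):
    hoeffding = math.exp(-2 * n * eps ** 2)
    chernoff = math.exp(-n * d(mu + eps, mu))
    print(f"mu = {mu}: Hoeffding {hoeffding:.3g}, Chernoff {chernoff:.3g}")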

Corollary 10.4. Let $\mu$, $\hat\mu$ and $n$ be as above. Then, for any $a \ge 0$,
\begin{align*}
&P(d(\hat\mu, \mu) \ge a,\ \hat\mu \le \mu) \le \exp(-na), \tag{10.3}\\
\text{and}\quad &P(d(\hat\mu, \mu) \ge a,\ \hat\mu \ge \mu) \le \exp(-na). \tag{10.4}
\end{align*}

Furthermore, define
\begin{align*}
U(a) &= \max\{u \in [0,1] : d(\hat\mu, u) \le a\} \\
\text{and}\quad L(a) &= \min\{u \in [0,1] : d(\hat\mu, u) \le a\}.
\end{align*}
Then $P(\mu \ge U(a)) \le \exp(-na)$ and $P(\mu \le L(a)) \le \exp(-na)$.
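
Since $d(\hat\mu, \cdot)$ is decreasing on $[0, \hat\mu]$ and increasing on $[\hat\mu, 1]$, the quantities $U(a)$ and $L(a)$ generally have no closed form but are easy to compute by bisection. The following Python sketch is one way to do this; the function names (kl_upper, kl_lower) and the sample values of $n$, $\delta$ and $\hat\mu$ are illustrative choices, not notation from the text, with $a = \log(1/\delta)/n$ so that the corollary yields $1-\delta$ confidence bounds.

import math

def d(p, q):
    # Relative entropy between Bernoulli(p) and Bernoulli(q),
    # with the convention 0 * log(0) = 0.
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def kl_upper(mu_hat, a, tol=1e-10):
    # U(a) = max{u in [0, 1] : d(mu_hat, u) <= a}; since d(mu_hat, .) is
    # increasing on [mu_hat, 1], bisection on that interval finds it.
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if d(mu_hat, mid) <= a:
            lo = mid
        else:
            hi = mid
    return lo

def kl_lower(mu_hat, a, tol=1e-10):
    # L(a) = min{u in [0, 1] : d(mu_hat, u) <= a}; d(mu_hat, .) is
    # decreasing on [0, mu_hat].
    lo, hi = 0.0, mu_hat
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if d(mu_hat, mid) <= a:
            hi = mid
        else:
            lo = mid
    return hi

# Illustrative values: with a = log(1/delta) / n the corollary gives
# P(mu >= U(a)) <= delta and P(mu <= L(a)) <= delta.
n, delta, mu_hat = 100, 0.05, 0.1
a = math.log(1 / delta) / n
print(kl_lower(mu_hat, a), kl_upper(mu_hat, a))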


Proof First, we prove (10.3). Note that $d(\cdot, \mu)$ is decreasing on $[0, \mu]$, and thus, for $0 \le a \le d(0, \mu)$, $\{d(\hat\mu, \mu) \ge a,\ \hat\mu \le \mu\} = \{\hat\mu \le \mu - x,\ \hat\mu \le \mu\} = \{\hat\mu \le \mu - x\}$, where $x$ is the unique solution to $d(\mu - x, \mu) = a$ on $[0, \mu]$. Hence, by Eq. (10.2) of Lemma 10.3, $P(d(\hat\mu, \mu) \ge a,\ \hat\mu \le \mu) \le \exp(-na)$. When $a \ge d(0, \mu)$, the inequality trivially holds. The proof of (10.4) is entirely analogous and hence is omitted. For the second part of the corollary, fix $a$ and let $U = U(a)$.