Bandit Algorithms

10.1 Concentration for sums of Bernoulli random variables

Proof We will again use the Cramér-Chernoff method. Let $\lambda > 0$ be some
constant to be chosen later. Then

\[
\begin{aligned}
P(\hat\mu \ge \mu + \varepsilon)
&= P\left(\exp\left(\lambda \sum_{t=1}^n (X_t - \mu)\right) \ge \exp(\lambda n \varepsilon)\right) \\
&\le \frac{E\left[\exp\left(\lambda \sum_{t=1}^n (X_t - \mu)\right)\right]}{\exp(\lambda n \varepsilon)} \\
&= \left(\mu \exp(\lambda(1 - \mu - \varepsilon)) + (1 - \mu)\exp(-\lambda(\mu + \varepsilon))\right)^n.
\end{aligned}
\]

This expression is minimized by $\lambda = \log\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)$. Therefore

\[
\begin{aligned}
P(\hat\mu \ge \mu + \varepsilon)
&\le \left(\mu \left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{1-\mu-\varepsilon}
+ (1-\mu)\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{-\mu-\varepsilon}\right)^n \\
&= \left(\frac{\mu}{\mu+\varepsilon}\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{1-\mu-\varepsilon}\right)^n \\
&= \exp\left(-n\, d(\mu+\varepsilon, \mu)\right).
\end{aligned}
\]

The bound on the left tail is proven identically.
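
The algebra above is easy to check numerically. The following is a small Python sanity check, assuming $d(p,q)$ denotes the relative entropy between Bernoulli distributions with means $p$ and $q$ (as in Lemma 10.3); the particular values of $\mu$, $\varepsilon$ and $n$ are arbitrary illustrative choices.

import math

def d(p, q):
    # Relative entropy between Bernoulli(p) and Bernoulli(q).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chernoff_expression(lam, mu, eps, n):
    # (mu * exp(lam*(1-mu-eps)) + (1-mu) * exp(-lam*(mu+eps)))^n
    return (mu * math.exp(lam * (1 - mu - eps))
            + (1 - mu) * math.exp(-lam * (mu + eps))) ** n

mu, eps, n = 0.3, 0.1, 50  # arbitrary illustrative values
lam_star = math.log((mu + eps) * (1 - mu) / (mu * (1 - mu - eps)))

print(chernoff_expression(lam_star, mu, eps, n))  # equals exp(-n * d(mu+eps, mu))
print(math.exp(-n * d(mu + eps, mu)))
# Perturbing lambda away from lam_star can only increase the bound:
print(chernoff_expression(lam_star - 0.1, mu, eps, n))
print(chernoff_expression(lam_star + 0.1, mu, eps, n))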

Using Pinsker's inequality, it follows that $P(\hat\mu \ge \mu + \varepsilon), P(\hat\mu \le \mu - \varepsilon) \le \exp(-2n\varepsilon^2)$, which is the same as what can be obtained from Hoeffding's lemma (see (5.8)). Solving $\exp(-2n\varepsilon^2) = \delta$ for $\varepsilon$ gives $\varepsilon = \sqrt{\log(1/\delta)/(2n)}$, and so we recover the usual $1-\delta$ confidence upper bound. In fact, this cannot be improved when $\mu \approx 1/2$, but the Chernoff bound is much stronger when $\mu$ is close to either zero or one. Can we invert the Chernoff tail bound to get confidence intervals that tighten automatically as $\mu$ (or $\hat\mu$) approaches zero or one? Corollary 10.4 below shows how to do this.
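
To make the comparison concrete, here is a minimal sketch contrasting the Hoeffding-style bound $\exp(-2n\varepsilon^2)$ with the Chernoff bound $\exp(-n\, d(\mu+\varepsilon, \mu))$; the values of $n$, $\varepsilon$ and $\mu$ are arbitrary illustrative choices. The two bounds nearly coincide at $\mu = 1/2$, while the Chernoff bound is orders of magnitude smaller for small $\mu$.

import math

def d(p, q):
    # Relative entropy between Bernoulli(p) and Bernoulli(q).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

n, eps = 100, 0.05  # arbitrary illustrative values
for mu in (0.5, 0.1, 0.01):
    hoeffding = math.exp(-2 * n * eps ** 2)
    chernoff = math.exp(-n * d(mu + eps, mu))
    print(f"mu = {mu}: Hoeffding {hoeffding:.3g}, Chernoff {chernoff:.3g}")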

Corollary 10.4. Let $\mu$, $\hat\mu$ and $n$ be as above. Then, for any $a \ge 0$,
\begin{align*}
&P(d(\hat\mu, \mu) \ge a,\ \hat\mu \le \mu) \le \exp(-na), \tag{10.3}\\
\text{and}\quad &P(d(\hat\mu, \mu) \ge a,\ \hat\mu \ge \mu) \le \exp(-na). \tag{10.4}
\end{align*}

Furthermore, define
\begin{align*}
U(a) &= \max\{u \in [0,1] : d(\hat\mu, u) \le a\} \\
\text{and}\quad L(a) &= \min\{u \in [0,1] : d(\hat\mu, u) \le a\}.
\end{align*}
Then $P(\mu \ge U(a)) \le \exp(-na)$ and $P(\mu \le L(a)) \le \exp(-na)$.
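
Since $d(\hat\mu, \cdot)$ is decreasing on $[0, \hat\mu]$ and increasing on $[\hat\mu, 1]$, the quantities $U(a)$ and $L(a)$ generally have no closed form but are easy to compute by bisection. The following Python sketch is one way to do this; the function names (kl_upper, kl_lower) and the sample values of $n$, $\delta$ and $\hat\mu$ are illustrative choices, not notation from the text, with $a = \log(1/\delta)/n$ so that the corollary yields $1-\delta$ confidence bounds.

import math

def d(p, q):
    # Relative entropy between Bernoulli(p) and Bernoulli(q),
    # with the convention 0 * log(0) = 0.
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def kl_upper(mu_hat, a, tol=1e-10):
    # U(a) = max{u in [0, 1] : d(mu_hat, u) <= a}; since d(mu_hat, .) is
    # increasing on [mu_hat, 1], bisection on that interval finds it.
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if d(mu_hat, mid) <= a:
            lo = mid
        else:
            hi = mid
    return lo

def kl_lower(mu_hat, a, tol=1e-10):
    # L(a) = min{u in [0, 1] : d(mu_hat, u) <= a}; d(mu_hat, .) is
    # decreasing on [0, mu_hat].
    lo, hi = 0.0, mu_hat
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if d(mu_hat, mid) <= a:
            hi = mid
        else:
            lo = mid
    return hi

# Illustrative values: with a = log(1/delta) / n the corollary gives
# P(mu >= U(a)) <= delta and P(mu <= L(a)) <= delta.
n, delta, mu_hat = 100, 0.05, 0.1
a = math.log(1 / delta) / n
print(kl_lower(mu_hat, a), kl_upper(mu_hat, a))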


Proof First, we prove (10.3). Note that $d(\cdot, \mu)$ is decreasing on $[0, \mu]$, and thus, for $0 \le a \le d(0, \mu)$, $\{d(\hat\mu, \mu) \ge a,\ \hat\mu \le \mu\} = \{\hat\mu \le \mu - x,\ \hat\mu \le \mu\} = \{\hat\mu \le \mu - x\}$, where $x$ is the unique solution to $d(\mu - x, \mu) = a$ on $[0, \mu]$. Hence, by Eq. (10.2) of Lemma 10.3, $P(d(\hat\mu, \mu) \ge a,\ \hat\mu \le \mu) \le \exp(-na)$. When $a \ge d(0, \mu)$, the inequality trivially holds. The proof of (10.4) is entirely analogous and hence is omitted. For the second part of the corollary, fix $a$ and let $U = U(a)$.