9.1 The MOSS algorithm

Before the proof we state and prove a strengthened version of Corollary 5.5.

Theorem 9.2. Let $X_1, X_2, \ldots, X_n$ be a sequence of independent $\sigma$-subgaussian random variables and $S_t = \sum_{s=1}^t X_s$. Then for any $\varepsilon > 0$,
\[
P(\exists\, t \le n : S_t \ge \varepsilon) \le \exp\left(-\frac{\varepsilon^2}{2n\sigma^2}\right). \tag{9.1}
\]


The bound in Eq. (9.1) is the same as the bound on $P(S_n \ge \varepsilon)$ that appears in a simple reformulation of Corollary 5.5, so this new result is strictly stronger.

Proof  From the definition of subgaussian random variables and Lemma 5.4,
\[
E[\exp(\lambda S_n)] \le \exp\left(\frac{n\sigma^2\lambda^2}{2}\right).
\]
Then choosing $\lambda = \varepsilon/(n\sigma^2)$ leads to


\[
P(\exists\, t \le n : S_t \ge \varepsilon)
= P\left(\max_{t \le n} \exp(\lambda S_t) \ge \exp(\lambda\varepsilon)\right)
\le \frac{E[\exp(\lambda S_n)]}{\exp(\lambda\varepsilon)}
\le \exp\left(\frac{n\sigma^2\lambda^2}{2} - \lambda\varepsilon\right)
= \exp\left(-\frac{\varepsilon^2}{2n\sigma^2}\right).
\]


The novel step is the first inequality, which follows from Doob's submartingale inequality (Theorem 3.10) and the fact that $\exp(\lambda S_t)$ is a submartingale with respect to the filtration generated by $X_1, X_2, \ldots, X_n$ (Exercise 9.1).
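
As a quick numerical sanity check of Eq. (9.1), the following minimal Python sketch estimates the left-hand side by simulation and compares it with the right-hand side. It is not part of the book; the horizon $n$, the threshold $\varepsilon$, the number of repetitions and the use of standard Gaussian increments (which are 1-subgaussian) are all illustrative choices.

```python
import numpy as np

# Minimal Monte Carlo check of the maximal inequality in Eq. (9.1).
# All parameter values below are illustrative assumptions.
rng = np.random.default_rng(0)

n = 100          # horizon
sigma = 1.0      # standard Gaussians are sigma-subgaussian with sigma = 1
eps = 15.0       # threshold epsilon
reps = 50_000    # Monte Carlo repetitions

X = rng.normal(0.0, sigma, size=(reps, n))
S = np.cumsum(X, axis=1)                     # partial sums S_1, ..., S_n
hit = (S.max(axis=1) >= eps).mean()          # estimate of P(exists t <= n : S_t >= eps)

bound = np.exp(-eps**2 / (2 * n * sigma**2))
print(f"empirical probability ~ {hit:.4f}")
print(f"bound from Eq. (9.1)  = {bound:.4f}")
```

With these values the empirical frequency lands well below the bound, as it should: Eq. (9.1) is only an upper bound on the probability that the maximum ever crosses $\varepsilon$.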


Before the proof of Theorem 9.1 we need one more lemma to bound the
probability that the index of the optimal arm ever drops too far below the actual
mean of the optimal arm. The proof of this lemma relies on a tool called the
peeling device, which is an important technique in probability theory and has
many applications beyond bandits. For example, it can be used to prove the law
of the iterated logarithm.
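
Very roughly, and only as an illustration of the shape of a peeling argument (this is not the proof of Lemma 9.3): suppose one wants to bound $P(\exists\, s \ge 1 : S_s \ge f(s))$, where $S_s$ is a sum of independent $\sigma$-subgaussian variables and $f$ is assumed non-decreasing purely for this example. Cutting time into geometrically growing blocks and applying Theorem 9.2 on each block with its own horizon gives
\begin{align*}
P(\exists\, s \ge 1 : S_s \ge f(s))
  &\le \sum_{k=0}^{\infty} P\bigl(\exists\, s \in [2^k, 2^{k+1}) : S_s \ge f(2^k)\bigr) \\
  &\le \sum_{k=0}^{\infty} P\bigl(\exists\, s \le 2^{k+1} : S_s \ge f(2^k)\bigr)
   \le \sum_{k=0}^{\infty} \exp\left(-\frac{f(2^k)^2}{2^{k+2}\sigma^2}\right).
\end{align*}
Each block only pays for its own horizon $2^{k+1}$ rather than for all of time, and choosing the threshold so that the resulting series sums to the desired quantity is where the work lies; the proof of Lemma 9.3 applies this device to the specific threshold appearing in its statement.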

Lemma 9.3. Let $\delta \in (0,1)$ and $X_1, X_2, \ldots$ be independent and 1-subgaussian and $\hat\mu_t = \frac{1}{t}\sum_{s=1}^t X_s$. Then for any $\Delta > 0$,
\[
P\left(\exists\, s \ge 1 : \hat\mu_s + \sqrt{\frac{4}{s}\log^+\!\left(\frac{1}{s\delta\Delta^2}\right)} + \Delta \le 0\right) \le \frac{15\delta}{\Delta^2}.
\]
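
As with Theorem 9.2, the lemma can be probed numerically. The sketch below is only an approximation and not part of the book: it truncates the event over $s \ge 1$ at a finite horizon, uses standard Gaussian samples as the 1-subgaussian variables, and fixes $\delta$, $\Delta$ and the horizon to arbitrary illustrative values.

```python
import numpy as np

# Approximate Monte Carlo check of Lemma 9.3.
# Horizon truncation and all parameter values are illustrative assumptions.
rng = np.random.default_rng(1)

delta = 0.01      # delta in (0, 1)
Delta = 1.0       # Delta > 0
horizon = 1_000   # truncation of "exists s >= 1"
reps = 10_000

def log_plus(x):
    """log^+(x) = max(log(x), 0)."""
    return np.maximum(np.log(x), 0.0)

s = np.arange(1, horizon + 1)
width = np.sqrt(4.0 / s * log_plus(1.0 / (s * delta * Delta**2)))

X = rng.normal(0.0, 1.0, size=(reps, horizon))   # Gaussians are 1-subgaussian
mu_hat = np.cumsum(X, axis=1) / s                 # running empirical means

# Event: some s <= horizon with mu_hat_s + width_s + Delta <= 0.
freq = ((mu_hat + width + Delta) <= 0.0).any(axis=1).mean()

print(f"empirical probability (truncated) ~ {freq:.4f}")
print(f"bound 15 * delta / Delta^2         = {15 * delta / Delta**2:.4f}")
```

For Gaussian samples and these parameters the empirical frequency comes out far below $15\delta/\Delta^2$: the bound is not tight in this regime, and the truncation can only make the empirical estimate smaller still.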