Bandit Algorithms

20.1 Martingales and the method of mixtures 241

the rewards are Bernoulli. We have now dropped the assumption that (At)∞t=1 are fixed in advance.

The assumption thatλ >0 ensures thatVt(λ) is invertible and allows us to relax the requirement that the actions spanRd. Notice also that in this section we allow the interaction sequence to be infinitely long.

Since we want exponentially decaying tail probabilities one is tempted to try the Cramer-Chernoff method:

P

(

‖ˆθt−θ∗‖Vt≥u

)

≤α>inf 0 E

[

exp

(

α‖θˆt−θ∗‖Vt−αu

)]

.

Sadly, we do not know how to bound this expectation. Can we still somehow use the Cramer-Chernoff method? LetSt=

∑t s=1ηsAsand notice that

1 2 ‖ θˆt−θ∗‖^2 Vt= max x∈Rd

(

〈x,St〉−

1

2 ‖x‖

2 Vt

)

.

The next lemma shows that the exponential of the term inside the maximum is a
supermartingale.

Lemma20.2. For allx∈Rdthe processMt(x) =exp(〈x,St〉−^12 ‖x‖^2 Vt)is a F-adapted supermartingale withM 0 (x) = 1.

Proof of Lemma 20.2 ThatMt(x) isFt-measurable for alltis immediate from
the definition. We need to show thatE[Mt(x)|Ft− 1 ]≤Mt− 1 (x) almost surely.
The fact that (ηt) is conditionally 1-subgaussian means that

E[exp (ηt〈x,At〉)|Ft− 1 ]≤exp

(

〈x,At〉^2 2

)

= exp

(

‖x‖^2 AtA> t 2

)

a.s.

Hence

E[Mt(x)|Ft− 1 ] =E

[

exp

(

〈x,St〉−^1 2

‖x‖^2 Vt

)∣∣

∣∣Ft− 1

]

=Mt− 1 (x)E

[

exp

(

ηt〈x,At〉−

1

2

‖x‖^2 AtA>t

)∣∣

∣∣Ft− 1

]

≤Mt− 1 (x) a.s.

Finally note thatM 0 (x) = 1 is immediate.

Combining the lemma and the linearization idea almost works. The Cramer-

Bandit Algorithms

P

(

)

[

(

)]

.

(

1

)

.

(

)

(

)

[

(

)∣∣

]

[

(

1

2

)∣∣

]

Get our desktop app

Company

Features

Documentation

Resources