Bandit Algorithms

(Jeff_L) #1
20.1 Martingales and the method of mixtures 241

the rewards are Bernoulli. We have now dropped the assumption that (At)∞t=1
are fixed in advance.

The assumption thatλ >0 ensures thatVt(λ) is invertible and allows us
to relax the requirement that the actions spanRd. Notice also that in this
section we allow the interaction sequence to be infinitely long.

Since we want exponentially decaying tail probabilities one is tempted to try
the Cramer-Chernoff method:

P


(


‖ˆθt−θ∗‖Vt≥u

)


≤α>inf 0 E

[


exp

(


α‖θˆt−θ∗‖Vt−αu

)]


.


Sadly, we do not know how to bound this expectation. Can we still somehow use
the Cramer-Chernoff method? LetSt=

∑t
s=1ηsAsand notice that

1
2 ‖
θˆt−θ∗‖^2 Vt= max
x∈Rd

(


〈x,St〉−

1


2 ‖x‖

2
Vt

)


.


The next lemma shows that the exponential of the term inside the maximum is a
supermartingale.


Lemma20.2. For allx∈Rdthe processMt(x) =exp(〈x,St〉−^12 ‖x‖^2 Vt)is a
F-adapted supermartingale withM 0 (x) = 1.

Proof of Lemma 20.2 ThatMt(x) isFt-measurable for alltis immediate from
the definition. We need to show thatE[Mt(x)|Ft− 1 ]≤Mt− 1 (x) almost surely.
The fact that (ηt) is conditionally 1-subgaussian means that


E[exp (ηt〈x,At〉)|Ft− 1 ]≤exp

(


〈x,At〉^2
2

)


= exp

(


‖x‖^2 AtA>
t
2

)


a.s.

Hence

E[Mt(x)|Ft− 1 ] =E

[


exp

(


〈x,St〉−^1
2

‖x‖^2 Vt

)∣∣


∣∣Ft− 1

]


=Mt− 1 (x)E

[


exp

(


ηt〈x,At〉−

1


2


‖x‖^2 AtA>t

)∣∣


∣∣Ft− 1

]


≤Mt− 1 (x) a.s.

Finally note thatM 0 (x) = 1 is immediate.

Combining the lemma and the linearization idea almost works. The Cramer-
Free download pdf