
and $A_t$ are conditionally independent given $A^*$. Hence,
\[
\begin{aligned}
I_{t-1}(A^*; \mathcal F_t)
&= \mathbb E_{t-1}\!\left[\tilde{\mathbb E}_{t-1}\!\left[\left.\log
   \frac{\tilde{\mathbb P}_{t-1}(A^* = a \,|\, X_{t A_t})}{\tilde{\mathbb P}_{t-1}(A^* = a)}
   \right|_{a = A^*}\right]\right] \\
&= \mathbb E_{t-1}\!\left[I_{\tilde{\mathbb P}_{t-1}}(A^*; X_{t A_t})\right] \\
&= \mathbb E_{t-1}\!\left[\mathrm D\!\left(\tilde{\mathbb P}_{t-1}(X_{t A_t} \in \cdot \,|\, A^*),\,
   \tilde{\mathbb P}_{t-1}(X_{t A_t} \in \cdot)\right)\right]
   && \text{(Lemma 36.6)} \\
&\ge 2\, \mathbb E_{t-1}\!\left[\left(\tilde{\mathbb E}_{t-1}[X_{t A_t} \,|\, A^*]
   - \tilde{\mathbb E}_{t-1}[X_{t A_t}]\right)^{2}\right] \\
&= 2 \sum_{i,j=1}^{k} \mathbb P_{t-1}(A_t = i, A^* = j)
   \left(\mathbb E_{t-1}[X_{t i} \,|\, A^* = j] - \mathbb E_{t-1}[X_{t i}]\right)^{2} \\
&\ge 2 \sum_{i=1}^{k} \mathbb P_{t-1}(A_t = i)^{2}
   \left(\mathbb E_{t-1}[X_{t i} \,|\, A^* = i] - \mathbb E_{t-1}[X_{t i}]\right)^{2} \\
&\ge \frac{2}{k} \left(\sum_{i=1}^{k} \mathbb P_{t-1}(A_t = i)
   \left(\mathbb E_{t-1}[X_{t i} \,|\, A^* = i] - \mathbb E_{t-1}[X_{t i}]\right)\right)^{2} \\
&= \frac{2}{k}\, \mathbb E_{t-1}[\Delta_t]^{2}\,,
\end{aligned}
\]

where the first inequality follows from Pinsker's inequality (Eq. 14.11), the result
in Exercise 14.4 and the assumption that the rewards lie in $[0,1]$. The second
inequality follows by dropping cross terms and using again the independence of
$A^*$ and $A_t$ given $\mathcal F_{t-1}$, while the third is by Cauchy-Schwarz. The result follows
by rearranging the above display.
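To spell out the step behind the first inequality (a sketch; here $\delta(P, Q)$ denotes the total variation distance): for any probability measures $P, Q$ and any random variable $X$ taking values in $[0,1]$,
\[
\bigl|\mathbb E_P[X] - \mathbb E_Q[X]\bigr| \le \delta(P, Q) \le \sqrt{\tfrac{1}{2}\,\mathrm D(P, Q)}\,,
\qquad\text{and hence}\qquad
\mathrm D(P, Q) \ge 2\bigl(\mathbb E_P[X] - \mathbb E_Q[X]\bigr)^{2}\,,
\]
where the second inequality is Pinsker's (Eq. 14.11). Applying this with $P = \tilde{\mathbb P}_{t-1}(X_{t A_t} \in \cdot \,|\, A^*)$ and $Q = \tilde{\mathbb P}_{t-1}(X_{t A_t} \in \cdot)$ gives the bound used above.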
Proof of Theorem 36.7 The result follows by combining Lemma 36.9 and
Theorem 36.8.
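As a quick numerical illustration of the bound just proved, the following snippet is a minimal sketch (the two candidate environments, the Bernoulli rewards and the uniform posterior are made-up numbers, not from the book): it computes the per-round Bayesian regret $\mathbb E_{t-1}[\Delta_t]$ and the information gain for a two-armed example under Thompson sampling and checks that the information ratio is at most $k/2$.

```python
import numpy as np


def bernoulli_kl(p, q):
    """Relative entropy D(Bernoulli(p), Bernoulli(q))."""
    eps = 1e-12
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))


# Hypothetical two-environment example: in environment j, arm j is optimal.
post = np.array([0.5, 0.5])      # posterior P_{t-1}(A* = j)
mu = np.array([[0.8, 0.3],       # arm means in environment 0
               [0.4, 0.9]])      # arm means in environment 1
k = mu.shape[1]

# Thompson sampling: P_{t-1}(A_t = i) = P_{t-1}(A* = i).
p_play = post.copy()

# Per-round Bayesian regret Delta_t = E_{t-1}[X_{t,A*} - X_{t,A_t}].
mean_optimal = float(np.sum(post * mu[np.arange(k), np.arange(k)]))
mu_bar = post @ mu               # marginal posterior mean of each arm
mean_played = float(np.sum(p_play * mu_bar))
delta = mean_optimal - mean_played

# Information gain about A*: A_t is sampled independently of A* given the
# history, so I(A*; A_t, X_{t,A_t}) = sum_i P(A_t = i) I(A*; X_{t,i}), with
# I(A*; X_{t,i}) = sum_j P(A* = j) D(Bernoulli(mu[j, i]), Bernoulli(mu_bar[i])).
info = sum(p_play[i] * sum(post[j] * bernoulli_kl(mu[j, i], mu_bar[i])
                           for j in range(k))
           for i in range(k))

print(f"E[Delta_t]        = {delta:.4f}")
print(f"information gain  = {info:.4f}")
print(f"information ratio = {delta**2 / info:.4f}  (Lemma 36.9 bound: k/2 = {k / 2})")
```

With the numbers above the ratio comes out around 0.43, comfortably below $k/2 = 1$, as Lemma 36.9 guarantees.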

36.5 Notes


1 There are several equivalent ways to view Thompson sampling in stationary
stochastic multi-armed bandits: (a) select an arm according to the posterior
probability that the arm is optimal; (b) sample an environment from the
posterior and play the optimal action in that environment. When the mean
rewards of the arms are independent under the posterior, a third equivalent
view is (c): sample a mean reward for each arm and choose the arm with the
largest sample (Exercise 36.1). The algorithms in this chapter are based on (b), but all
are equivalent and simply correspond to sampling from different pushforward
measures of the posterior. Historically it seems that Thompson [1933] had the
form in (a) in mind, but there are reasons to keep the alternative
views in mind. Though we are not aware of an example, in some instances beyond
finite-armed bandits it might be more computationally efficient to sample
from a pushforward of the posterior than from the posterior itself. Furthermore, in
more complicated situations like reinforcement learning it may be desirable to
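The equivalence of views (b) and (c) can be illustrated with a small sketch for a Bernoulli bandit with independent Beta priors (all numbers here are illustrative, not from the book): because the arm means are independent under the posterior, sampling one mean per arm and playing the argmax is the same as sampling a whole environment from the posterior and playing its optimal arm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit; the true means are unknown to the learner.
true_means = np.array([0.3, 0.5, 0.7])
k = len(true_means)
n_rounds = 1000

# Independent Beta(1, 1) priors on the mean reward of each arm.
alpha = np.ones(k)
beta = np.ones(k)

for t in range(n_rounds):
    # View (c): sample one mean per arm from the posterior and play the argmax.
    # With independent arms this coincides with view (b): sampling a whole
    # environment from the posterior and playing its optimal action.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))

    # Observe a Bernoulli reward and update the posterior of the played arm.
    reward = rng.binomial(1, true_means[arm])
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", np.round(alpha / (alpha + beta), 3))
```

View (a) would instead require the posterior probability that each arm is optimal, which has no convenient closed form for Beta posteriors; the sampling step in the loop realises exactly that distribution over arms without ever computing it.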