Bandit Algorithms


Let $\mathcal{F}_t = \sigma(A_1, X_1, \ldots, A_t, X_t)$ be the $\sigma$-algebra generated by the interaction sequence by the end of round $t$. Note that $U_t(i)$ is $\mathcal{F}_{t-1}$-measurable. The Bayesian regret is

\[
\mathrm{BR}_n
= \mathbb{E}\left[\sum_{t=1}^n \left(\mu_{A^*} - \mu_{A_t}\right)\right]
= \mathbb{E}\left[\sum_{t=1}^n \mathbb{E}\left[\mu_{A^*} - \mu_{A_t} \,\middle|\, \mathcal{F}_{t-1}\right]\right].
\]


The key insight (Exercise 36.3) is to notice that the definition of Thompson sampling implies that the conditional distributions of $A^*$ and $A_t$ given $\mathcal{F}_{t-1}$ are the same:
\[
\mathbb{P}(A^* = \cdot \mid \mathcal{F}_{t-1}) = \mathbb{P}(A_t = \cdot \mid \mathcal{F}_{t-1}) \quad \text{a.s.} \tag{36.1}
\]
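To see why (36.1) holds, recall that Thompson sampling draws $A_t$ from the posterior distribution of the optimal action. The following sketch estimates $\mathbb{P}(A_t = \cdot \mid \mathcal{F}_{t-1})$ by Monte Carlo; the Beta-Bernoulli model and all numbers are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior after some history: Beta(alpha_i, beta_i) per arm.
alpha = np.array([3.0, 2.0, 5.0])
beta = np.array([4.0, 2.0, 6.0])

# Thompson sampling plays A_t = argmax_i theta_i with theta drawn from the
# posterior, so P(A_t = i | F_{t-1}) is exactly the posterior probability
# that arm i is optimal, i.e. P(A* = i | F_{t-1}).
draws = rng.beta(alpha, beta, size=(100_000, 3))
p = np.bincount(draws.argmax(axis=1), minlength=3) / len(draws)
print(p)  # one vector estimating both sides of Eq. (36.1)
```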

Using the previous display, and noting that $U_t$ is $\mathcal{F}_{t-1}$-measurable so that (36.1) gives $\mathbb{E}[U_t(A_t) \mid \mathcal{F}_{t-1}] = \mathbb{E}[U_t(A^*) \mid \mathcal{F}_{t-1}]$,
\begin{align*}
\mathbb{E}[\mu_{A^*} - \mu_{A_t} \mid \mathcal{F}_{t-1}]
&= \mathbb{E}[\mu_{A^*} - U_t(A_t) + U_t(A_t) - \mu_{A_t} \mid \mathcal{F}_{t-1}] \\
&= \mathbb{E}[\mu_{A^*} - U_t(A^*) + U_t(A_t) - \mu_{A_t} \mid \mathcal{F}_{t-1}] && \text{(Eq.\ (36.1))} \\
&= \mathbb{E}[\mu_{A^*} - U_t(A^*) \mid \mathcal{F}_{t-1}] + \mathbb{E}[U_t(A_t) - \mu_{A_t} \mid \mathcal{F}_{t-1}].
\end{align*}

Using the tower rule for expectation shows that
\[
\mathrm{BR}_n = \mathbb{E}\left[\sum_{t=1}^n \left(\mu_{A^*} - U_t(A^*)\right) + \sum_{t=1}^n \left(U_t(A_t) - \mu_{A_t}\right)\right]. \tag{36.2}
\]


On the event $E^c$ the terms inside the expectation are bounded by $2n$, while on the event $E$ the first sum is negative and the second is bounded by
\begin{align*}
\mathbb{I}\{E\} \sum_{t=1}^n \left(U_t(A_t) - \mu_{A_t}\right)
&= \mathbb{I}\{E\} \sum_{t=1}^n \sum_{i=1}^k \mathbb{I}\{A_t = i\}\left(U_t(i) - \mu_i\right) \\
&\leq \sum_{i=1}^k \sum_{t=1}^n \mathbb{I}\{A_t = i\}\sqrt{\frac{8\log(1/\delta)}{1 \vee T_i(t-1)}} \\
&\leq \sum_{i=1}^k \int_0^{T_i(n)} \sqrt{\frac{8\log(1/\delta)}{s}}\, ds
= \sum_{i=1}^k \sqrt{32\, T_i(n)\log(1/\delta)}
\leq \sqrt{32\, nk\log(1/\delta)},
\end{align*}
where the last inequality is Cauchy-Schwarz combined with $\sum_{i=1}^k T_i(n) = n$.

The proof is completed by choosing $\delta = n^{-2}$ and using the fact that $\mathbb{P}(E^c) \leq nk\delta$.
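The sum-to-integral comparison in the display above can be sanity-checked numerically. The sketch below (an illustration, not part of the proof) compares $\sum_{u=1}^{T} u^{-1/2}$, which is what a single arm contributes to the inner sum over rounds, with $\int_0^{T} s^{-1/2}\, ds = 2\sqrt{T}$.

```python
import math

# Each play of arm i increments T_i through successive integers, so the sum
# over rounds of 1/sqrt(1 v T_i(t-1)) for that arm is dominated by the
# integral of s^{-1/2} on [0, T_i(n)], since u^{-1/2} <= int_{u-1}^{u} s^{-1/2} ds.
for T in [1, 10, 100, 10_000]:
    lhs = sum(1 / math.sqrt(u) for u in range(1, T + 1))
    rhs = 2 * math.sqrt(T)  # = integral_0^T s^{-1/2} ds
    print(f"T={T:6d}  sum={lhs:10.3f}  2*sqrt(T)={rhs:10.3f}")
```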


36.2 Frequentist analysis


Bounding the frequentist regret of Thompson sampling is more technical than bounding the Bayesian regret. The trouble is that the frequentist regret does not involve an expectation with respect to the prior, which means that $A_t$ is not conditionally distributed in the same way as the optimal action (which is not random). Thompson sampling can be viewed as an instantiation of follow the perturbed leader, which we already saw in action for adversarial combinatorial semibandits in Chapter 30. Here we work with the stochastic setting and consider the general form of the algorithm stated in Algorithm 23.
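For concreteness, here is a minimal sketch of the posterior-sampling loop that Thompson sampling implements. The Beta-Bernoulli model and all names and parameters below are illustrative assumptions, not a verbatim transcription of Algorithm 23.

```python
import numpy as np

def thompson_sampling(means, n, rng):
    """Run Thompson sampling for n rounds on a Bernoulli bandit (a sketch)."""
    k = len(means)
    successes = np.zeros(k)
    failures = np.zeros(k)
    regret = 0.0
    for _ in range(n):
        # Sample a mean vector from the Beta posterior (uniform prior).
        theta = rng.beta(successes + 1, failures + 1)
        a = int(theta.argmax())             # play the arm optimal for the sample
        x = float(rng.random() < means[a])  # observe a Bernoulli reward
        successes[a] += x
        failures[a] += 1 - x
        regret += max(means) - means[a]
    return regret

rng = np.random.default_rng(1)
print(thompson_sampling([0.3, 0.5, 0.7], n=10_000, rng=rng))
```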
