each arm once and subsequently
\[
A_t = \operatorname*{argmax}_{i \in [k]} \left( \hat\mu_i^\gamma(t-1) + \sqrt{\frac{\alpha}{T_i^\gamma(t-1)} \log\!\left(\sum_{i=1}^k T_i^\gamma(t-1)\right)} \right)\,.
\]
The idea is to ‘discount’ rewards that occurred far in the past, which makes
the algorithm most influenced by recent events. A related algorithm called
Sliding-Window UCB takes a similar approach, but rather than discounting past
rewards with a geometric discount function, it simply discards them altogether.
Let $\tau \in \mathbb{N}^+$ be a constant and define
\[
\hat\mu_i^\tau(t) = \frac{1}{T_i^\tau(t)} \sum_{s=t-\tau+1}^{t} \mathbb{I}\{A_s = i\}\, X_s
\qquad \text{and} \qquad
T_i^\tau(t) = \sum_{s=t-\tau+1}^{t} \mathbb{I}\{A_s = i\}\,.
\]
Then the Sliding-Window UCB chooses
\[
A_t = \operatorname*{argmax}_{i \in [k]} \left( \hat\mu_i^\tau(t-1) + \sqrt{\frac{\alpha}{T_i^\tau(t-1)} \log(t \wedge \tau)} \right)\,.
\]
It is known that if $\gamma$ or $\tau$ is tuned appropriately, then for Discounted UCB
the regret may be bounded by $O(\sqrt{nm}\log(n))$ and for Sliding-Window UCB
by $O(\sqrt{nm\log(n)})$. Neither bound improves on what is available using Exp4,
but there is some empirical evidence to support the use of these algorithms
when the stochastic assumption holds. It is not known whether these bounds
are tight. A short implementation sketch of both index policies follows.
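The sketch is written in Python and assumes bounded rewards supplied by a user-provided function `rewards_fn(arm, t)`; the function names, the default discount factor, window length and exploration constant $\alpha$ are illustrative choices, not values prescribed by the text.

```python
import numpy as np
from collections import deque


def discounted_ucb(rewards_fn, n, k, gamma=0.99, alpha=2.0):
    """Discounted UCB: exponentially discounted means and play counts."""
    S = np.zeros(k)  # discounted reward sums
    T = np.zeros(k)  # discounted play counts T_i^gamma(t)
    actions = []
    for t in range(n):
        if t < k:
            a = t  # play each arm once
        else:
            index = S / T + np.sqrt(alpha / T * np.log(T.sum()))
            a = int(np.argmax(index))
        x = rewards_fn(a, t)
        S *= gamma  # discount the past for every arm
        T *= gamma
        S[a] += x
        T[a] += 1.0
        actions.append(a)
    return actions


def sliding_window_ucb(rewards_fn, n, k, tau=200, alpha=2.0):
    """Sliding-Window UCB: statistics computed from the last tau rounds only."""
    window = deque()  # (arm, reward) pairs from the last tau rounds
    actions = []
    for t in range(n):
        if t < k:
            a = t  # play each arm once
        else:
            S = np.zeros(k)
            T = np.zeros(k)
            for arm, x in window:
                S[arm] += x
                T[arm] += 1.0
            T = np.maximum(T, 1e-12)  # arms absent from the window get a huge index
            index = S / T + np.sqrt(alpha / T * np.log(min(t, tau)))
            a = int(np.argmax(index))
        x = rewards_fn(a, t)
        window.append((a, x))
        if len(window) > tau:
            window.popleft()
        actions.append(a)
    return actions
```

Recomputing the window statistics from scratch each round keeps the sketch short; an efficient implementation would maintain them incrementally as rewards enter and leave the window.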
3 An alternative way to model nonstationary stochastic bandits is to assume that
the mean payoffs of the arms are slowly drifting. One way to do this is to assume
that $\mu_i(t)$ follows a reflected Brownian motion in some interval. It is not hard
to see that the regret is necessarily linear in this case, because the best arm
changes in any round with constant probability. The objective in this case is
to understand the magnitude of the linear regret in terms of the size of the
interval or the volatility of the Brownian motion; a small simulation of this model is sketched below.
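This is a minimal sketch assuming means confined to $[0, 1]$; the step size $\sigma$, the seed and the helper name `reflected_brownian_means` are illustrative choices rather than anything from the text.

```python
import numpy as np


def reflected_brownian_means(n, k, sigma=0.01, seed=0):
    """Simulate k mean-reward paths, each a Brownian motion reflected into [0, 1].

    Assumes sigma << 1 so a single reflection at each boundary suffices.
    """
    rng = np.random.default_rng(seed)
    mu = np.empty((n, k))
    mu[0] = rng.uniform(0.0, 1.0, size=k)
    for t in range(1, n):
        step = mu[t - 1] + sigma * rng.standard_normal(k)
        step = np.abs(step)              # reflect at the lower boundary 0
        step = 1.0 - np.abs(1.0 - step)  # reflect at the upper boundary 1
        mu[t] = step
    return mu


# With means like these the identity of argmax_i mu_i(t) keeps changing,
# which is why the regret of any policy must grow linearly with n.
```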
4 Yet another idea is to allow the means to change in an arbitrary way, but
restrict the amount of variation. Let $\mu_t = (\mu_1(t), \ldots, \mu_k(t))$ and
\[
V_n = \sum_{t=1}^{n-1} \|\mu_t - \mu_{t+1}\|_\infty
\]
be the cumulative change in mean rewards measured in terms of the supremum
norm. Then for each $V \in [1/k, n/k]$ there exists a policy such that for all
bandits with $V_n \leq V$ it holds that
\[
R_n \leq C (V k \log(k))^{1/3} n^{2/3}\,. \tag{31.4}
\]
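For concreteness, the variation $V_n$ can be computed directly from any sequence of mean vectors, for instance one produced by the simulation in the previous note; the helper below simply transcribes the definition and is not part of the original text.

```python
import numpy as np


def total_variation(mu):
    """V_n = sum_{t=1}^{n-1} ||mu_t - mu_{t+1}||_inf for an (n, k) array of mean vectors."""
    return float(np.abs(np.diff(mu, axis=0)).max(axis=1).sum())
```

Plugging a constant variation budget into (31.4), for example, already gives regret of order $n^{2/3}$, in contrast to the $\sqrt{n}$ rates available for stationary bandits.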
This bound is nearly tight in a minimax sense. The lower bound is obtained
by partitioning $[n]$ into $m$ parts, where in each part all arms have equal means
except for the optimal arm, which is better by $\Delta = c\sqrt{mk/n}$ for a universal
constant $c \in \mathbb{R}$. The usual argument shows that the total regret is $\Omega(\sqrt{kmn})$