Bandit Algorithms

31.2 Stochastic bandits 357

If the locations of the change points were known then running a new copy of UCB on each interval would lead to a bound of

Rn(μ) =O

(

mk ∆min

log

(n m

))

,

where ∆minis the smallest suboptimality gap over allmblocks. In the last section
we saw that the bound achieved by an omniscient policy that knows when the
changes occur can be achieved by a policy that does not. Unfortunately this is
not true here.

Theorem31.2. Letk= 2and suppose thatμi(t) =μiis constant for both
arms and∆ =μ 1 −μ 2 > 0. Letπbe a policy such that its expected regretRn(μ)
on banditμsatisfiesRn(μ) =o(nα)with 0 < α≤ 1. Then for all sufficiently
largenthere exists a nonstationary banditμ′with at most two change points
andmint∈[n]|μ′ 1 (t)−μ′ 2 (t)|≥∆such thatRn(μ′)≥cn/Rn(μ), wherec > 0 is a
constant that depends only onα > 0.

The theorem shows that if a policy enjoysRn(μ) =o(n^1 /^2 ) for any nontrivial
(stationary) bandit, then its minimax regret is at least ω(n^1 /^2 ) on some
nonstationary bandit. In particular, ifRn(μ) =O(log(n)), then the minimax
regret is at least Ω(n/log(n)). This dashes our hopes for a policy that is much
better than Exp4 in a stochastic setting. There are algorithms designed for
nonstationary bandits in the stochastic setting with abrupt change points as
described above. Those that come with theoretical guarantees are based on
forgetting or discounting data so that decisions of the algorithm depend almost
entirely on recent data. In the notes we discuss these approaches along with
alternative models for nonstationarity.

Proof of Theorem 31.2 Let (Sj)Lj=1be a partition of [n] to be specified later. LetPandE[·] denote the probabilities and expectations with respect to the bandit determined byμandP′with respect to alternative nonstationary bandit μ′to be defined shortly. By the pigeonhole principle there exists aj∈[L] such that

E





∑

t∈Sj

I{At= 2}



≤E[T^2 (n)] L

.

Define an alternative nonstationary bandit withμ′(t) =μexcept fort∈Sj
when we letμ′ 2 (t) =μ 2 +εwhereε=

√

2 L/E[T 2 (n)]. Then by Lemma 15.1 and
Theorem 14.2,

P





∑

t∈Sj

I{At= 2}≥

|Sj| 2



+P′





∑

t∈Sj

I{At= 2}<

|Sj| 2



≥^1

2

exp (−D(P,P′))

≥

1

2

exp

(

−

E[T 2 (n)]ε^2 2 L

)

≥

1

2 e

.

Bandit Algorithms

(

))

,

E





∑



.

√

P





∑



+P′





∑



≥^1

2

≥

1

2

(

−

)

≥

1

.

Get our desktop app

Company

Features

Documentation

Resources