Bandit Algorithms

7.4 Exercises 110

(d)Suppose thatν= (νi)ki=1is a bandit whereSupp(νi)⊂[0,b] and the variance
of theith arm isσ^2 i. Design a policy that depends onb, but notσ^2 isuch that

Rn≤C

∑

i:∆i> 0

(

∆i+

(

b+

σi^2 ∆i

)

log(n)

)

,

whereC >0 is a universal constant.

If you did things correctly, then the policy you derived in Exercise 7.6 should resemble UCB-V by Audibert et al. [2007]. The proof of the empirical Bernstein also appears there or in the papers by Mnih et al. [2008] and Maurer and Pontil [2009].

7.7(Median-of-means and bandits with finite variance) Letn∈N+and
(Ai)mi=1be a partition of [n] so that∪mi=1Ai= [n] andAi∩Aj=∅for alli 6 =j.
Suppose thatδ∈(0,1) andX 1 ,X 2 ,...,Xnis a sequence of independent random
variables with meanμand varianceσ^2. Themedian-of-means estimatorˆμM
ofμis the median ofμˆ 1 ,μˆ 2 ,...,μˆmwhereμˆi=

∑

t∈AiXt/|Ai|is the mean of the data in theith block.

(a)Show that ifm=

⌊

min

{

n 2 ,8 log

(

e^1 /^8 δ

)}⌋

andAiare chosen as equally sized as possible, then

P

(

μˆM+

√

192 σ^2 n

log

(

e^1 /^8 δ

)

≤μ

)

≤δ.

(b)Use the median-of-means estimator to design an upper confidence bound
algorithm such that for allν∈EVk(σ^2 )

Rn≤C

∑

i:∆i> 0

(

∆i+

σ^2 log(n) ∆i

)

,

whereC >0 is a universal constant.

Bandit Algorithms

∑

(

(

)

)

,

∑

⌊

{

(

)}⌋

P

(

√

(

)

)

∑

(

)

,

Get our desktop app

Company

Features

Documentation

Resources