Bandit Algorithms

(Jeff_L) #1
7.4 Exercises 110

(d)Suppose thatν= (νi)ki=1is a bandit whereSupp(νi)⊂[0,b] and the variance
of theith arm isσ^2 i. Design a policy that depends onb, but notσ^2 isuch that


Rn≤C


i:∆i> 0

(


∆i+

(


b+

σi^2
∆i

)


log(n)

)


,


whereC >0 is a universal constant.

If you did things correctly, then the policy you derived in Exercise 7.6
should resemble UCB-V by Audibert et al. [2007]. The proof of the empirical
Bernstein also appears there or in the papers by Mnih et al. [2008] and
Maurer and Pontil [2009].

7.7(Median-of-means and bandits with finite variance) Letn∈N+and
(Ai)mi=1be a partition of [n] so that∪mi=1Ai= [n] andAi∩Aj=∅for alli 6 =j.
Suppose thatδ∈(0,1) andX 1 ,X 2 ,...,Xnis a sequence of independent random
variables with meanμand varianceσ^2. Themedian-of-means estimatorˆμM
ofμis the median ofμˆ 1 ,μˆ 2 ,...,μˆmwhereμˆi=



t∈AiXt/|Ai|is the mean of
the data in theith block.

(a)Show that ifm=



min

{


n
2 ,8 log

(


e^1 /^8
δ

)}⌋


andAiare chosen as equally sized
as possible, then

P


(


μˆM+


192 σ^2
n

log

(


e^1 /^8
δ

)


≤μ

)


≤δ.

(b)Use the median-of-means estimator to design an upper confidence bound
algorithm such that for allν∈EVk(σ^2 )


Rn≤C


i:∆i> 0

(


∆i+

σ^2 log(n)
∆i

)


,


whereC >0 is a universal constant.
Free download pdf