7.4 Exercises 110
(d)Suppose thatν= (νi)ki=1is a bandit whereSupp(νi)⊂[0,b] and the variance
of theith arm isσ^2 i. Design a policy that depends onb, but notσ^2 isuch that
Rn≤C
∑
i:∆i> 0
(
∆i+
(
b+
σi^2
∆i
)
log(n)
)
,
whereC >0 is a universal constant.
If you did things correctly, then the policy you derived in Exercise 7.6
should resemble UCB-V by Audibert et al. [2007]. The proof of the empirical
Bernstein also appears there or in the papers by Mnih et al. [2008] and
Maurer and Pontil [2009].
7.7(Median-of-means and bandits with finite variance) Letn∈N+and
(Ai)mi=1be a partition of [n] so that∪mi=1Ai= [n] andAi∩Aj=∅for alli 6 =j.
Suppose thatδ∈(0,1) andX 1 ,X 2 ,...,Xnis a sequence of independent random
variables with meanμand varianceσ^2. Themedian-of-means estimatorˆμM
ofμis the median ofμˆ 1 ,μˆ 2 ,...,μˆmwhereμˆi=
∑
t∈AiXt/|Ai|is the mean of
the data in theith block.
(a)Show that ifm=
⌊
min
{
n
2 ,8 log
(
e^1 /^8
δ
)}⌋
andAiare chosen as equally sized
as possible, then
P
(
μˆM+
√
192 σ^2
n
log
(
e^1 /^8
δ
)
≤μ
)
≤δ.
(b)Use the median-of-means estimator to design an upper confidence bound
algorithm such that for allν∈EVk(σ^2 )
Rn≤C
∑
i:∆i> 0
(
∆i+
σ^2 log(n)
∆i
)
,
whereC >0 is a universal constant.