6.4 Exercises

Hint For (a) start from $R_n \le m\Delta + n\Delta\exp(-m\Delta^2/2)$ and assume that the
second term here dominates the first. Find the $\Delta$ maximizing the resulting regret
upper bound. Based on this, propose a choice of $m$ and verify that the starting assumption
is met by your choice regardless of the values of $n$ and $\Delta$. For (b) think
about the simplest stopping policy and then make it robust by using confidence
intervals. Tune the failure probability. For (c) note that the regret can never be
larger than $n\Delta$.
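
To make the first step of the hint concrete, here is a minimal numerical sketch (an illustration, not the exercise's solution): for a fixed $m$, elementary calculus shows the term $n\Delta\exp(-m\Delta^2/2)$ is maximized at $\Delta = 1/\sqrt{m}$, and a grid search agrees. The values of n and m below are arbitrary examples.

import numpy as np

# Minimal numerical sketch, not the exercise's solution: for a fixed m,
# find the gap Delta maximizing n * Delta * exp(-m * Delta^2 / 2).
# Calculus gives the maximizer Delta = 1/sqrt(m); the grid search agrees.
n, m = 10_000, 100                 # arbitrary example values
deltas = np.linspace(1e-4, 1.0, 200_000)
second_term = n * deltas * np.exp(-m * deltas ** 2 / 2)
best = deltas[np.argmax(second_term)]
print(best, 1 / np.sqrt(m))        # both approximately 0.1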
6.6 (Doubling trick) Let $\mathcal{E}$ be an arbitrary set of bandits. Suppose you are
given a policy $A$ designed for $\mathcal{E}$ that accepts the horizon $n$ as a parameter and
has a regret guarantee of

$R_n \le f_n(\nu)\,, \quad \forall \nu \in \mathcal{E}\,,$

where $f_n : \mathcal{E} \to [0,\infty)$ is a sequence of functions. The purpose of this exercise is
to analyze a meta-algorithm based on the so-called doubling trick that converts
a policy depending on the horizon into a policy with similar guarantees that does
not. Let $n_1 < n_2 < n_3 < \cdots$ be a fixed sequence of integers and consider the
policy that runs $A$ with horizon $n_1$ until round $t = \min\{n, n_1\}$, then restarts
the algorithm with horizon $n_2$ until $t = \min\{n, n_1 + n_2\}$, then restarts again
with horizon $n_3$ until $t = \min\{n, n_1 + n_2 + n_3\}$, and so on. Note that $t$ is the real
time counter and is not reset on each restart.
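
The following is a minimal Python sketch of the meta-algorithm under stated assumptions: make_policy(horizon) is a hypothetical factory returning a fresh instance of $A$ tuned to the given horizon, and env.step(arm) is a hypothetical environment returning a reward; neither interface comes from the book.

def doubling_trick(make_policy, env, n, horizons):
    """Run A with horizons n_1 < n_2 < ..., restarting A at each boundary.
    The real time counter t is shared across restarts and never reset."""
    t = 0
    for n_l in horizons:             # e.g. horizons = (2 ** l for l in range(60))
        policy = make_policy(n_l)    # restart A with the next horizon
        for _ in range(n_l):
            if t >= n:               # the true horizon n has been reached
                return
            arm = policy.select_arm()
            reward = env.step(arm)
            policy.update(arm, reward)
            t += 1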


(a) Let $\ell_{\max} = \min\{\ell : \sum_{i=1}^{\ell} n_i \ge n\}$. Prove that the regret of the meta-algorithm
is at most

$R_n \le \sum_{\ell=1}^{\ell_{\max}} f_{n_\ell}(\nu)\,.$

(b) Suppose that $f_n(\nu) \le \sqrt{n}$. Show that if $n_\ell = 2^{\ell-1}$, then the regret of the
meta-algorithm is at most

$R_n \le C\sqrt{n}\,,$

where $C > 0$ is a carefully chosen universal constant. (One way to extract such a
constant is sketched after this exercise.)
(c) Suppose that $f_n(\nu) = g(\nu)\log(n)$ for some function $g : \mathcal{E} \to [0,\infty)$. What is
the regret of the meta-algorithm if $n_\ell = 2^{\ell-1}$? Can you find a better choice
of $(n_\ell)_\ell$?
(d)In light of this idea, should we bother trying to design algorithms that do not
depend on the horizon? Are there any disadvantages to using the doubling
trick? If so, what are they?
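
The following is a hedged sketch of the calculation behind part (b), assuming the bound from part (a); the constant obtained is one valid choice, not necessarily the intended one. With $n_\ell = 2^{\ell-1}$ and $f_n(\nu) \le \sqrt{n}$,

$R_n \le \sum_{\ell=1}^{\ell_{\max}} \sqrt{2^{\ell-1}}
      = \frac{(\sqrt{2})^{\ell_{\max}} - 1}{\sqrt{2} - 1}
      \le \frac{\sqrt{2^{\ell_{\max}}}}{\sqrt{2} - 1}\,.$

By the definition of $\ell_{\max}$, $\sum_{i=1}^{\ell_{\max}-1} 2^{i-1} = 2^{\ell_{\max}-1} - 1 < n$, so $2^{\ell_{\max}} \le 2n$ and therefore $R_n \le \sqrt{2n}/(\sqrt{2}-1) = (2+\sqrt{2})\sqrt{n}$. Hence $C = 2 + \sqrt{2}$ suffices.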


6.7 (ε-greedy) For this exercise assume the rewards are 1-subgaussian and
there are $k \ge 2$ arms. The ε-greedy algorithm depends on a sequence of
parameters $\varepsilon_1, \varepsilon_2, \ldots$. First it chooses each arm once and subsequently chooses
$A_t = \operatorname{argmax}_i \hat{\mu}_i(t-1)$ with probability $1 - \varepsilon_t$, and otherwise chooses an arm
uniformly at random.
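
Here is a minimal Python sketch of the algorithm as described, assuming a hypothetical env.step(arm) that returns a 1-subgaussian reward and a caller-supplied schedule eps(t); these interfaces are illustrative and not from the book.

import numpy as np

def epsilon_greedy(env, k, n, eps, seed=0):
    """eps: callable mapping round t to the exploration probability eps_t."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(k, dtype=int)   # T_i(t): number of plays of arm i
    means = np.zeros(k)               # empirical means mu_hat_i(t)
    for t in range(1, n + 1):
        if t <= k:
            arm = t - 1                    # first choose each arm once
        elif rng.random() < eps(t):
            arm = int(rng.integers(k))     # explore an arm uniformly at random
        else:
            arm = int(np.argmax(means))    # exploit the empirical best arm
        reward = env.step(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean

One common choice in the literature is a schedule with $\varepsilon_t$ proportional to $1/t$, though the exercise leaves the tuning open.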