6.4 Exercises
is $\sqrt{2/m}$-subgaussian. You should only use the definitions and the interaction protocol, which states that

(a) $\mathbb{P}(A_t \in \cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}) = \pi(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$ a.s.
(b) $\mathbb{P}(X_t \in \cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}, A_t) = P_{A_t}(\cdot)$ a.s.
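The claimed constant follows from two standard subgaussian facts; a worked sketch, assuming (as in the ETC analysis) that the quantity in question is a difference of two independent empirical means, each over $m$ independent 1-subgaussian samples:

```latex
% Facts: if X is \sigma_X-subgaussian and Y is an independent
% \sigma_Y-subgaussian variable, then X + Y is
% \sqrt{\sigma_X^2+\sigma_Y^2}-subgaussian, and cX is |c|\sigma_X-subgaussian.
\[
\hat\mu_i = \frac{1}{m}\sum_{s=1}^{m} X_{i,s}
\quad\text{is}\quad
\frac{1}{m}\sqrt{m}=\frac{1}{\sqrt{m}}\text{-subgaussian},
\]
\[
\hat\mu_1 - \hat\mu_2
\quad\text{is}\quad
\sqrt{\frac{1}{m}+\frac{1}{m}} = \sqrt{2/m}\text{-subgaussian}.
\]
```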
6.2 (Minimax regret) Show that Eq. (6.6) implies that the regret of an optimally tuned ETC for subgaussian 2-armed bandits satisfies $R_n \le \Delta + C\sqrt{n}$, where $C > 0$ is a universal constant.
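As a numeric sanity check of the $\sqrt{n}$ scaling, the sketch below assumes the pre-tuning ETC bound has the form $R_n \le m\Delta + (n-2m)\Delta e^{-m\Delta^2/4}$ (this form and its constants are assumptions here, not quoted from Eq. (6.6)), minimizes it over integer $m$ for each gap, and checks that the worst case over a grid of gaps grows like $\sqrt{n}$:

```python
import math

def etc_bound(n, delta, m):
    """Assumed ETC regret bound: exploration cost plus exploitation cost."""
    if 2 * m > n:
        return float("inf")
    return m * delta + (n - 2 * m) * delta * math.exp(-m * delta**2 / 4)

def tuned_worst_case(n):
    """Max over a grid of gaps of the bound minimized over integer m."""
    worst = 0.0
    for i in range(1, 200):
        delta = 5 * i / 200  # gaps in (0, 5]
        best = min(etc_bound(n, delta, m) for m in range(1, n // 2 + 1))
        worst = max(worst, best)
    return worst

ratios = {}
for n in (500, 2000, 8000):
    ratios[n] = tuned_worst_case(n) / math.sqrt(n)
    print(n, round(ratios[n], 2))  # the ratio stays bounded as n grows
```

The grid maximum sits near $\Delta \approx 2/\sqrt{n}$, which is where the tuned bound is largest.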
6.3 (High probability bounds i) Fix $\delta \in (0,1)$. Modify the ETC algorithm to depend on $\delta$ and prove a bound on the pseudo-regret $\bar{R}_n = n\mu^* - \sum_{t=1}^n \mu_{A_t}$ of ETC that holds with probability $1 - \delta$.
6.4 (High probability bounds ii) Fix $\delta \in (0,1)$. Prove a bound on the random regret $\hat{R}_n = n\mu^* - \sum_{t=1}^n X_t$ of ETC that holds with probability $1 - \delta$. Compare this to the bound derived for the pseudo-regret in the previous exercise. What can you conclude?
6.5 (Adaptive commitment times) Suppose that ETC interacts with a 2-armed 1-subgaussian bandit with means $\mu_1, \mu_2 \in \mathbb{R}$ and $\Delta = |\mu_1 - \mu_2|$.
(a) Find a choice of $m$ that depends only on the horizon $n$ and not on $\Delta$ such that the regret of Algorithm 1 is bounded by
\[ R_n \le \Delta + C n^{2/3}, \]
where $C > 0$ is a universal constant.
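A minimal simulation sketch of the horizon-only tuning in part (a): $m = \lceil n^{2/3} \rceil$ is the choice the bound suggests, while the Gaussian reward distributions and the specific means below are illustrative assumptions:

```python
import math
import random

def etc(n, m, means, rng):
    """Explore-then-commit on a 2-armed Gaussian bandit.

    Plays each arm m times, then commits to the arm with the higher
    empirical sum.  Returns the pseudo-regret n*mu_star - sum_t mu_{A_t}.
    """
    sums = [0.0, 0.0]
    for _ in range(m):
        for a in (0, 1):
            sums[a] += rng.gauss(means[a], 1.0)
    best = max(means)
    commit = 0 if sums[0] >= sums[1] else 1
    explored = m * ((best - means[0]) + (best - means[1]))
    return explored + (n - 2 * m) * (best - means[commit])

rng = random.Random(0)
n = 10_000
m = math.ceil(n ** (2 / 3))  # horizon-only choice from part (a)
regret = etc(n, m, means=(0.5, 0.0), rng=rng)
print(m, round(regret, 1))  # pseudo-regret on the order of m * Delta
```

With a fixed seed the commitment is to the better arm, so the regret is dominated by the exploration phase, roughly $m\Delta \le n^{2/3}$ here.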
(b) Now suppose the commitment time is allowed to be data-dependent, which means the algorithm explores each arm alternately until some condition is met and then commits to a single arm for the remainder. Design a condition such that the regret of the resulting algorithm can be bounded by
\[ R_n \le \Delta + \frac{C \log n}{\Delta}\,, \tag{6.7} \]
where $C$ is a universal constant. Your condition should only depend on the observed rewards and the time horizon. It should not depend on $\mu_1$, $\mu_2$ or $\Delta$.
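One natural shape for such a condition, shown purely as an illustrative sketch (the threshold $\sqrt{8\log(n)/s}$ and its constant are assumptions, not the unique answer): keep alternating until the empirical gap exceeds a confidence width that depends only on the per-arm sample count and the horizon.

```python
import math
import random

def adaptive_etc(n, means, rng):
    """Alternate between the two arms until the empirical gap exceeds a
    data-independent confidence width, then commit.  Returns pseudo-regret.
    """
    sums, counts = [0.0, 0.0], [0, 0]
    best = max(means)
    regret = 0.0
    t = 0
    while t < n:
        a = t % 2  # alternate arms during exploration
        sums[a] += rng.gauss(means[a], 1.0)
        counts[a] += 1
        regret += best - means[a]
        t += 1
        if counts[0] == counts[1]:
            s = counts[0]
            gap = abs(sums[0] / s - sums[1] / s)
            if gap > math.sqrt(8 * math.log(n) / s):  # assumed condition
                break
    commit = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
    regret += (n - t) * (best - means[commit])
    return regret

rng = random.Random(1)
regret = adaptive_etc(10_000, means=(0.5, 0.0), rng=rng)
print(round(regret, 1))  # far below the worst case n * Delta
```

The stopping time scales like $\log(n)/\Delta^2$ samples per arm, so the exploration regret scales like $\log(n)/\Delta$, matching the shape of (6.7).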
(c) Show that any algorithm for which (6.7) holds also satisfies $R_n \le \Delta + C\sqrt{n \log(n)}$ for suitably chosen universal constant $C > 0$.
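The implication in (c) follows from a case split on the size of $\Delta$, combined with the trivial bound $R_n \le n\Delta$; a sketch, where $\sqrt{\log(n)/n}$ is the natural balancing threshold:

```latex
% Case split on the gap, using the trivial bound R_n <= n*Delta:
\[
R_n \;\le\;
\begin{cases}
n\Delta \le \sqrt{n\log n}
  & \text{if } \Delta \le \sqrt{\log(n)/n},\\[4pt]
\Delta + \dfrac{C\log n}{\Delta} \le \Delta + C\sqrt{n\log n}
  & \text{if } \Delta > \sqrt{\log(n)/n}.
\end{cases}
\]
```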
(d) As for (b), but now the objective is to design a condition such that the regret of the resulting algorithm is bounded by
\[ R_n \le \Delta + \frac{C \log\max\{e, n\Delta^2\}}{\Delta}\,. \tag{6.8} \]
(e) Show that any algorithm for which (6.8) holds also satisfies $R_n \le \Delta + C\sqrt{n}$ for suitably chosen universal constant $C > 0$.