6.4 Exercises
is $\sqrt{2/m}$-subgaussian. You should only use the definitions and the interaction protocol, which states that

(a) $\mathbb{P}(A_t \in \cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}) = \pi(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$ a.s.
(b) $\mathbb{P}(X_t \in \cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}, A_t) = P_{A_t}(\cdot)$ a.s.
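The claimed constant follows from two standard subgaussian facts; a worked sketch, assuming (as in the ETC analysis) that the quantity in question is a difference of two independent empirical means, each over $m$ independent 1-subgaussian samples:

```latex
% Facts: if X is \sigma_X-subgaussian and Y is an independent
% \sigma_Y-subgaussian variable, then X + Y is
% \sqrt{\sigma_X^2+\sigma_Y^2}-subgaussian, and cX is |c|\sigma_X-subgaussian.
\[
\hat\mu_i = \frac{1}{m}\sum_{s=1}^{m} X_{i,s}
\quad\text{is}\quad
\frac{1}{m}\sqrt{m}=\frac{1}{\sqrt{m}}\text{-subgaussian},
\]
\[
\hat\mu_1 - \hat\mu_2
\quad\text{is}\quad
\sqrt{\frac{1}{m}+\frac{1}{m}} = \sqrt{2/m}\text{-subgaussian}.
\]
```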
6.2 (Minimax regret) Show that Eq. (6.6) implies that the regret of an optimally tuned ETC for subgaussian 2-armed bandits satisfies $R_n \le \Delta + C\sqrt{n}$, where $C > 0$ is a universal constant.
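As a numeric sanity check of the $\sqrt{n}$ scaling, the sketch below assumes the pre-tuning ETC bound has the form $R_n \le m\Delta + (n-2m)\Delta e^{-m\Delta^2/4}$ (this form and its constants are assumptions here, not quoted from Eq. (6.6)), minimizes it over integer $m$ for each gap, and checks that the worst case over a grid of gaps grows like $\sqrt{n}$:

```python
import math

def etc_bound(n, delta, m):
    """Assumed ETC regret bound: exploration cost plus exploitation cost."""
    if 2 * m > n:
        return float("inf")
    return m * delta + (n - 2 * m) * delta * math.exp(-m * delta**2 / 4)

def tuned_worst_case(n):
    """Max over a grid of gaps of the bound minimized over integer m."""
    worst = 0.0
    for i in range(1, 200):
        delta = 5 * i / 200  # gaps in (0, 5]
        best = min(etc_bound(n, delta, m) for m in range(1, n // 2 + 1))
        worst = max(worst, best)
    return worst

ratios = {}
for n in (500, 2000, 8000):
    ratios[n] = tuned_worst_case(n) / math.sqrt(n)
    print(n, round(ratios[n], 2))  # the ratio stays bounded as n grows
```

The grid maximum sits near $\Delta \approx 2/\sqrt{n}$, which is where the tuned bound is largest.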
6.3 (High probability bounds i) Fix $\delta \in (0,1)$. Modify the ETC algorithm to depend on $\delta$ and prove a bound on the pseudo-regret $\bar{R}_n = n\mu^* - \sum_{t=1}^n \mu_{A_t}$ of ETC that holds with probability $1 - \delta$.
6.4 (High probability bounds ii) Fix $\delta \in (0,1)$. Prove a bound on the random regret $\hat{R}_n = n\mu^* - \sum_{t=1}^n X_t$ of ETC that holds with probability $1 - \delta$. Compare this to the bound derived for the pseudo-regret in the previous exercise. What can you conclude?
6.5 (Adaptive commitment times) Suppose that ETC interacts with a 2-armed 1-subgaussian bandit with means $\mu_1, \mu_2 \in \mathbb{R}$ and $\Delta = |\mu_1 - \mu_2|$.
(a) Find a choice of $m$ that depends only on the horizon $n$ and not on $\Delta$ such that the regret of Algorithm 1 is bounded by
\[ R_n \le \Delta + C n^{2/3}, \]
where $C > 0$ is a universal constant.
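A minimal simulation sketch of the horizon-only tuning in part (a): $m = \lceil n^{2/3} \rceil$ is the choice the bound suggests, while the Gaussian reward distributions and the specific means below are illustrative assumptions:

```python
import math
import random

def etc(n, m, means, rng):
    """Explore-then-commit on a 2-armed Gaussian bandit.

    Plays each arm m times, then commits to the arm with the higher
    empirical sum.  Returns the pseudo-regret n*mu_star - sum_t mu_{A_t}.
    """
    sums = [0.0, 0.0]
    for _ in range(m):
        for a in (0, 1):
            sums[a] += rng.gauss(means[a], 1.0)
    best = max(means)
    commit = 0 if sums[0] >= sums[1] else 1
    explored = m * ((best - means[0]) + (best - means[1]))
    return explored + (n - 2 * m) * (best - means[commit])

rng = random.Random(0)
n = 10_000
m = math.ceil(n ** (2 / 3))  # horizon-only choice from part (a)
regret = etc(n, m, means=(0.5, 0.0), rng=rng)
print(m, round(regret, 1))  # pseudo-regret on the order of m * Delta
```

With a fixed seed the commitment is to the better arm, so the regret is dominated by the exploration phase, roughly $m\Delta \le n^{2/3}$ here.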
(b) Now suppose the commitment time is allowed to be data-dependent, which means the algorithm explores each arm alternately until some condition is met and then commits to a single arm for the remainder. Design a condition such that the regret of the resulting algorithm can be bounded by
\[ R_n \le \Delta + \frac{C \log n}{\Delta}\,, \tag{6.7} \]
where $C$ is a universal constant. Your condition should only depend on the observed rewards and the time horizon. It should not depend on $\mu_1$, $\mu_2$ or $\Delta$.
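One natural shape for such a condition, shown purely as an illustrative sketch (the threshold $\sqrt{8\log(n)/s}$ and its constant are assumptions, not the unique answer): keep alternating until the empirical gap exceeds a confidence width that depends only on the per-arm sample count and the horizon.

```python
import math
import random

def adaptive_etc(n, means, rng):
    """Alternate between the two arms until the empirical gap exceeds a
    data-independent confidence width, then commit.  Returns pseudo-regret.
    """
    sums, counts = [0.0, 0.0], [0, 0]
    best = max(means)
    regret = 0.0
    t = 0
    while t < n:
        a = t % 2  # alternate arms during exploration
        sums[a] += rng.gauss(means[a], 1.0)
        counts[a] += 1
        regret += best - means[a]
        t += 1
        if counts[0] == counts[1]:
            s = counts[0]
            gap = abs(sums[0] / s - sums[1] / s)
            if gap > math.sqrt(8 * math.log(n) / s):  # assumed condition
                break
    commit = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
    regret += (n - t) * (best - means[commit])
    return regret

rng = random.Random(1)
regret = adaptive_etc(10_000, means=(0.5, 0.0), rng=rng)
print(round(regret, 1))  # far below the worst case n * Delta
```

The stopping time scales like $\log(n)/\Delta^2$ samples per arm, so the exploration regret scales like $\log(n)/\Delta$, matching the shape of (6.7).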
(c) Show that any algorithm for which (6.7) holds also satisfies $R_n \le \Delta + C\sqrt{n \log(n)}$ for suitably chosen universal constant $C > 0$.
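The implication in (c) follows from a case split on the size of $\Delta$, combined with the trivial bound $R_n \le n\Delta$; a sketch, where $\sqrt{\log(n)/n}$ is the natural balancing threshold:

```latex
% Case split on the gap, using the trivial bound R_n <= n*Delta:
\[
R_n \;\le\;
\begin{cases}
n\Delta \le \sqrt{n\log n}
  & \text{if } \Delta \le \sqrt{\log(n)/n},\\[4pt]
\Delta + \dfrac{C\log n}{\Delta} \le \Delta + C\sqrt{n\log n}
  & \text{if } \Delta > \sqrt{\log(n)/n}.
\end{cases}
\]
```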
(d) As for (b), but now the objective is to design a condition such that the regret of the resulting algorithm is bounded by
\[ R_n \le \Delta + \frac{C \log\max\{e, n\Delta^2\}}{\Delta}\,. \tag{6.8} \]
(e) Show that any algorithm for which (6.8) holds also satisfies $R_n \le \Delta + C\sqrt{n}$ for suitably chosen universal constant $C > 0$.