11.7 Exercises 156
and define a 2-armed adversarial bandit in terms of its losses by
\[
y_{t1} = \begin{cases} 0 & \text{if } t \le n/2 \\ 1 & \text{otherwise} \end{cases}
\qquad\text{and}\qquad
y_{t2} = \begin{cases} \alpha & \text{if } t \le n/2 \\ 0 & \text{otherwise.} \end{cases}
\]
For simplicity you may assume that $n$ is even.
(a) Define a sequence of real-valued functions $q_0, q_1, \dots, q_n$ on domain $[1/4, 1/2]$ inductively by $q_0(\alpha) = 1/2$ and
\[
q_{s+1}(\alpha) = \frac{q_s(\alpha)\exp(-\eta\alpha/q_s(\alpha))}{1 - q_s(\alpha) + q_s(\alpha)\exp(-\eta\alpha/q_s(\alpha))}\,.
\]
Show for $t \le 1 + n/2$ that $P_{t2} = q_{T_2(t-1)}(\alpha)$, where $T_2(t) = \sum_{s=1}^t \mathbb{1}\{A_s = 2\}$.
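As a sanity check on this identity, the sketch below (not part of the exercise; the function name and the parameter values for $n$, $\alpha$, $\eta$ are illustrative assumptions) runs Exp3 on the losses defined above for $t \le 1 + n/2$ and compares $P_{t2}$ against $q_{T_2(t-1)}(\alpha)$ computed from the recursion:

```python
import math
import random

def check_identity(n=1000, alpha=0.3, eta=0.05, seed=0):
    """Run Exp3 on the two-armed problem above and check P_t2 = q_{T2(t-1)}(alpha)."""
    rng = random.Random(seed)
    # q_0, q_1, ... from the recursion in part (a); clamp to avoid underflow.
    q = [0.5]
    for _ in range(n // 2 + 1):
        prev = q[-1]
        if prev < 1e-12:          # numerically zero: the recursion has collapsed
            q.append(prev)
            continue
        w = prev * math.exp(-eta * alpha / prev)
        q.append(w / (1 - prev + w))
    loss2_hat = 0.0  # importance-weighted cumulative loss estimate for arm 2
    t2 = 0           # T_2(t-1): number of plays of arm 2 so far
    for t in range(1, n // 2 + 2):
        e2 = math.exp(-eta * loss2_hat)
        p2 = e2 / (1 + e2)        # Exp3 plays arm 2 with this probability
        assert abs(p2 - q[t2]) < 1e-9
        if t <= n // 2 and rng.random() < p2:   # arm 2 played, loss alpha
            loss2_hat += alpha / p2             # loss estimate y_t2 / P_t2
            t2 += 1
    return True
```

For $t \le n/2$ arm 1 has zero loss, so arm 1's cumulative loss estimate stays at zero and the Exp3 probability of arm 2 depends only on arm 2's estimate, which is what makes the one-dimensional recursion possible.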
(b) Show that for sufficiently large $n$ there exists an $\alpha \in [1/4, 1/2]$ and $s \in \mathbb{N}$ such that
\[
q_s(\alpha) = \frac{1}{8n} \qquad\text{and}\qquad \sum_{u=1}^{s-1} \frac{1}{q_u(\alpha)} \le \frac{n}{8}\,.
\]
(c) Prove that $P(T_2(n/2) \ge s + 1) \ge 1/65$.
(d) Prove that $P(\hat R_n \ge n/4) \ge (1 - n\exp(-\eta n)/2)/65$.
(e) The previous part shows that the regret is linear with constant probability for sufficiently large $n$. On the other hand, a dubious application of Markov's inequality and Theorem 11.1 shows that
\[
P(\hat R_n \ge n/4) \le \frac{4\,\mathbb{E}[\hat R_n]}{n} = O(n^{-1/2})\,.
\]
Explain the apparent contradiction.
(f) Find a choice of $\alpha \in [1/4, 1/2]$ that satisfies the conditions of the theorem for $n = 10^5$ and some sensible learning rate. Then demonstrate empirically that Exp3 indeed suffers from linear regret with constant probability on this problem.
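A possible starting point for part (f): the sketch below iterates the recursion from part (a) over a grid of $\alpha$ values and reports, for each, the first $s$ with $q_s(\alpha) \le 1/(8n)$ together with $\sum_{u=1}^{s-1} 1/q_u(\alpha)$. The function name, the grid, and the learning rate $\eta = \sqrt{2\log k/(nk)}$ are assumptions, not prescribed by the exercise:

```python
import math

def scan_alpha(n=10**5, grid_size=9):
    """For each alpha on a grid over [1/4, 1/2], iterate the recursion from
    part (a) until q_s <= 1/(8n); record s, q_s, and sum_{u=1}^{s-1} 1/q_u."""
    k = 2
    eta = math.sqrt(2 * math.log(k) / (n * k))  # assumed Exp3 learning rate
    target = 1 / (8 * n)
    rows = []
    for i in range(grid_size):
        alpha = 0.25 + 0.25 * i / (grid_size - 1)
        q, s, inv_sum = 0.5, 0, 0.0
        while q > target and s < n:
            w = q * math.exp(-eta * alpha / q)
            q = w / (1 - q + w)   # q becomes q_{s+1}
            s += 1
            if q > target:
                inv_sum += 1 / q  # accumulates 1/q_u for u = 1, ..., s-1
        rows.append((alpha, s, q, inv_sum))
    return rows
```

Whether the partial sum stays below $n/8$ depends on $\alpha$ and $\eta$; part (b) guarantees only that some $\alpha \in [1/4, 1/2]$ achieves $q_s(\alpha) = 1/(8n)$ exactly, so a finer search (for example bisection around promising grid points) may be needed before running the empirical regret experiment.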
11.6 (Gumbel trick) Let $a_1, \dots, a_k$ be positive real values and $U_1, \dots, U_k$ be a sequence of independent and identically distributed uniform random variables on $[0,1]$. Then let $G_i = -\log(-\log(U_i))$, which follows a standard Gumbel distribution. Prove that
\[
P\left(\log(a_i) + G_i = \max_{j \in [k]} \left(\log(a_j) + G_j\right)\right) = \frac{a_i}{\sum_{j=1}^k a_j}\,.
\]
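The claim is easy to test numerically before proving it. The sketch below (the function name and parameters are illustrative) draws standard Gumbel perturbations via $G_i = -\log(-\log(U_i))$ and compares the empirical argmax frequencies with $a_i / \sum_j a_j$:

```python
import math
import random

def gumbel_argmax_freqs(a, trials=20000, seed=1):
    """Estimate P(i attains max_j log(a_j) + G_j) for each i by Monte Carlo."""
    rng = random.Random(seed)
    k = len(a)
    counts = [0] * k
    for _ in range(trials):
        # G_i = -log(-log(U_i)) with U_i uniform on [0, 1)
        scores = [math.log(a[i]) - math.log(-math.log(rng.random()))
                  for i in range(k)]
        counts[scores.index(max(scores))] += 1
    return [c / trials for c in counts]
```

For example, with $a = (1, 2, 5)$ the frequencies should be close to $(1/8, 1/4, 5/8)$, matching the stated probability.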
11.7 (Exp3 as follow the perturbed leader) Let $(Z_{ti})_{t,i}$ be a collection of independent and identically distributed random variables. The follow-the-perturbed-leader algorithm chooses
\[
A_t = \operatorname*{argmax}_{i \in [k]} \left( Z_{ti} - \eta \sum_{s=1}^{t-1} \hat Y_{si} \right)
\]