Bandit Algorithms

12.5 Exercises 166

(f)Combining the previous steps, show that there exists a universal constant C >0 such that for anyδ∈(0,1), for an appropriate choice ofη,γandβ, with probability at least 1−δit holds that the random regretRˆnof Exp3.P satisfies Rˆn≤C

√

nklog(k/δ)

(g) In which step did you use the modified estimators?
(h)Show a bound where the algorithm parametersη,γ,βcan only depend on
n,k, but not onδ.
(i)Compare the bounds with the analogous bounds for Exp3-IX in Theorem 12.1.

12.2 (Generic Exp3.P analysis) This exercise is concerned with a
generalization of the core idea underlying Exp3.P of the previous exercise in that
rather than giving explicit expressions for the biased loss estimates, we focus
on the key properties required by the analysis of Exp3.P. To reduce clutter we
assume for the remainder thattranges in [n] anda∈[k]. Let (Ω,F,F,P) be a
filtered probability space withF= (Ft)nt=0. Let (Zt), (Zˆt), (Z ̃t), (βt) be sequences
of random elements inRk, whereZ ̃t=Zˆt−βtand (Zt),(βt) areF-predictable,
whereas (Zˆt) and therefore also (Z ̃t) areF-adapted. You should think ofZˆtas
the estimate ofZtthat uses randomization, andβtis the bias as in the previous
exercise. Given positive constantηdefine the probability vectorPt∈Pk− 1 by

Pta=

exp

(

−η

∑t− 1 s=1Z ̃sa

)

∑k b=1exp

(

−η

∑t− 1 s=1Z ̃sb

).

LetEt− 1 [·] =E[·|Ft− 1 ]. Assume the following hold for alla∈[k]:

(a) η|Zˆta|≤ 1 , (b) ηβta≤ 1 , (c) ηEt− 1 [Zˆta^2 ]≤βtaalmost surely, (d) Et− 1 [Zˆta] =Ztaalmost surely.

LetA∗= argmina∈[k]

∑n t=1ZtaandRn=

∑n

t=1

∑k

a=1

Pta(Zta−ZtA∗).

(a) Show that

∑n

t=1

∑k

a=1

Pta(Zta−ZtA∗)

=

∑n

t=1

∑k

a=1

Pta(Z ̃ta−Z ̃tA∗) ︸︷︷︸ (A)

+

∑n

t=1

∑k

a=1

Pta(Zta−Z ̃ta) ︸︷︷︸ (B)

+

∑n

t=1

(Z ̃tA∗−ZtA∗) ︸︷︷︸ (C)

.

(b) Show that

(A)≤

log(k) η

+η

∑n

t=1

∑k

a=1

PtaZˆta^2 + 3

∑n

t=1

∑k

a=1

Ptaβta.

Bandit Algorithms

√

(

)

(

).

=

+

+

.

(A)≤

Get our desktop app

Company

Features

Documentation

Resources