18.4 Regret analysis

Theorem 18.1. Let $\gamma = 0$ and $\eta = \sqrt{2\log(M)/(nk)}$, and let $R_n$ be the expected regret of Exp4 defined in Algorithm 11 after $n$ rounds. Then,
$$R_n \le \sqrt{2nk\log(M)}\,. \qquad (18.7)$$
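For concreteness, here is a minimal sketch of the Exp4 update that the theorem analyzes, specialized to $\gamma = 0$. The interface (`exp4`, `advice`, `reward`) is a hypothetical stand-in rather than notation from the text; the estimator line implements $\hat X_{ti} = 1 - \mathbb I\{A_t = i\}(1 - X_t)/P_{ti}$.

```python
import numpy as np

def exp4(n, k, M, advice, reward, eta, seed=0):
    """Sketch of the Exp4 update with gamma = 0 (hypothetical interface).

    advice(t)    -> (M, k) array E^(t); row m is expert m's distribution over arms.
    reward(t, a) -> reward x_{t,a} in [0, 1] of playing arm a in round t.
    """
    rng = np.random.default_rng(seed)
    Q = np.full(M, 1.0 / M)           # Q_1 is uniform over the M experts
    for t in range(n):
        E = advice(t)                 # expert advice matrix E^(t)
        P = Q @ E                     # P_ti = sum_m Q_tm E^(t)_mi
        A = rng.choice(k, p=P)        # sample action A_t ~ P_t
        x = reward(t, A)              # observe X_t = x_{t,A_t}
        Xhat = np.ones(k)             # Xhat_ti = 1 - I{A_t = i}(1 - X_t)/P_ti
        Xhat[A] -= (1.0 - x) / P[A]
        Xtilde = E @ Xhat             # Xtilde_tm = <E^(t)_m, Xhat_t>
        Q = Q * np.exp(eta * Xtilde)  # exponential-weights update
        Q /= Q.sum()
    return Q
```

Running this with $\eta = \sqrt{2\log(M)/(nk)}$ matches the parameter choice covered by the theorem.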
After translating the notation, the proof of the following lemma can be extracted from the analysis of Exp3 in the proof of Theorem 11.2 (Exercise 18.2).

Lemma 18.2. For any $m^* \in [M]$, it holds that
$$\sum_{t=1}^n \tilde X_{tm^*} - \sum_{t=1}^n \sum_{m=1}^M Q_{tm}\tilde X_{tm} \le \frac{\log(M)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \sum_{m=1}^M Q_{tm}\left(1 - \tilde X_{tm}\right)^2\,.$$

Proof of Theorem 18.1. Let $\mathcal F_t = \sigma(E^{(1)}, A_1, E^{(2)}, A_2, \ldots, A_{t-1}, E^{(t)})$ and abbreviate $\mathbb E_t[\cdot] = \mathbb E[\cdot \mid \mathcal F_t]$. Let $m^*$ be the index of the best-performing expert in hindsight:
$$m^* = \operatorname*{argmax}_{m \in [M]} \sum_{t=1}^n E^{(t)}_m x_t\,, \qquad (18.8)$$
which is not random by the assumption that the experts are oblivious. Applying Lemma 18.2 shows that
$$\sum_{t=1}^n \tilde X_{tm^*} - \sum_{t=1}^n \sum_{m=1}^M Q_{tm}\tilde X_{tm} \le \frac{\log(M)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \sum_{m=1}^M Q_{tm}\left(1 - \tilde X_{tm}\right)^2\,. \qquad (18.9)$$
When $\gamma = 0$ the estimator $\hat X_{ti}$ is unbiased, so that $\mathbb E_t[\hat X_t] = x_t$ and
$$\mathbb E_t[\tilde X_t] = \mathbb E_t[E^{(t)}\hat X_t] = E^{(t)}\mathbb E_t[\hat X_t] = E^{(t)} x_t\,. \qquad (18.10)$$
Taking the expectation of both sides of Eq. (18.9) and using the tower rule for conditional expectation and the fact that $Q_t$ is $\mathcal F_t$-measurable leads to
$$R_n \le \frac{\log(M)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \sum_{m=1}^M \mathbb E\left[Q_{tm}\left(1 - \tilde X_{tm}\right)^2\right]. \qquad (18.11)$$
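The unbiasedness step in Eq. (18.10) is easy to check by simulation. The sketch below draws $A_t \sim P_t$ many times for a fixed advice matrix and reward vector (the random values are purely illustrative) and compares the empirical mean of $\tilde X_t = E^{(t)}\hat X_t$ with $E^{(t)} x_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
k, M, trials = 5, 4, 200_000
E = rng.dirichlet(np.ones(k), size=M)  # illustrative advice matrix E^(t)
Q = rng.dirichlet(np.ones(M))          # illustrative expert distribution Q_t
P = Q @ E                              # arm distribution P_t
x = rng.uniform(size=k)                # illustrative reward vector x_t

A = rng.choice(k, p=P, size=trials)    # draws of A_t ~ P_t
Xhat = np.ones((trials, k))            # Xhat_ti = 1 - I{A_t = i}(1 - x_ti)/P_ti
Xhat[np.arange(trials), A] -= (1.0 - x[A]) / P[A]
Xtilde = Xhat @ E.T                    # rows are Xtilde_t = E^(t) Xhat_t
print(np.abs(Xtilde.mean(axis=0) - E @ x).max())  # ~0: E_t[Xtilde_t] = E^(t) x_t
```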
Like in Chapter 11, it is more convenient to work with losses. Let $\hat Y_{ti} = 1 - \hat X_{ti}$, $y_{ti} = 1 - x_{ti}$ and $\tilde Y_{tm} = 1 - \tilde X_{tm}$. Note that $\tilde Y_t = E^{(t)}\hat Y_t$ and recall the notation $A_{ti} = \mathbb I\{A_t = i\}$, which means that $\hat Y_{ti} = \frac{A_{ti}}{P_{ti}} y_{ti}$ and
$$\mathbb E_t\left[\tilde Y_{tm}^2\right] = \mathbb E_t\left[\left(\frac{E^{(t)}_{mA_t} y_{tA_t}}{P_{tA_t}}\right)^2\right] = \sum_{i=1}^k \frac{\left(E^{(t)}_{mi} y_{ti}\right)^2}{P_{ti}} \le \sum_{i=1}^k \frac{E^{(t)}_{mi}}{P_{ti}}\,, \qquad (18.12)$$
where the inequality uses that $y_{ti}$ and $E^{(t)}_{mi}$ both lie in $[0,1]$, so $(E^{(t)}_{mi} y_{ti})^2 \le E^{(t)}_{mi}$.
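The middle equality in Eq. (18.12) evaluates the expectation over $A_t \sim P_t$ in closed form, which can be sanity-checked numerically with illustrative random values:

```python
import numpy as np

rng = np.random.default_rng(2)
k, M = 5, 4
E = rng.dirichlet(np.ones(k), size=M)  # illustrative advice matrix E^(t)
Q = rng.dirichlet(np.ones(M))
P = Q @ E                              # P_t with gamma = 0
y = rng.uniform(size=k)                # illustrative loss vector y_t = 1 - x_t

m = 0                                  # any fixed expert index
second_moment = np.sum((E[m] * y) ** 2 / P)  # closed form of E_t[Ytilde_tm^2]
upper_bound = np.sum(E[m] / P)               # right-hand side of (18.12)
print(second_moment <= upper_bound)          # True
```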
Therefore, using Eq. (18.12) and the definition of $P_{ti}$,
$$\mathbb E\left[\sum_{m=1}^M Q_{tm}\left(1 - \tilde X_{tm}\right)^2\right] \le \mathbb E\left[\sum_{m=1}^M Q_{tm}\sum_{i=1}^k \frac{E^{(t)}_{mi}}{P_{ti}}\right] = \mathbb E\left[\sum_{i=1}^k \frac{\sum_{m=1}^M Q_{tm} E^{(t)}_{mi}}{P_{ti}}\right] = k\,,$$
where the last equality used $P_{ti} = \sum_{m=1}^M Q_{tm} E^{(t)}_{mi}$. Substituting this bound into Eq. (18.11) shows that $R_n \le \log(M)/\eta + \eta nk/2$, and the choice $\eta = \sqrt{2\log(M)/(nk)}$ gives Eq. (18.7).
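The final telescoping to $k$ rests only on the identity $P_{ti} = \sum_{m=1}^M Q_{tm} E^{(t)}_{mi}$, as the following short check with arbitrary (hypothetical) $Q_t$ and $E^{(t)}$ confirms:

```python
import numpy as np

rng = np.random.default_rng(3)
k, M = 5, 4
E = rng.dirichlet(np.ones(k), size=M)  # illustrative advice matrix E^(t)
Q = rng.dirichlet(np.ones(M))          # illustrative expert distribution Q_t
P = Q @ E                              # definition of P_t
print(np.sum((Q @ E) / P))             # sum_i P_ti / P_ti = k (prints 5.0)
```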