Bandit Algorithms

18.4 Regret analysis

expected regret of Exp4 defined in Algorithm 11 after $n$ rounds. Then,
$$R_n \le \sqrt{2nk\log(M)}\,. \qquad (18.7)$$
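The constant in Eq. (18.7) comes from tuning the learning rate. The proof below establishes $R_n \le \log(M)/\eta + \eta nk/2$; balancing the two terms gives
$$\eta = \sqrt{\frac{2\log(M)}{nk}} \quad\Longrightarrow\quad R_n \le 2\sqrt{\frac{nk\log(M)}{2}} = \sqrt{2nk\log(M)}\,.$$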
After translating the notation, the proof of the following lemma can be extracted
from the analysis of Exp3 in the proof of Theorem 11.2 (Exercise 18.2).

Lemma 18.2. For any $m^* \in [M]$, it holds that
$$\sum_{t=1}^n \tilde X_{tm^*} - \sum_{t=1}^n \sum_{m=1}^M Q_{tm}\tilde X_{tm} \le \frac{\log(M)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \sum_{m=1}^M Q_{tm}\big(1-\tilde X_{tm}\big)^2\,.$$

Proof of Theorem 18.1. Let $\mathcal F_t = \sigma\big(E^{(1)}, A_1, E^{(2)}, A_2, \ldots, A_{t-1}, E^{(t)}\big)$ and abbreviate $\mathbb E_t[\cdot] = \mathbb E[\cdot \mid \mathcal F_t]$. Let $m^*$ be the index of the best performing expert in hindsight:
$$m^* = \operatorname*{argmax}_{m \in [M]} \sum_{t=1}^n E^{(t)}_m x_t\,, \qquad (18.8)$$

which is not random by the assumption that the experts are oblivious. Applying Lemma 18.2 shows that
$$\sum_{t=1}^n \tilde X_{tm^*} - \sum_{t=1}^n \sum_{m=1}^M Q_{tm}\tilde X_{tm} \le \frac{\log(M)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \sum_{m=1}^M Q_{tm}\big(1-\tilde X_{tm}\big)^2\,. \qquad (18.9)$$

When $\gamma = 0$ the estimator $\hat X_{ti}$ is unbiased, so that $\mathbb E_t[\hat X_t] = x_t$ and
$$\mathbb E_t\big[\tilde X_t\big] = \mathbb E_t\big[E^{(t)} \hat X_t\big] = E^{(t)}\,\mathbb E_t\big[\hat X_t\big] = E^{(t)} x_t\,. \qquad (18.10)$$
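The unbiasedness claim in Eq. (18.10) can be verified numerically. The sketch below uses toy values for $x_t$, $P_t$ and $E^{(t)}$ (all chosen arbitrarily for illustration) and computes the expectation over $A_t \sim P_t$ exactly by enumeration, using the $\gamma = 0$ estimator $\hat X_{ti} = 1 - \mathbb I\{A_t = i\}(1 - x_{tA_t})/P_{ti}$:

```python
import numpy as np

# Toy values, chosen arbitrarily for illustration: k = 3 arms, M = 2 experts.
x = np.array([0.2, 0.5, 0.8])        # true rewards x_t
P = np.array([0.3, 0.4, 0.3])        # action distribution P_t
E = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.2, 0.7]])      # expert recommendation matrix E^(t)

def X_hat(a):
    """Importance-weighted reward estimate when A_t = a (gamma = 0):
    X_hat_i = 1 - I{a = i} * (1 - x_a) / P_i."""
    est = np.ones_like(x)
    est[a] = 1.0 - (1.0 - x[a]) / P[a]
    return est

# Exact expectations over A_t ~ P_t, by enumeration rather than sampling.
EX_hat = sum(P[a] * X_hat(a) for a in range(len(x)))
EX_tilde = sum(P[a] * (E @ X_hat(a)) for a in range(len(x)))

assert np.allclose(EX_hat, x)        # E_t[X_hat_t] = x_t
assert np.allclose(EX_tilde, E @ x)  # E_t[X_tilde_t] = E^(t) x_t, as in (18.10)
```

Because the action set is finite, no Monte Carlo sampling is needed: the expectation is a weighted sum over the $k$ possible actions.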

Taking the expectation of both sides of Eq. (18.9) and using the tower rule for conditional expectation and the fact that $Q_t$ is $\mathcal F_t$-measurable leads to
$$R_n \le \frac{\log(M)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \sum_{m=1}^M \mathbb E\Big[Q_{tm}\big(1-\tilde X_{tm}\big)^2\Big]\,. \qquad (18.11)$$


Like in Chapter 11, it is more convenient to work with losses. Let $\hat Y_{ti} = 1 - \hat X_{ti}$, $y_{ti} = 1 - x_{ti}$ and $\tilde Y_{tm} = 1 - \tilde X_{tm}$. Note that $\tilde Y_t = E^{(t)} \hat Y_t$ and recall the notation $A_{ti} = \mathbb I\{A_t = i\}$, which means that $\hat Y_{ti} = A_{ti} y_{ti} / P_{ti}$ and
$$\mathbb E_t\big[\tilde Y_{tm}^2\big] = \mathbb E_t\left[\left(\frac{E^{(t)}_{mA_t}\, y_{tA_t}}{P_{tA_t}}\right)^{\!2}\right] = \sum_{i=1}^k \frac{\big(E^{(t)}_{mi} y_{ti}\big)^2}{P_{ti}} \le \sum_{i=1}^k \frac{E^{(t)}_{mi}}{P_{ti}}\,, \qquad (18.12)$$
where the inequality holds because $E^{(t)}_{mi}$ and $y_{ti}$ both lie in $[0,1]$, so $\big(E^{(t)}_{mi} y_{ti}\big)^2 \le E^{(t)}_{mi}$.
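Each step of Eq. (18.12) can be checked numerically. The sketch below draws arbitrary random values for $E^{(t)}_m$, $y_t$ and $P_t$, computes $\mathbb E_t[\tilde Y_{tm}^2]$ by enumerating $A_t \sim P_t$, and compares it with the closed form and the upper bound:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4                                 # number of arms (arbitrary)
Em = rng.dirichlet(np.ones(k))        # expert m's recommendation row E^(t)_m
y = rng.uniform(0.0, 1.0, k)          # losses y_t in [0, 1]^k
P = rng.dirichlet(np.ones(k))         # action distribution P_t

# Given A_t = a, Y_tilde_tm = sum_i E_mi * A_ti * y_ti / P_ti = E_ma * y_a / P_a,
# so the conditional second moment is an exact finite sum over a ~ P.
second_moment = sum(P[a] * (Em[a] * y[a] / P[a]) ** 2 for a in range(k))

closed_form = np.sum((Em * y) ** 2 / P)   # middle expression of (18.12)
upper_bound = np.sum(Em / P)              # right-hand side of (18.12)

assert np.isclose(second_moment, closed_form)
assert second_moment <= upper_bound
```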


Therefore, using the definition of $P_{ti}$,
$$\mathbb E\left[\sum_{m=1}^M Q_{tm}\big(1-\tilde X_{tm}\big)^2\right] \le \mathbb E\left[\sum_{m=1}^M Q_{tm} \sum_{i=1}^k \frac{E^{(t)}_{mi}}{P_{ti}}\right] = \mathbb E\left[\sum_{i=1}^k \frac{\sum_{m=1}^M Q_{tm} E^{(t)}_{mi}}{P_{ti}}\right] = k\,,$$
where the last equality uses $P_{ti} = \sum_{m=1}^M Q_{tm} E^{(t)}_{mi}$, so that each of the $k$ ratios equals one.
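The quantities appearing in the proof map directly onto an implementation. The following is a minimal sketch of the Exp4 loop analyzed above, assuming $\gamma = 0$; the names `experts_seq` and `reward_fn` are illustrative, not from the text:

```python
import numpy as np

def exp4(experts_seq, reward_fn, n, eta, rng):
    """Minimal Exp4 sketch with gamma = 0.

    experts_seq(t) returns the M x k matrix E^(t) whose rows are the experts'
    probability vectors over the k arms; reward_fn(t, a) returns x_{t,a} in [0, 1].
    """
    M, k = experts_seq(0).shape
    Q = np.full(M, 1.0 / M)            # exponential-weights distribution over experts
    for t in range(n):
        E = experts_seq(t)
        P = Q @ E                      # P_ti = sum_m Q_tm E^(t)_mi
        a = rng.choice(k, p=P)         # A_t ~ P_t
        x = reward_fn(t, a)
        X_hat = np.ones(k)             # X_hat_i = 1 - I{A_t = i}(1 - x)/P_i
        X_hat[a] -= (1.0 - x) / P[a]
        X_tilde = E @ X_hat            # propagate the estimate to the experts
        Q = Q * np.exp(eta * X_tilde)  # exponential-weights update
        Q /= Q.sum()                   # renormalize to a probability vector
    return Q

# Toy instance: expert 0 always plays the best arm, expert 1 the second arm,
# expert 2 plays uniformly; rewards are deterministic.
E = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1/3, 1/3, 1/3]])
x = np.array([0.9, 0.2, 0.1])
n, M, k = 2000, 3, 3
eta = np.sqrt(2 * np.log(M) / (n * k))   # the tuning used in Theorem 18.1
Q = exp4(lambda t: E, lambda t, a: x[a], n, eta, np.random.default_rng(0))
```

With this tuning the final weight vector concentrates on expert 0, the best expert in hindsight. Note that $\gamma > 0$ in the estimator denominator would guard against large negative estimates when $P_{ti}$ is small; it is omitted here to match the setting of the theorem.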