Bandit Algorithms

32.3 Regret analysis 371

Proof of Theorem 32.2 The first step in the proof is an upper bound on the
expected number of clicks in the optimal lista∗. Fix timet, blockPtdand recall
thatItd∗ =minPtdis the most attractive item inPtd. Letk=A−t^1 (Itd∗) be the
position of itemItd∗ andσbe the permutation that exchanges itemskandItd∗.
By Lemma 32.6, on the eventFtcwe haveItd∗ ≤k. From Parts (c) and (d) of
Assumption 32.1 we havev(At,k)≥v(σ◦At,k)≥v(a∗,k). Hence on the event
Ftcthe expected number of clicks onI∗tdis bounded from below by those on items
ina∗,

Et− 1

[

CtI∗td

]

=

∑

k∈Itd

Pt− 1 (A−t^1 (Itd∗) =k)Et− 1 [v(At,k)|A−t^1 (I∗td) =k]

=

1

|Itd|

∑

k∈Itd

Et− 1 [v(At,k)|A−t^1 (Itd∗) =k]≥

1

|Itd|

∑

k∈Itd

v(a∗,k),

where we also used the fact that TopRank randomizes within each block to
guarantee thatPt− 1 (A−t^1 (I∗td) =k) = 1/|Itd|for anyk∈Itd. Using this and the
design of TopRank,
∑m

k=1

v(a∗,k) =

∑Mt

d=1

∑

k∈Itd

v(a∗,k)≤

∑Mt

d=1

|Itd|Et− 1

[

CtItd∗

]

.

Therefore under eventFtcthe conditional expected regret in roundtis bounded
by

∑m

k=1

v(a∗,k)−Et− 1





∑`

j=1

Ctj



≤Et− 1





∑Mt

d=1

|Ptd|CtItd∗−

∑`

j=1

Ctj





=Et− 1





∑Mt

d=1

∑

j∈Ptd

(CtI∗td−Ctj)





=

∑Mt

d=1

∑

j∈Ptd

Et− 1 [UtI∗tdj]

≤

∑`

j=1

min{∑m,j− 1 }

i=1

Et− 1 [Utij]. (32.4)

The last inequality follows by noting thatEt− 1 [UtItd∗j]≤

∑min{m,j− 1 }
i=1 Et−^1 [Utij].
To see this use part (a) of Lemma 32.3 to show thatEt− 1 [Utij]≥0 fori < jand
Lemma 32.6 to show that whenI∗td> m, then neitherItd∗ norjare not shown to
the user in roundtso thatUtI∗tdj= 0. Substituting the bound in Eq. (32.4) into
the regret leads to

Rn≤nmP(Fn) +

∑`

j=1

min{∑m,j− 1 }

i=1

E[I{Fnc}Snij], (32.5)

where we used the fact that the maximum number of clicks overnrounds is

Bandit Algorithms

[

]

=

∑

=

1

∑

1

∑

∑

[

]

.





∑`







∑`









∑





=

∑

≤

∑`

∑`

Get our desktop app

Company

Features

Documentation

Resources