Bandit Algorithms

(Jeff_L) #1
32.3 Regret analysis 371

Proof of Theorem 32.2 The first step in the proof is an upper bound on the
expected number of clicks in the optimal lista∗. Fix timet, blockPtdand recall
thatItd∗ =minPtdis the most attractive item inPtd. Letk=A−t^1 (Itd∗) be the
position of itemItd∗ andσbe the permutation that exchanges itemskandItd∗.
By Lemma 32.6, on the eventFtcwe haveItd∗ ≤k. From Parts (c) and (d) of
Assumption 32.1 we havev(At,k)≥v(σ◦At,k)≥v(a∗,k). Hence on the event
Ftcthe expected number of clicks onI∗tdis bounded from below by those on items
ina∗,


Et− 1

[


CtI∗td

]


=



k∈Itd

Pt− 1 (A−t^1 (Itd∗) =k)Et− 1 [v(At,k)|A−t^1 (I∗td) =k]

=


1


|Itd|


k∈Itd

Et− 1 [v(At,k)|A−t^1 (Itd∗) =k]≥

1


|Itd|


k∈Itd

v(a∗,k),

where we also used the fact that TopRank randomizes within each block to
guarantee thatPt− 1 (A−t^1 (I∗td) =k) = 1/|Itd|for anyk∈Itd. Using this and the
design of TopRank,
∑m


k=1

v(a∗,k) =

∑Mt

d=1


k∈Itd

v(a∗,k)≤

∑Mt

d=1

|Itd|Et− 1

[


CtItd∗

]


.


Therefore under eventFtcthe conditional expected regret in roundtis bounded
by


∑m

k=1

v(a∗,k)−Et− 1



∑`


j=1

Ctj


≤Et− 1



∑Mt

d=1

|Ptd|CtItd∗−

∑`


j=1

Ctj



=Et− 1



∑Mt

d=1


j∈Ptd

(CtI∗td−Ctj)



=


∑Mt

d=1


j∈Ptd

Et− 1 [UtI∗tdj]


∑`


j=1

min{∑m,j− 1 }

i=1

Et− 1 [Utij]. (32.4)

The last inequality follows by noting thatEt− 1 [UtItd∗j]≤


∑min{m,j− 1 }
i=1 Et−^1 [Utij].
To see this use part (a) of Lemma 32.3 to show thatEt− 1 [Utij]≥0 fori < jand
Lemma 32.6 to show that whenI∗td> m, then neitherItd∗ norjare not shown to
the user in roundtso thatUtI∗tdj= 0. Substituting the bound in Eq. (32.4) into
the regret leads to


Rn≤nmP(Fn) +

∑`


j=1

min{∑m,j− 1 }

i=1

E[I{Fnc}Snij], (32.5)

where we used the fact that the maximum number of clicks overnrounds is

Free download pdf