model an adversary secretly chooses a sequence of sets $S_1, \ldots, S_n$ with $S_t \subseteq [\ell]$.
In each round $t$ the learner chooses $A_t \in \mathcal{A}$ and receives a reward $X_t(A_t)$, where $X_t : \mathcal{A} \to [0,1]$ is given by $X_t(a) = \mathbb{I}\{S_t \cap \{a(1), \ldots, a(m)\} \neq \emptyset\}$. The feedback is the position of the first clicked item, which is $M_t = \min\{k \in [m] : A_t(k) \in S_t\}$.
The regret is
$$R_n = \sum_{t=1}^n \bigl(X_t(a^*) - X_t(A_t)\bigr)\,, \qquad (32.6)$$
where $a^* = \operatorname{argmax}_{a \in \mathcal{A}} \sum_{t=1}^n X_t(a)$ is the optimal ranking in hindsight. Notice that this is the same as the cascade model when $S_t = \{i : Z_{ti} = 1\}$.
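For concreteness, here is a minimal Python sketch (ours, not from the text) of a single round of this model; the function name ranked_bandit_round and its interface are illustrative assumptions. Items are integers and a ranking is the tuple $(a(1), \ldots, a(m))$.

```python
def ranked_bandit_round(ranking, clicked_set):
    """ranking: tuple (a(1), ..., a(m)) of item indices; clicked_set: S_t.

    Returns the binary reward X_t(A_t) and the feedback M_t, the position of
    the first displayed item in S_t (or None if no displayed item is clicked).
    """
    # reward is 1 if any displayed item belongs to S_t
    reward = int(any(item in clicked_set for item in ranking))
    # M_t = min{k in [m] : A_t(k) in S_t}, with positions counted from 1
    feedback = next((k for k, item in enumerate(ranking, start=1)
                     if item in clicked_set), None)
    return reward, feedback

# Example: m = 3 positions, S_t = {2, 4}; the click occurs at position 2
print(ranked_bandit_round((3, 2, 5), {2, 4}))  # -> (1, 2)
```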
5 A challenge in the ranked bandit model is that solving the offline problem (Eq. 32.6) for known $S_1, \ldots, S_n$ is NP-hard. How can one learn when finding an optimal solution to the offline problem is hard? First, hardness only matters if $|\mathcal{A}|$ is large. When $\ell$ and $m$ are not too large, exhaustive search is quite feasible. If this is not an option, one may use an approximation algorithm. It turns out that in a certain sense the best one can do is to use a greedy algorithm. We omit the details, but the highlight is that there exist efficient algorithms such that
$$\mathbb{E}\left[\sum_{t=1}^n X_t(A_t)\right] \geq \left(1 - \frac{1}{e}\right) \max_{a \in \mathcal{A}} \sum_{t=1}^n X_t(a) - O\bigl(m \sqrt{n \ell \log(\ell)}\bigr)\,.$$
See the article by Radlinski et al. [2008] for more details.
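To make the greedy idea concrete, the sketch below (ours, not the algorithm of Radlinski et al.; the name greedy_offline is an assumption) runs the standard greedy maximum-coverage heuristic on the offline problem with known $S_1, \ldots, S_n$. Since the reward in Eq. (32.6) depends only on which $m$ items are displayed, the offline problem amounts to choosing $m$ items that hit as many of the sets $S_t$ as possible, and greedy selection attains the $(1 - 1/e)$ factor for such coverage problems.

```python
def greedy_offline(sets, ell, m):
    """sets: list of sets S_1, ..., S_n over items 0..ell-1.

    Returns m items chosen greedily to cover as many rounds as possible."""
    uncovered = list(range(len(sets)))   # rounds not yet covered
    chosen = []
    for _ in range(m):
        candidates = [i for i in range(ell) if i not in chosen]
        # pick the item covering the most still-uncovered rounds
        best = max(candidates,
                   key=lambda i: sum(i in sets[t] for t in uncovered))
        chosen.append(best)
        uncovered = [t for t in uncovered if best not in sets[t]]
    return chosen

# Example: n = 4 rounds, ell = 4 items, m = 2 positions
print(greedy_offline([{0, 1}, {1}, {2}, {3}], ell=4, m=2))  # -> [1, 2]
```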
6 By modifying the reward function one can also define an adversarial variant of the document-based model. As in the previous note, the adversary secretly chooses $S_1, \ldots, S_n$ as subsets of $[\ell]$, but now the reward is $X_t(a) = |S_t \cap \{a(1), \ldots, a(m)\}|$. The feedback is the set of clicked items, $S_t \cap \{A_t(1), \ldots, A_t(m)\}$ (equivalently, the positions at which clicks occurred). For this model there are no computational issues. In fact, the problem can be analyzed
using a reduction to combinatorial semibandits, which we ask you to investigate
in Exercise 32.3.
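As a hint at the reduction, the following sketch (ours, not the solution to the exercise; the name semibandit_encoding is illustrative) encodes one round of this model in semibandit form: the action is the 0/1 indicator vector of the displayed items, the adversary's hidden vector is the indicator of $S_t$, the reward is their inner product, and the feedback reveals exactly the coordinates where both are one.

```python
import numpy as np

def semibandit_encoding(ranking, clicked_set, ell):
    """ranking: displayed items; clicked_set: S_t; ell: number of items."""
    action = np.zeros(ell, dtype=int)
    action[list(ranking)] = 1                  # which items are displayed
    reward_vec = np.zeros(ell, dtype=int)
    reward_vec[list(clicked_set)] = 1          # adversary's hidden vector
    reward = int(action @ reward_vec)          # equals |S_t ∩ displayed|
    feedback = {i for i in ranking if i in clicked_set}  # observed clicks
    return action, reward, feedback

# Example: ell = 6 items, items 3, 0, 4 displayed, S_t = {0, 1, 4}
print(semibandit_encoding((3, 0, 4), {0, 1, 4}, ell=6))  # reward 2
```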
7 The position-based model can also be given an adversarial formulation by letting $S_{tk} \subseteq [\ell]$ for each $t \in [n]$ and $k \in [m]$ and defining the reward by $X_t(a) = \sum_{k=1}^m \mathbb{I}\{a(k) \in S_{tk}\}$. Again, the feedback is the positions of the clicked items, $\{k \in [m] : A_t(k) \in S_{tk}\}$.
This model can also be tackled using algorithms for combinatorial semibandits
(Exercise 32.4).
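A similar encoding works here (again ours and illustrative only, not the exercise's solution): actions become 0/1 item-by-position matrices, the adversary's choices become a matrix marking which item would be clicked at which position, the reward is linear in the action, and the feedback is the set of clicked positions.

```python
import numpy as np

def position_based_round(ranking, clicked_sets, ell):
    """ranking: tuple (a(1), ..., a(m)); clicked_sets: list [S_t1, ..., S_tm]."""
    m = len(ranking)
    action = np.zeros((ell, m), dtype=int)
    for k, item in enumerate(ranking):
        action[item, k] = 1                    # item placed in position k+1
    adversary = np.zeros((ell, m), dtype=int)
    for k, s in enumerate(clicked_sets):
        adversary[list(s), k] = 1              # clicks available at position k+1
    reward = int((action * adversary).sum())   # = sum_k I{a(k) in S_tk}
    clicked_positions = {k + 1 for k, item in enumerate(ranking)
                         if item in clicked_sets[k]}
    return reward, clicked_positions

# Example: ell = 4 items, m = 3 positions; clicks at positions 1 and 3
print(position_based_round((2, 0, 1), [{2}, {1}, {1, 2}], ell=4))  # (2, {1, 3})
```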