32.4 Notes
model an adversary secretly chooses a sequence of sets $S_1,\ldots,S_n$ with $S_t \subseteq [\ell]$.
In each round $t$ the learner chooses $A_t \in \mathcal{A}$ and receives a reward $X_t(A_t)$, where
$X_t : \mathcal{A} \to [0,1]$ is given by $X_t(a) = \mathbb{I}\{S_t \cap \{a(1),\ldots,a(m)\} \neq \emptyset\}$. The feedback
is the position of the clicked item, which is $M_t = \min\{k \in [m] : A_t(k) \in S_t\}$.
The regret is
$$
R_n = \sum_{t=1}^n \big(X_t(a^*) - X_t(A_t)\big)\,,
$$
where $a^*$ is the optimal ranking in hindsight:
$$
a^* = \operatorname{argmax}_{a \in \mathcal{A}} \sum_{t=1}^n X_t(a)\,. \tag{32.6}
$$
Notice that this is the same as the cascade model when $S_t = \{i : Z_{ti} = 1\}$.
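A minimal Python sketch of one round of this adversarial model (the function name and variable names are illustrative placeholders, not notation from the text):

```python
def reward_and_feedback(action, S_t):
    """action: tuple of m distinct items from [ell] (the ranking shown);
    S_t: the set of items the adversary marked as attractive in round t."""
    clicked_positions = [k for k, item in enumerate(action, start=1) if item in S_t]
    # X_t(a) = I{S_t intersects the shown items}
    reward = 1 if clicked_positions else 0
    # M_t: position of the clicked item (None if nothing is clicked)
    feedback = min(clicked_positions) if clicked_positions else None
    return reward, feedback

# Example: the learner shows items (3, 7, 1) and the adversary chose S_t = {7, 9}.
print(reward_and_feedback((3, 7, 1), {7, 9}))  # (1, 2): reward 1, click at position 2
```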
5 A challenge in the ranked bandit model is that solving the offline problem (Eq. 32.6) for known $S_1,\ldots,S_n$ is NP-hard. How can one learn when finding an optimal solution to the offline problem is hard? First, hardness only matters if $|\mathcal{A}|$ is large. When $\ell$ and $m$ are not too large, exhaustive search is quite feasible. If this is not an option, one may use an approximation algorithm. It turns out that in a certain sense the best one can do is to use a greedy algorithm. We omit the details, but the highlight is that there exist efficient algorithms such that
$$
\mathbb{E}\left[\sum_{t=1}^n X_t(A_t)\right] \geq \left(1 - \frac{1}{e}\right) \max_{a \in \mathcal{A}} \sum_{t=1}^n X_t(a) - O\left(m\sqrt{n\ell\log(\ell)}\right)\,.
$$
See the article by Radlinski et al. [2008] for more details.
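To see where the $(1-1/e)$ factor comes from, note that the offline problem (32.6) is a maximum-coverage problem: choose $m$ items so that as many of the sets $S_1,\ldots,S_n$ as possible are hit. The sketch below shows the classical greedy routine for this offline problem; it is only an illustration under that reading, and the online algorithms of Radlinski et al. [2008] instead run a separate bandit per position rather than calling such a routine.

```python
def greedy_offline(sets, ell, m):
    """sets: list of the sets S_1,...,S_n (subsets of range(ell));
    returns m items approximately maximising the number of sets hit,
    with the usual (1 - 1/e) maximum-coverage guarantee."""
    uncovered = list(sets)
    chosen = []
    for _ in range(m):
        # pick the item hitting the largest number of still-uncovered sets
        best = max((i for i in range(ell) if i not in chosen),
                   key=lambda i: sum(i in S for S in uncovered))
        chosen.append(best)
        uncovered = [S for S in uncovered if best not in S]
    return chosen

# Example with ell = 5 items and three adversarial sets.
print(greedy_offline([{0, 1}, {1, 2}, {3}], ell=5, m=2))  # [1, 3]
```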
6 By modifying the reward function one can also define an adversarial variant of the document-based model. As in the previous note, the adversary secretly chooses $S_1,\ldots,S_n$ as subsets of $[\ell]$, but now the reward is
$$
X_t(a) = |S_t \cap \{a(1),\ldots,a(m)\}|\,.
$$
The feedback is the positions of the clicked items, $\{k \in [m] : A_t(k) \in S_t\}$. For this model there are no computational issues. In fact, the problem can be analyzed using a reduction to combinatorial semibandits, which we ask you to investigate in Exercise 32.3.
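The key structural point behind the reduction is that the reward decomposes into a sum of per-position indicators, each of which is observed, which is exactly semibandit feedback. A small illustrative sketch (names are placeholders; the reduction itself is the subject of Exercise 32.3):

```python
def document_based_round(action, S_t):
    """action: tuple of m distinct items; S_t: set of attractive items."""
    per_position = [1 if item in S_t else 0 for item in action]  # each indicator observed
    reward = sum(per_position)          # X_t(a) = |S_t ∩ {a(1),...,a(m)}|
    clicked_positions = {k + 1 for k, c in enumerate(per_position) if c}
    return reward, clicked_positions

print(document_based_round((3, 7, 1), {1, 7, 9}))  # (2, {2, 3})
```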
7 The position-based model can also be modelled in the adversarial setting by letting $S_{tk} \subset [\ell]$ for each $t \in [n]$ and $k \in [m]$, and defining the reward by
$$
X_t(a) = \sum_{k=1}^m \mathbb{I}\{a(k) \in S_{tk}\}\,.
$$
Again, the feedback is the positions of the clicked items, $\{k \in [m] : A_t(k) \in S_{tk}\}$.
This model can also be tackled using algorithms for combinatorial semibandits
(Exercise 32.4).
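For comparison with the previous sketch, here is the same kind of illustration for the position-based variant, where each position $k$ has its own adversarial set (again, the names are placeholders only):

```python
def position_based_round(action, S_tk):
    """action: tuple of m items; S_tk: list of m sets, one per position."""
    per_position = [1 if item in S else 0 for item, S in zip(action, S_tk)]
    reward = sum(per_position)          # X_t(a) = sum_k I{a(k) in S_{tk}}
    clicked_positions = {k + 1 for k, c in enumerate(per_position) if c}
    return reward, clicked_positions

print(position_based_round((3, 7, 1), [{3, 5}, {2}, {1}]))  # (2, {1, 3})
```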